Networking
- 1: Using qBittorrent
- 2: Linux bridge soft switching
- 3: Open vSwitch
- 3.1: Open vSwitch installation
- 3.2: Simple soft switching with Open vSwitch
- 3.3: TODO
- 4: NFS network file system
- 5: NFS RDMA support
1 - Using qBittorrent
Choosing a version
Let's quickly go over the basic differences between the qBittorrent installation options:
qBittorrent desktop advantages:
- User-friendly interface: provides an intuitive, easy-to-navigate GUI.
- Clean experience: as open-source software, it ships without ads or bundled extras.
- Feature-rich: includes sequential downloading, bandwidth scheduling, and more.
qBittorrent-nox advantages:
- Optimized for headless systems: designed for minimal resource usage, ideal for constrained systems.
- Web interface: can be operated through a web-based UI.
- Remote management: the Web UI makes it easy to manage servers and remote systems.
Automatic installation
The Debian I'm running is the server edition with no GUI, so I chose to install qbittorrent-nox (the headless build):
sudo apt install qbittorrent-nox
Create a dedicated system user and group for qBittorrent:
sudo adduser --system --group qbittorrent-nox
sudo adduser sky qbittorrent-nox
Create a systemd service file for qBittorrent-nox:
sudo vi /etc/systemd/system/qbittorrent-nox.service
With the following content:
[Unit]
Description=qBittorrent Command Line Client
After=network.target
[Service]
Type=forking
User=qbittorrent-nox
Group=qbittorrent-nox
UMask=007
ExecStart=/usr/bin/qbittorrent-nox -d --webui-port=8080
Restart=on-failure
[Install]
WantedBy=multi-user.target
Reload the systemd daemon:
sudo systemctl daemon-reload
Starting
To start qbittorrent-nox, first prepare the necessary directories:
sudo mkdir /home/qbittorrent-nox
sudo chown qbittorrent-nox:qbittorrent-nox /home/qbittorrent-nox
sudo usermod -d /home/qbittorrent-nox qbittorrent-nox
Start it:
sudo systemctl start qbittorrent-nox
Check the status:
sudo systemctl status qbittorrent-nox
You should see something like:
qbittorrent-nox.service - qBittorrent Command Line Client
Loaded: loaded (/etc/systemd/system/qbittorrent-nox.service; enabled; preset: enabled)
Active: active (running) since Sun 2024-05-05 01:09:48 EDT; 3h 7min ago
Process: 768 ExecStart=/usr/bin/qbittorrent-nox -d --webui-port=8080 (code=exited, status=0/SUCCESS)
Main PID: 779 (qbittorrent-nox)
Tasks: 21 (limit: 9429)
Memory: 6.4G
CPU: 5min 23.810s
CGroup: /system.slice/qbittorrent-nox.service
└─779 /usr/bin/qbittorrent-nox -d --webui-port=8080
May 05 01:09:48 skynas3 systemd[1]: Starting qbittorrent-nox.service - qBittorrent Command Line Client...
May 05 01:09:48 skynas3 systemd[1]: Started qbittorrent-nox.service - qBittorrent Command Line Client.
Enable it to start at boot:
sudo systemctl enable qbittorrent-nox
Administration
Open http://192.168.20.2:8080/ ; the default login is admin / adminadmin.
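If the page does not load, a quick sanity check on the server itself helps confirm the WebUI is actually listening (a minimal sketch, assuming the ss and curl utilities are installed):
# confirm qbittorrent-nox is listening on the WebUI port (8080 here)
ss -tlnp | grep 8080
# fetch the login page locally to rule out firewall issues
curl -I http://127.0.0.1:8080/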
Download
- Default save path: "/mnt/storage2/download"
Connection
- Port used for incoming connections: add a port-forwarding rule for this port on the router
Web UI
- Language: set the user interface language to "简体中文" (Simplified Chinese)
- Change the user password
- Check "Bypass authentication for clients on localhost"
- Check "Bypass authentication for clients in whitelisted IP subnets":
192.168.0.0/24,192.168.192.0/24
Advanced
- Check "Allow multiple connections from the same IP address"
- Check "Always announce to all trackers in a tier"
Manual installation
Installing dependencies
First install the build dependencies:
sudo apt update
sudo apt install build-essential pkg-config automake libtool git libgeoip-dev python3 python3-dev
sudo apt install libboost-dev libboost-system-dev libboost-chrono-dev libboost-random-dev libssl-dev
sudo apt install qtbase5-dev qttools5-dev-tools libqt5svg5-dev zlib1g-dev
Installing libtorrent 1.2.19
Install libtorrent 1.2.19:
wget https://github.com/arvidn/libtorrent/releases/download/v1.2.19/libtorrent-rasterbar-1.2.19.tar.gz
tar xf libtorrent-rasterbar-1.2.19.tar.gz
cd libtorrent-rasterbar-1.2.19
./configure --disable-debug --enable-encryption --with-libgeoip=system CXXFLAGS=-std=c++14
make -j$(nproc)
sudo make install
sudo ldconfig
If you hit this error:
checking whether g++ supports C++17 features with -std=c++17... yes
checking whether the Boost::System library is available... no
configure: error: Boost.System library not found. Try using --with-boost-system=lib
install libboost-system-dev first:
sudo apt install libboost-system-dev
If you hit this error:
checking whether compiling and linking against OpenSSL works... no
configure: error: OpenSSL library not found. Try using --with-openssl=DIR or disabling encryption at all.
install libssl-dev first:
sudo apt install libssl-dev
Installing qBittorrent
I didn't dare use a very recent release, so I stuck with the slightly older 4.3.9 version that the reference documentation uses:
wget https://github.com/qbittorrent/qBittorrent/archive/refs/tags/release-4.3.9.tar.gz
tar xf release-4.3.9.tar.gz
cd qBittorrent-release-4.3.9
./configure --disable-gui --disable-debug
make -j$(nproc)
sudo make install
Enabling start at boot
sudo vi /etc/systemd/system/qbittorrent.service
With the following content:
[Unit]
Description=qBittorrent Daemon Service
After=network.target
[Service]
LimitNOFILE=512000
User=root
ExecStart=/usr/local/bin/qbittorrent-nox
ExecStop=/usr/bin/killall -w qbittorrent-nox
[Install]
WantedBy=multi-user.target
Enable start at boot:
sudo systemctl enable qbittorrent.service
First run
Start it manually the first time:
sudo qbittorrent-nox
*** Legal Notice ***
qBittorrent is a file sharing program. When you run a torrent, its data will be made available to others by means of upload. Any content you share is your sole responsibility.
No further notices will be issued.
Press 'y' key to accept and continue...
Press Y, then exit with Ctrl+C.
Start qbittorrent in the background with systemctl:
sudo systemctl start qbittorrent.service
2 - Linux bridge soft switching
Preparation
Install the NIC driver first, e.g. the cx3 / cx4 / cx5 driver.
Building the soft switch
Creating a Linux bridge
Install the bridge utilities:
sudo apt install bridge-utils -y
Check the current NICs:
$ ip addr
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether bc:24:11:97:c4:00 brd ff:ff:ff:ff:ff:ff
inet 192.168.20.25/24 brd 192.168.20.255 scope global enp6s18
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b8:ce:f6:0b:ff:7c brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b8:ce:f6:0b:ff:7d brd ff:ff:ff:ff:ff:ff
enp6s18 is the outward-facing NIC (WAN); enp1s0f0np0 and enp1s0f1np1 will be used to connect other machines (LAN).
Edit the network interfaces file and remove everything related to enp6s18 / enp1s0f0np0 / enp1s0f1np1:
sudo vi /etc/network/interfaces
What remains should be very little, something like:
source /etc/network/interfaces.d/*
# The loopback network interface
auto lo
iface lo inet loopback
Then create a new br0 file to define the bridge:
sudo vi /etc/network/interfaces.d/br0
Content:
auto br0
iface br0 inet static
address 192.168.20.2
broadcast 192.168.20.255
netmask 255.255.255.0
gateway 192.168.20.1
bridge_ports enp6s18 enp1s0f0np0 enp1s0f1np1
bridge_stp off
bridge_waitport 0
bridge_fd 0
Note that every NIC that should join the bridge must be listed in bridge_ports (both WAN and LAN; they are recognized automatically). Save the file and restart networking:
sudo systemctl restart networking
Or simply reboot the machine. (If the WAN NIC previously had an IP address configured and the new bridge uses the same IP, you must reboot rather than just restarting networking.)
Check the network after the change:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br0 state UP group default qlen 1000
link/ether bc:24:11:97:c4:00 brd ff:ff:ff:ff:ff:ff
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b8:ce:f6:0b:ff:7c brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b8:ce:f6:0b:ff:7d brd ff:ff:ff:ff:ff:ff
6: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 6e:38:a0:77:09:dc brd ff:ff:ff:ff:ff:ff
inet 192.168.20.2/24 brd 192.168.20.255 scope global br0
valid_lft forever preferred_lft forever
Check that the network is reachable both outbound and inbound. If there are no problems, the first step is done: the bridge has been created.
$ brctl show
bridge name bridge id STP enabled interfaces
br0 8000.6e38a07709dc no enp1s0f0np0
enp1s0f1np1
enp6s18
$ bridge link
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100
3: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 1
3: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 hwmode VEB
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 hwmode VEB
With another computer plugged into the other end of the NIC, it obtained an IP address via DHCP automatically, traffic flows in both directions, and speed tests look normal.
And that's the soft switch set up.
Summary and comparison
This setup is much simpler than the one I used before on Ubuntu Server. That one also used a Linux bridge, but required creating a subnet, installing dnsmasq as the DHCP server, enabling kernel forwarding, and adding static routing rules.
This approach needs none of that; creating the simplest possible Linux bridge is enough.
Enabling kernel forwarding
Note: forwarding actually works fine without this setting, but let's add it anyway.
sudo vi /etc/sysctl.conf
Uncomment these two lines to enable the settings:
net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
Apply the change:
sudo sysctl --load /etc/sysctl.conf
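The current values can be checked directly after the reload; both keys should report 1:
sysctl net.ipv4.ip_forward
sysctl net.ipv6.conf.all.forwarding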
Setting the MTU
The default MTU is 1500; to get the most out of the NICs, change it to 9000.
On machines B / C, just run:
sudo ip link set enp6s16np0 mtu 9000
On machine A, where the soft switch runs, the Linux bridge and the related NICs all need to be set to MTU 9000:
sudo ip link set dev br0 mtu 9000
sudo ip link set dev enp1s0f0np0 mtu 9000
sudo ip link set dev enp1s0f1np1 mtu 9000
This only takes effect temporarily; after a reboot the MTU reverts to the default 1500. The fix is to put the commands in rc.local, as sketched below.
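A minimal sketch of such an rc.local (an assumption on my part: the rc-local service must be enabled on this Debian release, and the file must be executable; the same commands could instead go into post-up hooks in /etc/network/interfaces.d/br0):
#!/bin/sh -e
# /etc/rc.local: restore jumbo frames on the bridge and its member NICs at boot
ip link set dev br0 mtu 9000
ip link set dev enp1s0f0np0 mtu 9000
ip link set dev enp1s0f1np1 mtu 9000
exit 0
Make it executable with chmod +x /etc/rc.local so systemd's rc-local unit will run it.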
Performance testing
cx3 test
cx4 test
The 25G NIC links at 25G; iperf2 measures about 23.5 Gbits/sec of bandwidth, close to the 25G theoretical value.
I forget whether the MTU had been changed; I'll test again next time.
cx5 test
Using the cx5 NICs as an example: the soft-switch machine A (IP 192.168.0.100) has a dual-port cx5 card, and the two test machines B (IP 192.168.0.125) and C (IP 192.168.0.127) each have a single-port card, connected with 100G DAC cables. iperf and iperf3 are installed on all three machines.
iperf2 test
Start the iperf server on the soft-switch machine A:
iperf -s -p 10001
Start iperf clients on machines B and C:
iperf -c 192.168.0.100 -P 3 -t 10 -p 10001 -i 1
Direct-connection speed: 99.0 Gbits/sec, very close to the 100G theoretical maximum.
Note: without MTU 9000, i.e. with the default MTU 1500, it only reaches about 94 Gbits/sec.
[SUM] 0.0000-10.0001 sec 115 GBytes 99.0 Gbits/sec
Forwarding speed: start the iperf server on machine B, then connect to B from machine C. It reaches 76.9 Gbits/sec, which is still a sizeable loss compared with the 100G theoretical maximum.
iperf -c 192.168.0.125 -P 2 -t 10 -p 10001 -i 1
[SUM] 0.0000-6.9712 sec 62.4 GBytes 76.9 Gbits/sec
Note: -P 2, i.e. two parallel streams, gives the best result; adding more streams actually lowers it.
Notably, during these tests, top on the soft-switch machine A shows almost no CPU consumption.
In reality, though, the PVE VM view shows close to two cores in use.
So the goal we hoped for, having RDMA kick in during soft switching to reduce CPU consumption, has not been achieved. We therefore need to keep looking at how to enable RDMA.
iperf3 test
Similarly, the soft-switched speed measured with iperf3 is about 63.9 Gbits/sec:
$ iperf3 -c 192.168.0.125 -P 2 -t 10 -i 1
......
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 37.2 GBytes 31.9 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 37.2 GBytes 31.9 Gbits/sec receiver
[ 7] 0.00-10.00 sec 37.2 GBytes 31.9 Gbits/sec 0 sender
[ 7] 0.00-10.00 sec 37.2 GBytes 31.9 Gbits/sec receiver
[SUM] 0.00-10.00 sec 74.4 GBytes 63.9 Gbits/sec 0 sender
[SUM] 0.00-10.00 sec 74.4 GBytes 63.9 Gbits/sec receiver
top still shows very low CPU usage, but the PVE page shows about 5% CPU.
Adding the -Z flag to enable iperf3's zero-copy mode and retesting, the speed improves substantially to 77.8 Gbits/sec:
iperf3 -c 192.168.0.125 -P 2 -t 10 -i 1 -Z
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 45.3 GBytes 38.9 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 45.3 GBytes 38.9 Gbits/sec receiver
[ 7] 0.00-10.00 sec 45.3 GBytes 38.9 Gbits/sec 0 sender
[ 7] 0.00-10.00 sec 45.3 GBytes 38.9 Gbits/sec receiver
[SUM] 0.00-10.00 sec 90.6 GBytes 77.8 Gbits/sec 0 sender
[SUM] 0.00-10.00 sec 90.6 GBytes 77.8 Gbits/sec receiver
top again shows very low CPU usage, but the PVE page shows about 6.59% CPU.
3 - Open vSwitch
3.1 - Open vSwitch installation
Installing with apt
sudo apt install openvswitch-switch
Check the Open vSwitch (OVS) version:
$ sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
ovs_version: "3.1.0"
Building and installing from source
See the official documentation:
https://github.com/openvswitch/ovs/blob/main/Documentation/intro/install/debian.rst
https://docs.openvswitch.org/en/stable/intro/install/debian/
Downloading the source
mkdir -p ~/temp/
cd ~/temp/
git clone https://github.com/openvswitch/ovs.git
cd ovs
git checkout v3.5.0
Preparing to build
apt-get install build-essential fakeroot
Open the debian/control.in file in the ovs source tree, find the Build-Depends: section, and install every dependency listed there with apt:
sudo apt install autoconf automake bzip2 debhelper-compat dh-exec dh-python dh-sequence-python3 dh-sequence-sphinxdoc graphviz iproute2 libcap-ng-dev
sudo apt install libdbus-1-dev libnuma-dev libpcap-dev libssl-dev libtool libunbound-dev openssl pkg-config procps python3-all-dev python3-setuptools python3-sortedcontainers python3-sphinx
sudo apt install libdpdk-dev
Prepare the build:
$ ./boot.sh && ./configure --with-dpdk=shared && make debian
Check whether all build dependencies are satisfied:
$ dpkg-checkbuilddeps
dpkg-checkbuilddeps: error: Unmet build dependencies: libdpdk-dev (>= 24.11)
But the newest version apt can install is 22.11.7:
apt list -a libdpdk-dev
Listing... Done
libdpdk-dev/stable,stable-security,now 22.11.7-1~deb12u1 amd64 [installed]
So remove that version:
sudo apt remove libdpdk-dev
and install the latest libdpdk-dev manually instead.
Installing libdpdk-dev
Download page:
https://core.dpdk.org/download/
Find the DPDK 24.11.2 (LTS) release and download it:
cd ~/temp/
wget https://fast.dpdk.org/rel/dpdk-24.11.2.tar.xz
tar xvf dpdk-24.11.2.tar.xz
cd dpdk-stable-24.11.2
Reference: https://github.com/openvswitch/ovs/blob/main/Documentation/intro/install/debian.rst plus installation guidance from DeepSeek.
sudo apt install build-essential meson ninja-build python3-pyelftools libnuma-dev pkg-config
meson build
ninja -C build
cd build
sudo ninja install
# refresh the dynamic linker cache
sudo ldconfig
After installation, verify it:
pkg-config --modversion libdpdk
Output:
24.11.2
Continuing the installation
Prepare the build:
# change into the ovs source directory
cd ovs
$ ./boot.sh && ./configure --with-dpdk=shared && make debian
Check whether all build dependencies are satisfied:
$ dpkg-checkbuilddeps
dpkg-checkbuilddeps: error: Unmet build dependencies: libdpdk-dev (>= 24.11)
It is asking for libdpdk-dev, but what we installed from source only registers as libdpdk.
Locate the libdpdk.pc file produced by the build:
sudo find / -name "libdpdk.pc" 2>/dev/null
/home/sky/temp/dpdk-stable-24.11.2/build/meson-private/libdpdk.pc
/usr/local/lib/x86_64-linux-gnu/pkgconfig/libdpdk.pc
Copy libdpdk.pc into /usr/share/pkgconfig/:
sudo mkdir -p /usr/share/pkgconfig/
sudo cp /usr/local/lib/x86_64-linux-gnu/pkgconfig/libdpdk.pc /usr/share/pkgconfig/
Update PKG_CONFIG_PATH:
export PKG_CONFIG_PATH=/usr/share/pkgconfig:$PKG_CONFIG_PATH
Note: to make this permanent, add it to ~/.zshrc, for example:
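The line to append would look like this (assuming zsh is the login shell; ~/.bashrc works the same way):
# keep the source-installed DPDK visible to pkg-config in new shells
export PKG_CONFIG_PATH=/usr/share/pkgconfig:$PKG_CONFIG_PATH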
sudo mkdir -p /usr/include/dpdk
sudo ln -s /usr/local/include/* /usr/include/dpdk/
sudo ldconfig
Continue the build:
make debian-deb
This fails with:
dpkg-shlibdeps: error: no dependency information found for /usr/local/lib/x86_64-linux-gnu/librte_vhost.so.25 (used by debian/openvswitch-switch-dpdk/usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk)
Hint: check if the library actually comes from a package.
dh_shlibdeps: error: dpkg-shlibdeps -Tdebian/openvswitch-switch-dpdk.substvars debian/openvswitch-switch-dpdk/usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk returned exit code 255
dh_shlibdeps: error: Aborting due to earlier error
make[1]: *** [debian/rules:8: binary] Error 255
make[1]: Leaving directory '/home/sky/temp/ovs'
make: *** [Makefile:7300: debian-deb] Error 2
The cause: dpkg-shlibdeps is part of the Debian packaging tooling and automatically detects a binary's shared-library dependencies. Because DPDK was installed from source, dpkg has no idea which package librte_vhost.so.25 belongs to, hence the error.
Since producing a .deb package is not essential, we can simply install OVS directly (without packaging it as a .deb), which avoids triggering the dpkg-shlibdeps check.
sudo make install
Check the installed version:
$ ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 3.5.0
DPDK 24.11.2
Installing without DPDK
Without DPDK things are much simpler:
https://cloudspinx.com/build-open-vswitch-from-source-on-debian-and-ubuntu/
sudo dpkg -i ./openvswitch-common_3.5.0-1_amd64.deb
sudo dpkg -i ./openvswitch-common-dbgsym_3.5.0-1_amd64.deb
sudo dpkg -i ./openvswitch-switch_3.5.0-1_amd64.deb
sudo dpkg -i ./openvswitch-switch-dbgsym_3.5.0-1_amd64.deb
sudo apt install libxdp1 libxdp-dev
sudo apt install libfdt1 libfdt-dev
$ ovs-vsctl show
ovs-vsctl: unix:/usr/local/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
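That last error only means that no ovsdb-server is running yet for the source-installed copy under /usr/local. A minimal sketch of bringing the daemons up with the helper script shipped by the source install (path assumed to be the default /usr/local prefix):
# start ovsdb-server and ovs-vswitchd; ovs-ctl creates the config database
# on first start if it does not exist yet
sudo /usr/local/share/openvswitch/scripts/ovs-ctl start
# ovs-vsctl should now reach the database socket
sudo ovs-vsctl show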
3.2 - Simple soft switching with Open vSwitch
Background
The earlier Linux bridge soft switch works and its performance looks acceptable, but it has quite a few problems:
- it cannot reach the NIC's limit
- CPU usage is relatively high
- advanced features such as RDMA are missing
About the cx5
https://www.nvidia.com/en-us/networking/ethernet/connectx-5/
NVIDIA® Mellanox® ConnectX®-5 adapters offer advanced hardware offloads that reduce CPU resource consumption and deliver very high packet rates and throughput. This increases data-center infrastructure efficiency and provides the highest-performing, most flexible solution for Web 2.0, cloud, data analytics, and storage platforms.
The "advanced hardware offloads that reduce CPU resource consumption" mentioned here are exactly what I need.
Key features for maximum efficiency:
- NVIDIA RoCE technology encapsulates packet transport over Ethernet and lowers CPU load, providing a high-bandwidth, low-latency network infrastructure for network- and storage-intensive applications.
- The breakthrough NVIDIA ASAP² technology offloads the Open vSwitch datapath from the host CPU to the adapter, delivering innovative SR-IOV and VirtIO acceleration for extreme performance and scalability.
- ConnectX NICs use SR-IOV to separate access to physical resources and functions in virtualized environments, reducing I/O overhead and letting the NIC maintain near non-virtualized performance.
Key specifications:
- Up to 100 Gb/s Ethernet per port
- Adaptive routing on reliable transport
- NVMe over Fabrics (NVMe-oF) target offload
- Enhanced vSwitch / vRouter offloads
- Hardware offload of NVGRE- and VXLAN-encapsulated traffic
- End-to-end QoS and congestion control
cx5 datasheet:
https://network.nvidia.com/files/doc-2020/pb-connectx-5-en-card.pdf
Hardware acceleration basics
NVIDIA DPU whitepaper: SR-IOV vs. VirtIO acceleration performance comparison
https://aijishu.com/a/1060000000228117
Building the Open vSwitch soft switch
Installing Open vSwitch
sudo apt install openvswitch-switch
Check the Open vSwitch (OVS) version:
$ sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
ovs_version: "3.1.0"
And the openvswitch kernel module info:
$ sudo modinfo openvswitch
filename: /lib/modules/6.1.0-33-amd64/kernel/net/openvswitch/openvswitch.ko
alias: net-pf-16-proto-16-family-ovs_ct_limit
alias: net-pf-16-proto-16-family-ovs_meter
alias: net-pf-16-proto-16-family-ovs_packet
alias: net-pf-16-proto-16-family-ovs_flow
alias: net-pf-16-proto-16-family-ovs_vport
alias: net-pf-16-proto-16-family-ovs_datapath
license: GPL
description: Open vSwitch switching datapath
depends: nf_conntrack,nsh,nf_nat,nf_defrag_ipv6,nf_conncount,libcrc32c
retpoline: Y
intree: Y
name: openvswitch
vermagic: 6.1.0-33-amd64 SMP preempt mod_unload modversions
sig_id: PKCS#7
signer: Debian Secure Boot CA
sig_key: 32:A0:28:7F:84:1A:03:6F:A3:93:C1:E0:65:C4:3A:E6:B2:42:26:43
sig_hashalgo: sha256
signature: 7E:3C:EA:A0:18:FE:81:6D:2C:A8:08:8A:1D:BD:D5:13:F1:5D:FE:C4:
06:2C:3B:4B:B2:4A:6D:1E:30:AD:65:CC:DB:87:73:F4:D7:D5:30:76:
D6:FF:E2:77:28:0A:AA:17:92:C4:C5:DF:EC:E8:E5:95:88:B4:62:36:
AF:BF:58:96:D0:C1:ED:A3:6D:23:18:DD:A0:CF:A6:2C:6E:71:B2:83:
AD:45:F0:59:8D:FB:6D:8C:FB:9D:80:4D:0A:16:0A:9B:CE:A3:61:60:
BC:85:9D:EE:70:4D:5A:62:6E:E3:33:C1:58:2B:C4:CE:36:27:C9:A5:
BB:6C:7D:F3:B5:74:C8:FA:C3:5F:E5:1B:28:46:55:7E:26:0E:2A:7A:
54:4B:DD:74:E8:EA:40:43:2B:62:F6:DC:13:A6:A3:C6:EA:BF:1B:41:
2B:0A:92:01:2D:57:02:EA:0A:24:C9:75:EB:F3:34:41:35:D7:31:67:
65:96:9B:3B:65:47:1B:2E:60:97:E9:C3:40:10:9F:C6:91:EB:C4:DB:
0C:D5:5D:9C:99:ED:DF:3C:CA:B3:DB:61:44:A9:A0:C5:1D:1D:C8:CF:
01:39:D6:F3:FE:81:2D:43:2C:DE:F7:A1:06:E5:EE:79:31:DC:41:83:
59:BC:30:42:BB:68:C8:27:AF:AE:69:30:51:2E:02:6A
Check that the openvswitch module is loaded correctly:
$ lsmod | grep openvswitch
openvswitch 192512 0
nsh 16384 1 openvswitch
nf_conncount 24576 1 openvswitch
nf_nat 57344 1 openvswitch
nf_conntrack 188416 3 nf_nat,openvswitch,nf_conncount
nf_defrag_ipv6 24576 2 nf_conntrack,openvswitch
libcrc32c 16384 3 nf_conntrack,nf_nat,openvswitch
Check the openvswitch-switch service:
$ sudo systemctl status openvswitch-switch
● openvswitch-switch.service - Open vSwitch
Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; preset: enabled)
Active: active (exited) since Tue 2025-04-29 14:39:38 CST; 1min 54s ago
Process: 1816 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 1816 (code=exited, status=0/SUCCESS)
CPU: 340us
Apr 29 14:39:38 debian12 systemd[1]: Starting openvswitch-switch.service - Open vSwitch...
Apr 29 14:39:38 debian12 systemd[1]: Finished openvswitch-switch.service - Open vSwitch.
Check the ovsdb-server service:
$ sudo systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
Loaded: loaded (/lib/systemd/system/ovsdb-server.service; static)
Active: active (running) since Tue 2025-04-29 14:39:38 CST; 2min 28s ago
Process: 1718 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-vswitchd --no-monito>
Main PID: 1761 (ovsdb-server)
Tasks: 1 (limit: 19089)
Memory: 2.2M
CPU: 38ms
CGroup: /system.slice/ovsdb-server.service
└─1761 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:inf>
Apr 29 14:39:38 debian12 systemd[1]: Starting ovsdb-server.service - Open vSwitch Database Unit.>
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Backing up database to /etc/openvswitch/conf.db.backup8.>
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Compacting database.
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Converting database schema.
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Starting ovsdb-server.
Apr 29 14:39:38 debian12 ovs-vsctl[1762]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- >
Apr 29 14:39:38 debian12 ovs-vsctl[1767]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set>
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Configuring Open vSwitch system IDs.
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Enabling remote OVSDB managers.
Apr 29 14:39:38 debian12 systemd[1]: Started ovsdb-server.service - Open vSwitch Database Unit.
Check the ovs-vswitchd service:
$ sudo systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
Loaded: loaded (/lib/systemd/system/ovs-vswitchd.service; static)
Active: active (running) since Tue 2025-04-29 14:39:38 CST; 3min 28s ago
Process: 1771 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monito>
Main PID: 1810 (ovs-vswitchd)
Tasks: 1 (limit: 19089)
Memory: 2.1M
CPU: 17ms
CGroup: /system.slice/ovs-vswitchd.service
└─1810 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err ->
Apr 29 14:39:38 debian12 systemd[1]: Starting ovs-vswitchd.service - Open vSwitch Forwarding Uni>
Apr 29 14:39:38 debian12 ovs-ctl[1771]: Starting ovs-vswitchd.
Apr 29 14:39:38 debian12 ovs-ctl[1771]: Enabling remote OVSDB managers.
Apr 29 14:39:38 debian12 systemd[1]: Started ovs-vswitchd.service - Open vSwitch Forwarding Unit.
Creating the bridge
Before creating the bridge, look at the current network interfaces:
ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet 192.168.3.227/24 brd 192.168.3.255 scope global dynamic enp6s18
valid_lft 43156sec preferred_lft 43156sec
inet6 fdfb:ddbe:c71b:0:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft forever preferred_lft forever
inet6 240e:3a1:5055:c180:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft 6989sec preferred_lft 2523sec
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ec brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ed brd ff:ff:ff:ff:ff:ff
There are currently three ports: enp6s18, enp1s0f0np0, and enp1s0f1np1. enp6s18 is the cx4 25G NIC connected to the switch; enp1s0f0np0 and enp1s0f1np1 are the cx5 100G dual-port NIC, to be connected to the other two machines.
Create the OVS bridge:
sudo ovs-vsctl add-br br0
Running ip addr now shows ovs-system and br0:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet 192.168.3.227/24 brd 192.168.3.255 scope global dynamic enp6s18
valid_lft 43027sec preferred_lft 43027sec
inet6 fdfb:ddbe:c71b:0:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft forever preferred_lft forever
inet6 240e:3a1:5055:c180:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft 6860sec preferred_lft 2394sec
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ec brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ed brd ff:ff:ff:ff:ff:ff
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 46:39:3e:eb:b0:0b brd ff:ff:ff:ff:ff:ff
6: br0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 8a:1f:49:5c:10:4d brd ff:ff:ff:ff:ff:ff
Note: at this point the bridge's MAC address, 8a:1f:49:5c:10:4d, differs from all three NICs.
The bridge has no ports configured yet:
sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
Bridge br0
Port br0
Interface br0
type: internal
ovs_version: "3.1.0"
Show br0's details:
sudo ovs-vsctl list br br0
_uuid : a345132c-00d9-4196-bb2c-2489c3669b47
auto_attach : []
controller : []
datapath_id : "00002e1345a39641"
datapath_type : ""
datapath_version : "<unknown>"
external_ids : {}
fail_mode : []
flood_vlans : []
flow_tables : {}
ipfix : []
mcast_snooping_enable: false
mirrors : []
name : br0
netflow : []
other_config : {}
ports : [1e053d35-c79f-4b5b-88cc-f5cba9519ca0]
protocols : []
rstp_enable : false
rstp_status : {}
sflow : []
status : {}
stp_enable : false
First edit the /etc/network/interfaces file; make a backup copy to be safe:
sudo cp /etc/network/interfaces /etc/network/interfaces.bak
sudo vi /etc/network/interfaces
Content:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
source /etc/network/interfaces.d/*
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
#allow-hotplug enp6s18
#iface enp6s18 inet dhcp
# Open vSwitch Configuration
# active br0 interface
auto br0
allow-ovs br0
# config IP for interface br0
iface br0 inet static
address 192.168.3.227
netmask 255.255.255.0
gateway 192.168.3.1
ovs_type OVSBridge
ovs_ports enp6s18
# attach enp6s18 to bridge br0
allow-br0 enp6s18
iface enp6s18 inet manual
ovs_bridge br0
ovs_type OVSPort
Then add ports to the bridge, starting with enp6s18:
sudo ovs-vsctl add-port br0 enp6s18
After running this the network stops working; you can only continue from the PVE console.
With enp6s18 added, the bridge will take on enp6s18's MAC address:
sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
Bridge br0
Port br0
Interface br0
type: internal
Port enp6s18
Interface enp6s18
ovs_version: "3.1.0"
Restart networking:
sudo systemctl restart networking
The VM's network should now be back and you can SSH in again. Check the interfaces:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ec brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ed brd ff:ff:ff:ff:ff:ff
9: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 46:39:3e:eb:b0:0b brd ff:ff:ff:ff:ff:ff
10: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet 192.168.3.227/24 brd 192.168.3.255 scope global br0
valid_lft forever preferred_lft forever
inet6 fdfb:ddbe:c71b:0:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft forever preferred_lft forever
inet6 240e:3a1:5055:c180:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft 6477sec preferred_lft 2874sec
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
Note: br0's MAC address has now changed to enp6s18's MAC address, bc:24:11:1c:d5:48.
Check that the external network is reachable:
ping 192.168.3.1
ping www.baidu.com
If the external network is reachable, the network configuration succeeded and the Open vSwitch bridge is up and working.
Adding the other ports to the bridge
The bridge currently contains only enp6s18, which only provides external access; the other ports still need to be added. Update /etc/network/interfaces as follows:
......
# config IP for interface br0
iface br0 inet static
address 192.168.3.227
netmask 255.255.255.0
gateway 192.168.3.1
ovs_type OVSBridge
ovs_ports enp6s18 enp1s0f0np0 enp1s0f1np1
# attach enp6s18 to bridge br0
allow-br0 enp6s18
iface enp6s18 inet manual
ovs_bridge br0
ovs_type OVSPort
# attach enp1s0f0np0 to bridge br0
allow-br0 enp1s0f0np0
iface enp1s0f0np0 inet manual
ovs_bridge br0
ovs_type OVSPort
# attach enp1s0f1np1 to bridge br0
allow-br0 enp1s0f1np1
iface enp1s0f1np1 inet manual
ovs_bridge br0
ovs_type OVSPort
Restart networking:
sudo systemctl restart networking
Then check the bridge:
sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
Bridge br0
Port br0
Interface br0
type: internal
Port enp1s0f0np0
Interface enp1s0f0np0
Port enp6s18
Interface enp6s18
Port enp1s0f1np1
Interface enp1s0f1np1
ovs_version: "3.1.0"
The bridge now has all three ports and can act as a switch connecting other machines.
On another machine, pass the cx5 NIC through to a VM and connect it by cable directly to the dual-port cx5 NIC of the machine above. Note: remove the VM's vmbr NIC and use only the cx5 NIC.
Edit the network configuration file:
sudo vi /etc/network/interfaces
Add the following (choose either DHCP or a static address):
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
allow-hotplug enp1s0np0
iface enp1s0np0 inet dhcp
#iface enp1s0np0 inet static
#address 192.168.3.130
#netmask 255.255.255.0
#gateway 192.168.3.1
#dns-nameservers 192.168.3.1
Restart networking:
sudo systemctl restart networking
Check that the network now works.
Quick performance test
Run a quick test with iperf:
# run on the server side
iperf -s
# run on the client side
iperf -c 192.168.3.227 -i 1 -t 20 -P 4
Results:
[SUM] 11.0000-11.7642 sec 8.37 GBytes 94.1 Gbits/sec
[SUM] 0.0000-11.7642 sec 129 GBytes 94.1 Gbits/sec
Both sides reach 94.1 Gbits/sec, close to the 100G theoretical limit.
Setting the MTU
The bridge's MTU is currently 1500; set it to 9000:
sudo ip link set br0 mtu 9000
sudo ip link set enp1s0f0np0 mtu 9000
sudo ip link set enp1s0f1np1 mtu 9000
sudo ip link set enp6s18 mtu 9000
Note: all three NICs must be set. Setting only enp1s0f0np0 and enp1s0f1np1 but not enp6s18 brings no speed improvement.
Set the MTU on the client as well:
sudo ip link set enp1s0np0 mtu 9000
Retesting, the iperf result rises from 94.1 Gbits/sec to 98.8 Gbits/sec, a noticeable improvement.
[SUM] 0.0000-20.0004 sec 230 GBytes 98.8 Gbits/sec
However, set this way the MTU reverts to 1500 after a reboot.
Instead, edit the network configuration file:
sudo vi /etc/network/interfaces
On the server side, set all three NICs to MTU 9000:
# attach enp6s18 to bridge br0
allow-br0 enp6s18
iface enp6s18 inet manual
ovs_bridge br0
ovs_type OVSPort
mtu 9000
# attach enp1s0f0np0 to bridge br0
allow-br0 enp1s0f0np0
iface enp1s0f0np0 inet manual
ovs_bridge br0
ovs_type OVSPort
mtu 9000
# attach enp1s0f1np1 to bridge br0
allow-br0 enp1s0f1np1
iface enp1s0f1np1 inet manual
ovs_bridge br0
ovs_type OVSPort
mtu 9000
On the client side:
allow-hotplug enp1s0np0
iface enp1s0np0 inet static
address 192.168.3.205
netmask 255.255.255.0
gateway 192.168.3.1
dns-nameservers 192.168.3.1
mtu 9000
Important: testing shows that with a static address this MTU 9000 setting takes effect, but with a DHCP address it does not and the interface stays at 1500. Presumably the DHCP lease supplies an MTU of 1500.
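If DHCP has to be used, one possible workaround (a sketch, assuming the ISC dhclient that Debian uses by default) is to override the lease-supplied MTU in /etc/dhcp/dhclient.conf:
# /etc/dhcp/dhclient.conf: ignore the interface-mtu option offered in the lease
supersede interface-mtu 9000;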
Restart networking:
sudo systemctl restart networking
sudo reboot
Enabling hardware offload
sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
sudo systemctl restart networking
sudo systemctl restart openvswitch-switch
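To confirm the option took effect and to see whether flows are actually being offloaded to the NIC, something like the following can be used (a sketch; the type=offloaded filter is available in recent OVS releases):
# read back the switch-level option
sudo ovs-vsctl get Open_vSwitch . other_config:hw-offload
# list datapath flows that have been offloaded to hardware
sudo ovs-appctl dpctl/dump-flows type=offloaded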
References:
https://infoloup.no-ip.org/openvswitch-debian12-creation/
https://docs.openstack.org/neutron/latest/admin/config-ovs-offload.html
https://www.openvswitch.org/support/ovscon2019/day2/0951-hw_offload_ovs_con_19-Oz-Mellanox.pdf
https://docs.nvidia.com/doca/archive/doca-v2.2.0/switching-support/index.html
3.3 - TODO
Documentation
https://docs.nvidia.com/doca/archive/doca-v2.2.0/switching-support/index.html
Hardware Offload
4 - NFS network file system
NFS server side
Installing the NFS server
sudo apt install nfs-kernel-server -y
Preparing the disk and partition
Pass a 3.84T KIOXIA CD6 PCIe 4.0 SSD through to the VM. Check the SSD:
lspci | grep Non-Volatile
02:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller Cx6 (rev 01)
Partition the disk:
sudo fdisk /dev/nvme0n1
In fdisk: g to switch to a GPT partition table, p to print the partition table, n to create a new partition (only one big partition is created here).
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 512G 0 disk
├─sda1 8:1 0 512M 0 part /boot/efi
├─sda2 8:2 0 465.7G 0 part /
└─sda3 8:3 0 45.8G 0 part /timeshift
nvme0n1 259:0 0 3.5T 0 disk
└─nvme0n1p1 259:1 0 3.5T 0 part
Format the partition as ext4:
sudo mkfs.ext4 /dev/nvme0n1p1
Find the SSD partition's UUID:
$ sudo lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1 vfat FAT32 BE75-FC62 505.1M 1% /boot/efi
├─sda2 ext4 1.0 81fdaf25-6712-48ee-bb53-1c4a78c8ef9f 430.4G 1% /
└─sda3 ext4 1.0 4b922cfb-2123-48ce-b9fe-635e73fb6aa8 39G 8% /timeshift
nvme0n1
└─nvme0n1p1 ext4 1.0 1dee904a-aa51-4180-b65b-9449405b841f
Prepare to mount the disk:
sudo vi /etc/fstab
Add the following:
# data storage was on /dev/nvme0n1p1(3.84T)
UUID=1dee904a-aa51-4180-b65b-9449405b841f /mnt/data ext4 defaults 0 2
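The /mnt/data mount point has to exist before this fstab entry can be used; a quick way to create it and mount without rebooting (a minimal sketch):
sudo mkdir -p /mnt/data
sudo mount -a          # mount everything listed in /etc/fstab
df -h /mnt/data        # confirm the new filesystem is mounted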
After a reboot, check again:
$ sudo lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1 vfat FAT32 BE75-FC62 505.1M 1% /boot/efi
├─sda2 ext4 1.0 81fdaf25-6712-48ee-bb53-1c4a78c8ef9f 430.4G 1% /
└─sda3 ext4 1.0 4b922cfb-2123-48ce-b9fe-635e73fb6aa8 39G 8% /timeshift
nvme0n1
└─nvme0n1p1 ext4 1.0 1dee904a-aa51-4180-b65b-9449405b841f 3.3T 0% /mnt/data
Preparing the pseudo filesystem
For easier management later on, use a pseudo filesystem layout:
sudo mkdir -p /mnt/data/shared
sudo chown -R nobody:nogroup /mnt/data/shared
cd /mnt/data
Create the export directory:
sudo mkdir -p /exports/shared
sudo chown -R nobody:nogroup /exports
Edit /etc/fstab to bind-mount the shared directory into the exports tree:
sudo vi /etc/fstab
Add the following:
# nfs exports
/mnt/data/shared /exports/shared none bind
Configuring the NFS export
sudo vi /etc/exports
Add the following:
/exports/shared 192.168.0.0/16(rw,sync,no_subtree_check,no_root_squash)
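After editing /etc/exports, the export table can also be applied and inspected without a full service restart; a small sketch:
sudo exportfs -ra   # re-read /etc/exports and apply the changes
sudo exportfs -v    # list the active exports with their options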
Restart nfs-kernel-server and check its status:
sudo systemctl restart nfs-kernel-server
sudo systemctl status nfs-kernel-server
Verify:
ps -ef | grep nfs
Output:
root 918 1 0 01:25 ? 00:00:00 /usr/sbin/nfsdcld
root 1147 2 0 01:26 ? 00:00:00 [nfsd]
root 1148 2 0 01:26 ? 00:00:00 [nfsd]
root 1149 2 0 01:26 ? 00:00:00 [nfsd]
root 1150 2 0 01:26 ? 00:00:00 [nfsd]
root 1151 2 0 01:26 ? 00:00:00 [nfsd]
root 1152 2 0 01:26 ? 00:00:00 [nfsd]
root 1153 2 0 01:26 ? 00:00:00 [nfsd]
root 1154 2 0 01:26 ? 00:00:00 [nfsd]
Check the current export list:
$ sudo showmount -e
Export list for debian12:
/exports/shared 192.168.0.0/16
Configuring supported NFS versions
Check which NFS versions the server currently supports:
sudo cat /proc/fs/nfsd/versions
By default the NFS server supports:
+3 +4 +4.1 +4.2
Usually we only need to keep NFS 4.2 and drop the rest:
sudo vi /etc/nfs.conf
Change the default [nfsd] section:
[nfsd]
# debug=0
# threads=8
# host=
# port=0
# grace-time=90
# lease-time=90
# udp=n
# tcp=y
# vers3=y
# vers4=y
# vers4.0=y
# vers4.1=y
# vers4.2=y
# rdma=n
# rdma-port=20049
to:
[nfsd]
# debug=0
threads=32
# host=
# port=0
# grace-time=90
# lease-time=90
# udp=n
# tcp=y
vers3=n
vers4=y
vers4.0=n
vers4.1=n
vers4.2=y
# rdma=n
# rdma-port=20049
While at it, raise the NFS thread count from the default 8 to 32.
Restart nfs-kernel-server:
sudo systemctl restart nfs-kernel-server
Then verify the NFS versions:
sudo cat /proc/fs/nfsd/versions
Output:
-3 +4 -4.0 -4.1 +4.2
Note: +4 must stay enabled. Only when +4 is set can 4.0/4.1/4.2 be toggled individually; with -4, none of 4.0/4.1/4.2 are available.
The NFS version can also be checked with rpcinfo:
rpcinfo -p localhost
Output:
program vers proto port service
......
100003 4 tcp 2049 nfs
The 4 here means NFS 4.x, but it does not distinguish between 4.0/4.1/4.2.
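Once a client has mounted the export (next section), the exact negotiated version can be read from the client side; a small sketch, assuming nfs-common is installed there:
# on the client: list mounted NFS filesystems with their options,
# including the negotiated vers= value
nfsstat -m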
NFS client side
Installing nfs-common
Install nfs-common as the NFS client:
sudo apt-get install nfs-common
Configuring NFS access
Prepare the mount point:
cd /mnt
sudo mkdir -p nfs
Mount NFS without RDMA:
sudo mount -t nfs 192.168.3.227:/exports/shared /mnt/nfs
After mounting, test the read/write speed:
cd nfs
# writing 10G of data over NFS runs at roughly 610 MB/s
sudo dd if=/dev/zero of=./test-10g.img bs=1G count=10 oflag=dsync
10737418240 bytes (11 GB, 10 GiB) copied, 17.6079 s, 610 MB/s
# reading 100G of data over NFS runs at roughly 1.1 GB/s
sudo dd if=./test-100g.img of=/dev/null bs=1G count=100 iflag=dsync
107374182400 bytes (107 GB, 100 GiB) copied, 96.5171 s, 1.1 GB/s
For comparison, reading/writing 100G of data directly on the disk on the NFS server:
# writing 100G directly to the disk runs at roughly 1.3 GB/s
sudo dd if=/dev/zero of=./test-100g.img bs=1G count=100 oflag=dsync
107374182400 bytes (107 GB, 100 GiB) copied, 82.5747 s, 1.3 GB/s
# reading 100G directly from the disk runs at roughly 4.0 GB/s
sudo dd if=./test-100g.img of=/dev/null bs=1G count=100 iflag=dsync
107374182400 bytes (107 GB, 100 GiB) copied, 26.9138 s, 4.0 GB/s
The write performance gap is large (1.3 GB/s down to 610 MB/s), presumably because the traffic goes through the PVE vmbr NIC. The read gap is even larger (4.0 GB/s down to 1.1 GB/s).
The network share between the NFS server and client is now configured.
5 - NFS RDMA support
Preparation
Prerequisites
- NFS server and client already configured: see the previous chapter
- Disk already configured: see the previous chapter
Installing the MLNX_OFED driver
For the cx5 100G NIC see: https://skyao.io/learning-computer-hardware/nic/cx5-huawei-sp350/driver/debian12/
The NIC driver must be installed on both the NFS server and the client.
Direct NIC connection
To avoid bridging through a Linux bridge or OVS, connect the 100G NIC ports of the two servers directly with a DAC cable (Open vSwitch will be tried later).
Set the NIC address on the server side:
sudo vi /etc/network/interfaces
Add:
allow-hotplug enp1s0f1np1
# iface enp1s0f1np1 inet dhcp
iface enp1s0f1np1 inet static
address 192.168.119.1
netmask 255.255.255.0
gateway 192.168.3.1
dns-nameservers 192.168.3.1
Restart networking:
sudo systemctl restart networking
Set the NIC address on the client side:
sudo vi /etc/network/interfaces
Add:
allow-hotplug enp1s0np0
# iface enp1s0np0 inet dhcp
iface enp1s0np0 inet static
address 192.168.119.2
netmask 255.255.255.0
gateway 192.168.119.1
dns-nameservers 192.168.119.1
Restart networking:
sudo systemctl restart networking
Verify the two machines can reach each other:
ping 192.168.119.1
ping 192.168.119.2
Verify the network speed between the two machines. Start the iperf server on the server side:
iperf -s 192.168.119.1
Run the iperf test from the client:
iperf -c 192.168.119.1 -i 1 -t 20 -P 4
The measured speed is about 94.1 Gbits/sec, fairly close to the 100G NIC's theoretical maximum:
[ 4] 0.0000-20.0113 sec 39.2 GBytes 16.8 Gbits/sec
[ 2] 0.0000-20.0111 sec 70.1 GBytes 30.1 Gbits/sec
[ 3] 0.0000-20.0113 sec 70.8 GBytes 30.4 Gbits/sec
[ 1] 0.0000-20.0111 sec 39.2 GBytes 16.8 Gbits/sec
[SUM] 0.0000-20.0007 sec 219 GBytes 94.1 Gbits/sec
Network connectivity between the NFS server and client is now in good shape, and NFS configuration can begin.
Because two NICs are now active, the default route sometimes goes wrong and external access breaks. Check the routing table:
$ ip route
default via 192.168.3.1 dev enp1s0f1np1 onlink
192.168.3.0/24 dev enp6s18 proto kernel scope link src 192.168.3.227
192.168.119.0/24 dev enp1s0f1np1 proto kernel scope link src 192.168.119.1
The default route points to 192.168.3.1 but goes out through enp1s0f1np1 rather than enp6s18.
Fix the default route manually:
sudo ip route del default
sudo ip route add default dev enp6s18
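The wrong default route comes from the gateway (and dns) lines in the static stanza for the point-to-point interface. A sketch of how the server-side stanza above could be written so that enp1s0f1np1 never installs a default route (assuming 192.168.119.0/24 is only used for the direct link):
# /etc/network/interfaces: point-to-point link without gateway/dns lines,
# so the default route stays on enp6s18
allow-hotplug enp1s0f1np1
iface enp1s0f1np1 inet static
    address 192.168.119.1
    netmask 255.255.255.0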
Installing nfsrdma
sudo cp ./mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb /tmp/
cd /tmp
sudo apt-get install ./mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb
Output:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'mlnx-nfsrdma-dkms' instead of './mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb'
The following NEW packages will be installed:
mlnx-nfsrdma-dkms
0 upgraded, 1 newly installed, 0 to remove and 2 not upgraded.
Need to get 0 B/71.2 kB of archives.
After this operation, 395 kB of additional disk space will be used.
Get:1 /home/sky/temp/mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb mlnx-nfsrdma-dkms all 24.10.OFED.24.10.2.1.8.1-1 [71.2 kB]
Selecting previously unselected package mlnx-nfsrdma-dkms.
(Reading database ... 74820 files and directories currently installed.)
Preparing to unpack .../mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb ...
Unpacking mlnx-nfsrdma-dkms (24.10.OFED.24.10.2.1.8.1-1) ...
Setting up mlnx-nfsrdma-dkms (24.10.OFED.24.10.2.1.8.1-1) ...
Loading new mlnx-nfsrdma-24.10.OFED.24.10.2.1.8.1 DKMS files...
First Installation: checking all kernels...
Building only for 6.1.0-31-amd64
Building for architecture x86_64
Building initial module for 6.1.0-31-amd64
Done.
Forcing installation of mlnx-nfsrdma
rpcrdma.ko:
Running module version sanity check.
- Original module
- Installation
- Installing to /lib/modules/6.1.0-31-amd64/updates/dkms/
svcrdma.ko:
Running module version sanity check.
- Original module
- Installation
- Installing to /lib/modules/6.1.0-31-amd64/updates/dkms/
xprtrdma.ko:
Running module version sanity check.
- Original module
- Installation
- Installing to /lib/modules/6.1.0-31-amd64/updates/dkms/
depmod...
If the installation ends with an error like:
N: Download is performed unsandboxed as root as file '/home/sky/temp/mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb' couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied)
The Permission denied message appears because the _apt user cannot access the downloaded .deb file directly (e.g. under /home/sky/temp/); copy it to /tmp/ first.
Reboot.
Checking DKMS status
Confirm the DKMS module is registered:
sudo dkms status | grep nfsrdma
Output:
mlnx-nfsrdma/24.10.OFED.24.10.2.1.8.1, 6.1.0-31-amd64, x86_64: installed
Configuring the NFS server (RDMA)
To be safe, especially after a kernel upgrade, install the kernel headers and rebuild the nfsrdma module first:
sudo apt install linux-headers-$(uname -r)
sudo dkms autoinstall
Checking that the RDMA modules are loaded
Run the following to check whether the modules are loaded:
lsmod | grep -E "rpcrdma|svcrdma|xprtrdma|ib_core|mlx5_ib"
Expected output:
xprtrdma 16384 0
svcrdma 16384 0
rpcrdma 94208 0
rdma_cm 139264 2 rpcrdma,rdma_ucm
mlx5_ib 495616 0
ib_uverbs 188416 2 rdma_ucm,mlx5_ib
ib_core 462848 9 rdma_cm,ib_ipoib,rpcrdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
sunrpc 692224 18 nfsd,rpcrdma,auth_rpcgss,lockd,nfs_acl
mlx5_core 2441216 1 mlx5_ib
mlx_compat 20480 15 rdma_cm,ib_ipoib,mlxdevm,rpcrdma,mlxfw,xprtrdma,iw_cm,svcrdma,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
If not, load the modules manually (in practice this turned out to be unnecessary; the system loads them automatically):
sudo modprobe rpcrdma
sudo modprobe svcrdma
sudo modprobe xprtrdma
sudo modprobe ib_core
sudo modprobe mlx5_ib
Confirm the DKMS module is registered:
sudo dkms status | grep nfsrdma
Output:
mlnx-nfsrdma/24.10.OFED.24.10.2.1.8.1, 6.1.0-31-amd64, x86_64: installed
Check the kernel log:
sudo dmesg | grep rdma
[ 178.512334] RPC: Registered rdma transport module.
[ 178.512336] RPC: Registered rdma backchannel transport module.
[ 178.515613] svcrdma: svcrdma is obsoleted, loading rpcrdma instead
[ 178.552178] xprtrdma: xprtrdma is obsoleted, loading rpcrdma instead
Testing the RDMA connection
On the server, run:
ibstatus
Here one port, mlx5_1, is up:
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:1e34:daff:fe5a:1fec
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 100 Gb/sec (4X EDR)
link_layer: Ethernet
Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:1e34:daff:fe5a:1fed
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: Ethernet
Or run ibdev2netdev, which shows:
$ ibdev2netdev
mlx5_0 port 1 ==> enp1s0f0np0 (Down)
mlx5_1 port 1 ==> enp1s0f1np1 (Up)
So start the ib_write_bw test on mlx5_1:
ib_write_bw -d mlx5_1 -p 18515
It shows:
************************************
* Waiting for client to connect... *
************************************
On the client, run:
ibstatus
It shows:
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:8e2a:8eff:fe88:a136
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: Ethernet
So start the ib_write_bw test on mlx5_0, pointing at the server:
ib_write_bw -d mlx5_0 -p 18515 192.168.119.1
The client reports:
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x009b PSN 0x57bb8f RKey 0x1fd4bc VAddr 0x007f0e0c8aa000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:02
remote address: LID 0000 QPN 0x0146 PSN 0x860b7f RKey 0x23c6bc VAddr 0x007fb25aea4000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:01
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 5000 11030.78 11030.45 0.176487
---------------------------------------------------------------------------------------
The server reports:
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0146 PSN 0x860b7f RKey 0x23c6bc VAddr 0x007fb25aea4000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:01
remote address: LID 0000 QPN 0x009b PSN 0x57bb8f RKey 0x1fd4bc VAddr 0x007f0e0c8aa000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 5000 11030.78 11030.45 0.176487
---------------------------------------------------------------------------------------
The test succeeds, so the RDMA network is ready.
Enabling RDMA support on the NFS server
The NFS server needs RDMA support enabled:
sudo vi /etc/nfs.conf
Edit the [nfsd] section of /etc/nfs.conf (note: also uncomment the version lines, restricting the server to 4.2 only):
[nfsd]
# debug=0
# raise the thread count; the default of 8 is too low
threads=128
# host=
# port=0
# grace-time=90
# lease-time=90
# udp=n
# tcp=y
# these version lines must be set; keep only 4.2
vers3=n
vers4=y
vers4.0=n
vers4.1=n
vers4.2=y
rdma=y
rdma-port=20049
Then restart the NFS service:
sudo systemctl restart nfs-server
Then check nfsd's listening ports to verify that RDMA is active:
sudo cat /proc/fs/nfsd/portlist
rdma 20049
rdma 20049
tcp 2049
tcp 2049
If the rdma 20049 entries appear, RDMA is configured successfully. If not, the RDMA setup failed and the earlier configuration needs to be re-checked.
Configuring the NFS client (RDMA)
Checking the RDMA modules
Make sure the RDMA-related modules are loaded on the client as well:
lsmod | grep -E "rpcrdma|svcrdma|xprtrdma|ib_core|mlx5_ib"
If not, load the modules manually (in practice this turned out to be unnecessary; the system loads them automatically):
sudo modprobe rpcrdma
sudo modprobe svcrdma
sudo modprobe xprtrdma
sudo modprobe ib_core
sudo modprobe mlx5_ib
Confirm the DKMS module is registered:
sudo dkms status | grep nfsrdma
Expected output:
mlnx-nfsrdma/24.10.OFED.24.10.2.1.8.1, 6.1.0-31-amd64, x86_64: installed
Check the kernel log:
sudo dmesg | grep rdma
Expected output:
[ 3273.613081] RPC: Registered rdma transport module.
[ 3273.613082] RPC: Registered rdma backchannel transport module.
[ 3695.887144] svcrdma: svcrdma is obsoleted, loading rpcrdma instead
[ 3695.923962] xprtrdma: xprtrdma is obsoleted, loading rpcrdma instead
Configuring NFS access (RDMA)
Prepare the mount point:
cd /mnt
sudo mkdir -p nfs-rdma
Mount NFS with RDMA:
sudo mount -t nfs -o rdma,port=20049,vers=4.2,nconnect=16 192.168.119.1:/exports/shared /mnt/nfs-rdma
After mounting, test the read/write speed:
# writing 100G of data over NFS runs at roughly 1.0 GB/s
sudo dd if=/dev/zero of=./test-100g.img bs=1G count=100 oflag=dsync
107374182400 bytes (107 GB, 100 GiB) copied, 104.569 s, 1.0 GB/s
# reading 100G of data over NFS runs at roughly 2.2 GB/s
sudo dd if=./test-100g.img of=/dev/null bs=1G count=100 iflag=dsync
107374182400 bytes (107 GB, 100 GiB) copied, 47.4093 s, 2.2 GB/s
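To make the RDMA mount survive reboots, an /etc/fstab entry on the client along these lines could be used (a sketch; the _netdev option is assumed so the mount waits for the network):
# /etc/fstab on the client
192.168.119.1:/exports/shared  /mnt/nfs-rdma  nfs  rdma,port=20049,vers=4.2,nconnect=16,_netdev  0  0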
Comparing the three cases, all against the same KIOXIA CD6 3.84T SSD passed through to the VM:
| Scenario | 100G single-file write | 100G single-file read |
|---|---|---|
| Direct disk access | 1.3 GB/s | 4.0 GB/s |
| NFS mount (non-RDMA) | 1.0 GB/s | 2.5 GB/s |
| NFS mount (RDMA) | 1.0 GB/s | 2.2 GB/s |