Recently we got a few new servers, all with identical configuration. Each has dual E5-2620 v3 2.4 GHz CPUs, 128 GiB RAM (8 x 16 GiB DDR4 DIMMs), one dual-port 40G XL710, and two dual-port 10G SFP+ mezzanine cards (i.e. 4 x 10G SFP+ ports). All of them run CentOS 7.1 x86_64. The XL710s are connected to the 40G ports of QCT LY8 switches using genuine Intel QSFP+ DACs. All 10G SFP+ ports are connected to Arista 7280SE-68 switches, but using third-party DACs. So far, all systems have been only minimally tuned:
- In each server's BIOS, the pre-defined "High Performance" profile is selected; in addition, Intel I/OAT is enabled and VT-d is disabled (we don't need to run virtual machines; these servers are for HPC applications).
- In CentOS on each server, the active tuned profile is set to network-throughput.
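For reference, this is just the stock profile applied and verified with the usual tuned-adm commands, nothing custom on top:

# tuned-adm profile network-throughput
# tuned-adm active
Current active profile: network-throughput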
After the servers were set up, we have been using iperf3 to run long tests among them. So far, we have observed consistent packet drops on the receiving side. An example:
[root@sc2u1n0 ~]# netstat -i
Kernel Interface table
Iface      MTU      RX-OK  RX-ERR  RX-DRP  RX-OVR        TX-OK  TX-ERR  TX-DRP  TX-OVR  Flg
ens10f0   9000  236406987       0       0       0    247785514       0       0       0  BMRU
ens1f0    9000  363116387       0    2391       0   2370529766       0       0       0  BMRU
ens1f1    9000  382484140       0    2248       0   2098335636       0       0       0  BMRU
ens20f0   9000  565532361       0    2258       0   1472188440       0       0       0  BMRU
ens20f1   9000  519587804       0    4225       0   5471601950       0       0       0  BMRU
lo       65536   19058603       0       0       0     19058603       0       0       0  LRU
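If more detail helps, the obvious next step on our side would be to look at the driver-level counters and ring sizes on one of the affected interfaces, for example (ens1f0 is one of the XL710 ports; the exact counter names vary by driver):

# ethtool -S ens1f0 | grep -iE 'drop|miss|error'
# ethtool -g ens1f0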
We have also observed iperf3 retransmits (the Retr column) at the beginning of a test session and, less often, during a session. Two examples:
40G pairs:
$ iperf3 -c 192.168.11.100 -i 1 -t 10
Connecting to host 192.168.11.100, port 5201
[ 4] local 192.168.11.103 port 59351 connected to 192.168.11.100 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 2.77 GBytes 23.8 Gbits/sec 54 655 KBytes
[ 4] 1.00-2.00 sec 4.26 GBytes 36.6 Gbits/sec 0 1.52 MBytes
[ 4] 2.00-3.00 sec 4.61 GBytes 39.6 Gbits/sec 0 2.12 MBytes
[ 4] 3.00-4.00 sec 4.53 GBytes 38.9 Gbits/sec 0 2.57 MBytes
[ 4] 4.00-5.00 sec 4.00 GBytes 34.4 Gbits/sec 7 1.42 MBytes
[ 4] 5.00-6.00 sec 4.61 GBytes 39.6 Gbits/sec 0 2.01 MBytes
[ 4] 6.00-7.00 sec 4.61 GBytes 39.6 Gbits/sec 0 2.47 MBytes
[ 4] 7.00-8.00 sec 4.61 GBytes 39.6 Gbits/sec 0 2.88 MBytes
[ 4] 8.00-9.00 sec 4.61 GBytes 39.6 Gbits/sec 0 3.21 MBytes
[ 4] 9.00-10.00 sec 4.61 GBytes 39.6 Gbits/sec 0 3.52 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 43.2 GBytes 37.1 Gbits/sec 61 sender
[ 4] 0.00-10.00 sec 43.2 GBytes 37.1 Gbits/sec receiver
82599-based 10G pairs:
$ iperf3 -c 192.168.15.100 -i 1 -t 10
Connecting to host 192.168.15.100, port 5201
[ 4] local 192.168.16.101 port 53464 connected to 192.168.15.100 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 1.05 GBytes 9.05 Gbits/sec 722 1.97 MBytes
[ 4] 1.00-2.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.80 MBytes
[ 4] 2.00-3.00 sec 1.10 GBytes 9.42 Gbits/sec 23 2.15 MBytes
[ 4] 3.00-4.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.16 MBytes
[ 4] 4.00-5.00 sec 1.09 GBytes 9.41 Gbits/sec 0 2.16 MBytes
[ 4] 5.00-6.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.17 MBytes
[ 4] 6.00-7.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.18 MBytes
[ 4] 7.00-8.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.22 MBytes
[ 4] 8.00-9.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.27 MBytes
[ 4] 9.00-10.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.34 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 10.9 GBytes 9.38 Gbits/sec 745 sender
[ 4] 0.00-10.00 sec 10.9 GBytes 9.37 Gbits/sec receiver
Looking around, I ran into the 40G NIC tuning article on the DOE Energy Sciences Network (ESnet) Fasterdata site, which says: "At the present time (February 2015), CPU clock rate still matters a lot for 40G hosts. In general, higher CPU clock rate is far more important than high core count for a 40G host. In general, you can expect it to be very difficult to achieve 40G performance with a CPU that runs more slowly than 3GHz per core." We don't have such fast CPUs. The E5-2620 v3 is a mid-range CPU from the Basic category, not even the Performance category. So,
- Are our servers too rich in NICs but under-powered CPU-wise?
- Is there anything we can do to get these servers to behave at least reasonably, especially to stop dropping packets?
BTW, a few days ago we updated all servers to the most recent stable Intel i40e and ixgbe drivers, but we have not run the set_irq_affinity script yet. Nor have we tuned the NICs (e.g., adjusting the rx-usecs value). The reason is that each server runs two highly concurrent applications which tend to use all the cores, and we are afraid that pinning interrupts with the set_irq_affinity script may negatively impact the performance of our applications. But if Intel folks consider running the script beneficial, we are willing to try; a rough sketch of what we would run is below.
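This is only a sketch: the set_irq_affinity invocation assumes the script shipped in the scripts/ directory of the Intel driver source tarball (its options vary between driver versions), and the rx-usecs value of 100 is just an illustrative starting point that we have not validated:

# cd i40e-<version>/scripts
# ./set_irq_affinity local ens1f0 ens1f1
# ethtool -C ens1f0 adaptive-rx off rx-usecs 100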
Regards,
-- Zack