Hi,
I do *nix drivers for a popular 10GbE NIC. Many of our customers use
iperf to benchmark their 10GbE networks. Unfortunately, our customers
running non-Linux OSes tend to see much lower performance from iperf
than from netperf. I spent a little time analyzing this on FreeBSD.
Please note that we've seen similar behavior on Solaris.
On a pair of very low-end 2.0GHz Athlon64 X2s running FreeBSD 7.0, I
see the following for netperf and iperf:
% netperf242 -Hrome-xgbe -P0
65536 32768 32768 10.00 9817.08
% iperf -c rome-xgbe
------------------------------------------------------------
Client connecting to rome-xgbe, TCP port 5001
TCP window size: 32.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.15 port 61387 connected with 192.168.1.16 port 5001
[ 3] 0.0-10.0 sec 4.85 GBytes 4.17 Gbits/sec
Iperf reports 4.17 Gbits/sec, less than half of the roughly 9.8
Gbits/sec reported by netperf. Why?
To explore this, I used the FreeBSD ktrace / kdump system call
tracing tools (somewhat similar to strace or truss) to look "under
the hood" and see what the benchmarks are really doing. As you can
see, netperf sits in a tight loop, doing nothing but writing to the
network in (by default) 32KB (socket-buffer-sized) chunks:
1281 netperf CALL sendto(0x4,0x800a29000,0x8000,0,0,0)
1281 netperf RET sendto 32768/0x8000
1281 netperf CALL sendto(0x4,0x800a32000,0x8000,0,0,0)
1281 netperf RET sendto 32768/0x8000
1281 netperf CALL sendto(0x4,0x800a29000,0x8000,0,0,0)
1281 netperf RET sendto 32768/0x8000
1281 netperf CALL sendto(0x4,0x800a32000,0x8000,0,0,0)
1281 netperf RET sendto 32768/0x8000
1281 netperf CALL sendto(0x4,0x800a29000,0x8000,0,0,0)
1281 netperf RET sendto 32768/0x8000
1281 netperf CALL sendto(0x4,0x800a32000,0x8000,0,0,0)
Iperf, by contrast, makes many apparently expensive system calls, only
a small fraction of which actually write to the network, and those
writes are in 8KB chunks:
1284 iperf RET write 8192/0x2000
1284 iperf CALL kse_wakeup(0x800d110a8)
1284 iperf RET kse_wakeup 0
1284 iperf RET kse_release 0
1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
1284 iperf CALL gettimeofday(0x800f010b8,0)
1284 iperf RET fork 0
1284 iperf RET nanosleep 0
1284 iperf CALL kse_release(0x800d59f20)
1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
1284 iperf RET kse_release 0
1284 iperf RET nanosleep 0
1284 iperf CALL write(0x3,0x800f03000,0x2000)
1284 iperf RET fork 0
1284 iperf CALL kse_release(0x800d10f20)
1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
1284 iperf RET write 8192/0x2000
1284 iperf RET kse_release 0
1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
1284 iperf RET nanosleep 0
1284 iperf RET fork 0
1284 iperf RET fork 0
1284 iperf CALL kse_release(0x800d59f20)
Most of this extra work is apparently cheap on Linux (where I assume
iperf was written), but it is quite expensive elsewhere. For example,
most gettimeofday() calls do not even enter the kernel in recent
Linux distros. On FreeBSD (and many other *nixes), they are very slow
because they actually read the system timecounter hardware, which
usually involves a slow programmed I/O (PIO) read. The fact that
iperf uses a tiny 8KB default socket read/write size magnifies the
impact of all these system calls made in the I/O loop.
I took a look at the iperf source in hopes of understanding why it
runs 2 threads and timestamps every single socket write, but I
confess that I'm not sure I understand the purpose, at least for TCP.
Judging from what is reported to the user, it *seems* that simply
taking an initial timestamp at the start of the test and a final one
at the end would suffice. Am I missing something important here?
Would iperf accept a patch to reduce the non-network overhead on
non-Linux OSes so that it performs as well as netperf? If so, I'll try
to submit one.
Thanks,
Drew Gallatin