Re: Iperf performance problems for 10GbE on non-Linux platforms
On Wed, 29 Aug 2007 11:42:15 -0400
"Andrew Gallatin" <gallatin --at-- gmail.com> wrote:
> Hi,
>
> I do *nix drivers for a popular 10GbE NIC. Many of our customers use
> iperf to benchmark their 10GbE networks. Unfortunately, our customers
> running non-Linux OSes tend to see much lower performance from iperf
> than from netperf. I spent a little time analyzing this on FreeBSD.
> Please note that we've seen similar behavior on Solaris.
>
> On a pair of very low-end 2.0GHz Athlon 64 X2s running FreeBSD 7.0, I
> see the following for netperf and iperf:
>
> % netperf242 -Hrome-xgbe -P0
> 65536 32768 32768 10.00 9817.08
>
> % iperf -c rome-xgbe
> ------------------------------------------------------------
> Client connecting to rome-xgbe, TCP port 5001
> TCP window size: 32.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.15 port 61387 connected with 192.168.1.16 port 5001
> [ 3] 0.0-10.0 sec 4.85 GBytes 4.17 Gbits/sec
>
> Iperf seems a lot slower than netperf, and is reporting less than 1/2
> the bandwidth reported by netperf. Why?
>
> To explore this, I used the FreeBSD ktrace / kdump system call
> tracing tools (somewhat similar to strace or truss) to look "under
> the hood" and see what the benchmarks are really doing. As you can
> see, netperf sits in a tight loop, doing nothing but writing to the
> network in (by default) 32KB (socket-buffer-sized) chunks:
>
> 1281 netperf CALL sendto(0x4,0x800a29000,0x8000,0,0,0)
> 1281 netperf RET sendto 32768/0x8000
> 1281 netperf CALL sendto(0x4,0x800a32000,0x8000,0,0,0)
> 1281 netperf RET sendto 32768/0x8000
> 1281 netperf CALL sendto(0x4,0x800a29000,0x8000,0,0,0)
> 1281 netperf RET sendto 32768/0x8000
> 1281 netperf CALL sendto(0x4,0x800a32000,0x8000,0,0,0)
> 1281 netperf RET sendto 32768/0x8000
> 1281 netperf CALL sendto(0x4,0x800a29000,0x8000,0,0,0)
> 1281 netperf RET sendto 32768/0x8000
> 1281 netperf CALL sendto(0x4,0x800a32000,0x8000,0,0,0)
>
> Iperf does a whole bunch of apparently expensive stuff, a minor
> fraction of which is actually writing to the network in 8KB chunks:
>
> 1284 iperf RET write 8192/0x2000
> 1284 iperf CALL kse_wakeup(0x800d110a8)
> 1284 iperf RET kse_wakeup 0
> 1284 iperf RET kse_release 0
> 1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
> 1284 iperf CALL gettimeofday(0x800f010b8,0)
> 1284 iperf RET fork 0
> 1284 iperf RET nanosleep 0
> 1284 iperf CALL kse_release(0x800d59f20)
> 1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
> 1284 iperf RET kse_release 0
> 1284 iperf RET nanosleep 0
> 1284 iperf CALL write(0x3,0x800f03000,0x2000)
> 1284 iperf RET fork 0
> 1284 iperf CALL kse_release(0x800d10f20)
> 1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
> 1284 iperf RET write 8192/0x2000
> 1284 iperf RET kse_release 0
> 1284 iperf CALL nanosleep(0x7fffff7fcf50,0)
> 1284 iperf RET nanosleep 0
> 1284 iperf RET fork 0
> 1284 iperf RET fork 0
> 1284 iperf CALL kse_release(0x800d59f20)
>
> Most of this extra work is apparently cheap in Linux (where I assume
> iperf was written), but it is quite expensive elsewhere. For example,
> most gettimeofday() calls do not even go into the kernel in recent
> Linux distros. On FreeBSD (and many other *nixes), they are very slow
> because they will actually read from the system timecounter hardware,
> usually involving a slow hardware (PIO) read operation. The fact
> that iperf uses a tiny 8KB default socket read/write size magnifies the
> impact of all these system calls made in the IO loop.
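
To put rough numbers on the point above, here is a minimal C sketch. The
function names (writes_needed, writes_per_second, gettimeofday_cost_ns) are
made up for illustration and are not part of iperf or netperf. It counts how
many write() calls a given chunk size implies and measures the average cost
of one gettimeofday() call on the local machine.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Number of write() calls needed to move total_bytes in chunk-sized pieces
 * (rounded up, so a partial final chunk still costs one call). */
unsigned long long writes_needed(unsigned long long total_bytes,
                                 unsigned long long chunk)
{
    assert(chunk > 0);
    return (total_bytes + chunk - 1) / chunk;
}

/* write() calls per second needed to sustain gbits_per_sec
 * (decimal gigabits) at a given chunk size in bytes. */
double writes_per_second(double gbits_per_sec, unsigned long long chunk)
{
    assert(chunk > 0);
    return gbits_per_sec * 1e9 / 8.0 / (double)chunk;
}

/* Average wall-clock cost of one gettimeofday() call, in nanoseconds.
 * The two bracketing calls add a small constant error that shrinks as
 * iterations grows. */
double gettimeofday_cost_ns(long iterations)
{
    struct timeval start, end, scratch;
    long i;

    assert(iterations > 0);
    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++)
        gettimeofday(&scratch, NULL);
    gettimeofday(&end, NULL);

    double elapsed_us = (double)(end.tv_sec - start.tv_sec) * 1e6
                      + (double)(end.tv_usec - start.tv_usec);
    return elapsed_us * 1000.0 / (double)iterations;
}
```

At a 10 Gbit/s line rate, writes_per_second() gives roughly 152,600 calls
per second for 8KB chunks against roughly 38,100 for 32KB chunks, so any
fixed per-write overhead is paid four times as often with the smaller size.
Multiplying that rate by the result of gettimeofday_cost_ns() gives a crude
estimate of the time spent on timestamps alone.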
>
> I took a look at the iperf source in hopes of understanding the
> purpose of having 2 threads running, and timestamping every single
> socket write, but I confess that I'm not sure I understand the
> purpose, at least for TCP. It *seems* like from what is reported to
> the user, simply taking an initial timestamp at the start of the test,
> and a final one at the end of the test would suffice. Am I missing
> something important here?
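
The two-timestamp approach suggested above could look roughly like the
following C sketch. It is not taken from the iperf or netperf sources; the
helper names (elapsed_seconds, gbits_per_sec, timed_send) are hypothetical,
and the loop ignores short writes and interval reporting for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/* Seconds between two gettimeofday() samples. */
double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/* Throughput in decimal gigabits per second. */
double gbits_per_sec(unsigned long long bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * 8.0 / seconds / 1e9;
}

/* Write `count` chunks of `len` bytes from buf to fd, taking one timestamp
 * before the first write and one after the last. Returns the number of
 * bytes written, or -1 on a write error; the elapsed time is stored in
 * *seconds. */
long long timed_send(int fd, const char *buf, size_t len, long count,
                     double *seconds)
{
    struct timeval start, end;
    long long total = 0;
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < count; i++) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        total += n;
    }
    gettimeofday(&end, NULL);

    *seconds = elapsed_seconds(&start, &end);
    return total;
}
```

A caller would pass a connected socket descriptor as fd and feed the return
value and *seconds to gbits_per_sec(). The only system calls inside the loop
are the writes themselves.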
>
> Would iperf accept a patch to reduce the non-network overhead on
> non-Linux OSes so that it performs as well as netperf? If so, I'll try
> to submit one.
>
> Thanks,
>
> Drew Gallatin
>
This probably fixes your problem. I sent it earlier to the iperf list.
--------------------------------------------------------------------
The latest version of the Linux kernel aggravates a pre-existing bug in the iperf
thread library. The library assumes that calling usleep(0) will cause the thread
to yield so that other threads can run. This has never been documented behavior
on Linux/Unix. The new high-resolution timer option in the kernel makes usleep(0)
a no-op, so the thread keeps running (until its quantum is exhausted).
Without this fix, iperf gets poor performance because the monitoring thread
may hog the CPU, keeping the sender/receiver threads from running.
The fix to iperf is easy: just use sched_yield() instead. The manual page for sched_yield
says to test for _POSIX_PRIORITY_SCHEDULING as a POSIX option.
--- compat/Thread.c.orig 2005-05-03 08:15:51.000000000 -0700
+++ compat/Thread.c 2007-06-04 10:41:11.000000000 -0700
@@ -405,9 +405,13 @@
void thread_rest ( void ) {
#if defined( HAVE_THREAD )
#if defined( HAVE_POSIX_THREAD )
- // TODO add checks for sched_yield or pthread_yield and call that
- // if available
+
+#if defined( _POSIX_PRIORITY_SCHEDULING )
+ sched_yield();
+#else
usleep( 0 );
+#endif
+
#else // Win32
SwitchToThread( );
#endif
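
For anyone who wants to try the same idea outside the iperf tree, here is a
minimal standalone sketch. The function name yield_cpu is made up for
illustration. Unlike the patch, it also checks the value of the macro,
because POSIX allows an option macro to be defined as -1 to mean
"not supported"; whether that stricter test is the right choice for iperf is
an assumption, not something verified against its build system.

```c
#include <unistd.h>   /* defines _POSIX_PRIORITY_SCHEDULING where applicable */

#if defined(_POSIX_PRIORITY_SCHEDULING) && ((_POSIX_PRIORITY_SCHEDULING + 0) >= 0)
#include <sched.h>
#define USE_SCHED_YIELD 1
#else
#define USE_SCHED_YIELD 0
#endif

/* Give up the CPU so other runnable threads can be scheduled.
 * Returns 1 if sched_yield() was used, 0 if the usleep(0) fallback was. */
int yield_cpu(void)
{
#if USE_SCHED_YIELD
    sched_yield();
    return 1;
#else
    usleep(0);   /* may not actually yield, as described above */
    return 0;
#endif
}
```

The return value is only there to make it easy to see which branch a given
platform compiled in.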
--
Stephen Hemminger <shemminger --at-- linux-foundation.org>