Re: Iperf performance problems for 10GbE on non-Linux platforms


On Wed, 29 Aug 2007 11:42:15 -0400
"Andrew Gallatin" <gallatin --at-- gmail.com> wrote:

> Hi,
> 
> I do *nix drivers for a popular 10GbE NIC.  Many of our customers use
> iperf to benchmark their 10GbE networks.  Unfortunately, our customers
> running non-Linux OSes tend to see much lower performance from iperf
> than from netperf.  I spent a little time analyzing this on FreeBSD.
> Please note that we've seen similar behavior on Solaris.
> 
> On a pair of very low end 2.0GHz athlon64 x2s running FreeBSD 7.0, I
> see the following for netperf, and iperf:
> 
> % netperf242 -Hrome-xgbe -P0
>  65536  32768  32768    10.00    9817.08
> 
> % iperf -c rome-xgbe
> ------------------------------------------------------------
> Client connecting to rome-xgbe, TCP port 5001
> TCP window size: 32.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.1.15 port 61387 connected with 192.168.1.16 port 5001
> [  3]  0.0-10.0 sec  4.85 GBytes  4.17 Gbits/sec
> 
> Iperf seems a lot slower than netperf, and is reporting less than 1/2
> the bandwidth reported by netperf.  Why?
> 
> To explore this, I used the FreeBSD ktrace / kdump system call
> tracing tools (somewhat similar to strace or truss) to look "under
> the hood" and see what the benchmarks are really doing.  As you can
> see, netperf sits in a tight loop, doing nothing but writing to the
> network in (by default) 32KB (socket-buffer-sized) chunks:
> 
>   1281 netperf  CALL  sendto(0x4,0x800a29000,0x8000,0,0,0)
>   1281 netperf  RET   sendto 32768/0x8000
>   1281 netperf  CALL  sendto(0x4,0x800a32000,0x8000,0,0,0)
>   1281 netperf  RET   sendto 32768/0x8000
>   1281 netperf  CALL  sendto(0x4,0x800a29000,0x8000,0,0,0)
>   1281 netperf  RET   sendto 32768/0x8000
>   1281 netperf  CALL  sendto(0x4,0x800a32000,0x8000,0,0,0)
>   1281 netperf  RET   sendto 32768/0x8000
>   1281 netperf  CALL  sendto(0x4,0x800a29000,0x8000,0,0,0)
>   1281 netperf  RET   sendto 32768/0x8000
>   1281 netperf  CALL  sendto(0x4,0x800a32000,0x8000,0,0,0)
> 
> Iperf does a whole bunch of apparently expensive stuff, a minor
> fraction of which is actually writing to the network in 8KB chunks:
> 
>   1284 iperf    RET   write 8192/0x2000
>   1284 iperf    CALL  kse_wakeup(0x800d110a8)
>   1284 iperf    RET   kse_wakeup 0
>   1284 iperf    RET   kse_release 0
>   1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
>   1284 iperf    CALL  gettimeofday(0x800f010b8,0)
>   1284 iperf    RET   fork 0
>   1284 iperf    RET   nanosleep 0
>   1284 iperf    CALL  kse_release(0x800d59f20)
>   1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
>   1284 iperf    RET   kse_release 0
>   1284 iperf    RET   nanosleep 0
>   1284 iperf    CALL  write(0x3,0x800f03000,0x2000)
>   1284 iperf    RET   fork 0
>   1284 iperf    CALL  kse_release(0x800d10f20)
>   1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
>   1284 iperf    RET   write 8192/0x2000
>   1284 iperf    RET   kse_release 0
>   1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
>   1284 iperf    RET   nanosleep 0
>   1284 iperf    RET   fork 0
>   1284 iperf    RET   fork 0
>   1284 iperf    CALL  kse_release(0x800d59f20)
> 
> Most of this extra work is apparently cheap in Linux (where I assume
> iperf was written), but it is quite expensive elsewhere.  For example,
> most gettimeofday() calls do not even go into the kernel in recent
> Linux distros.  On FreeBSD (and many other *nixes), they are very slow
> because they will actually read from the system timecounter hardware,
> usually involving a slow hardware (PIO) read operation.  The fact
> that iperf uses a tiny 8KB default socket read/write size magnifies the
> impact of all these system calls made in the IO loop.
> 
> I took a look at the iperf source in hopes of understanding the
> purpose of having 2 threads running, and timestamping every single
> socket write, but I confess that I'm not sure I understand the
> purpose, at least for TCP.  It *seems* like from what is reported to
> the user, simply taking an initial timestamp at the start of the test,
> and a final one at the end of the test would suffice.  Am I missing
> something important here?
> 
> Would iperf accept a patch to reduce the non-network overhead on
> non-Linux OSes so that it performs as well as netperf?  If so, I'll try
> to submit one.
> 
> Thanks,
> 
> Drew Gallatin
> 

This probably fixes your problem. I sent it earlier to the iperf list.
--------------------------------------------------------------------


The latest version of the Linux kernel aggravates a pre-existing iperf thread
library bug. The iperf thread library assumes that calling usleep(0) will cause
the thread to yield so that other threads will run. This has never been documented
behavior on Linux/Unix. The new high-resolution timer option in the kernel causes
usleep(0) to be a no-op, so the thread keeps running (until its quantum is exhausted).

Without this fix, iperf gets poor performance because the monitoring thread
may hog the CPU, keeping the sender/receiver threads from running.

The fix to iperf is easy: use sched_yield() instead. The manual page for sched_yield
says to test for _POSIX_PRIORITY_SCHEDULING as a POSIX option.

--- compat/Thread.c.orig	2005-05-03 08:15:51.000000000 -0700
+++ compat/Thread.c	2007-06-04 10:41:11.000000000 -0700
@@ -405,9 +405,13 @@
 void thread_rest ( void ) {
 #if defined( HAVE_THREAD )
 #if defined( HAVE_POSIX_THREAD )
-    // TODO add checks for sched_yield or pthread_yield and call that
-    // if available
+
+#if defined( _POSIX_PRIORITY_SCHEDULING )
+    sched_yield();
+#else
     usleep( 0 );
+#endif
+
 #else // Win32
     SwitchToThread( );
 #endif



-- 
Stephen Hemminger <shemminger --at-- linux-foundation.org>


