Re: Iperf performance problems for 10GbE on non-Linux platforms


Stephen, Andrew, and iperf-users,

I have recently gained access to the SourceForge Iperf project and am planning to merge patches such as the thread patch that Stephen provided. I would also be more than happy to incorporate any further changes proposed by Andrew.

I haven't had time to digest all of the emails from the past several months, but I am aware of several issues with Iperf and am eager to get the patches applied and roll a new release. I am planning to spend some time in the next few days going through the archives looking for problem reports and patches.

I am actively looking for patches, problem reports, and enhancement requests. Feel free to resend anything you feel is important that has already gone to the list; however, I would ask that you send such items to me directly to avoid redundant noise on the list.

Thanks,

Jon

PS: I haven't posted to this list lately, so by way of introduction: I was involved with maintaining Iperf in the past, but other responsibilities have kept me from it for some time now. I used to be located at NCSA, just down the hall from the other Iperf developers. I've since moved away, but I am interested in seeing Iperf continue to be a useful tool.


Stephen Hemminger wrote:
On Wed, 29 Aug 2007 11:42:15 -0400
"Andrew Gallatin" <gallatin --at-- gmail.com> wrote:

Hi,

I do *nix drivers for a popular 10GbE NIC.  Many of our customers use
iperf to benchmark their 10GbE networks.  Unfortunately, our customers
running non-Linux OSes tend to see much lower performance from iperf
than from netperf.  I spent a little time analyzing this on FreeBSD.
Please note that we've seen similar behavior on Solaris.

On a pair of very low end 2.0GHz athlon64 x2s running FreeBSD 7.0, I
see the following for netperf, and iperf:

% netperf242 -Hrome-xgbe -P0
 65536  32768  32768    10.00    9817.08

% iperf -c rome-xgbe
------------------------------------------------------------
Client connecting to rome-xgbe, TCP port 5001
TCP window size: 32.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.15 port 61387 connected with 192.168.1.16 port 5001
[  3]  0.0-10.0 sec  4.85 GBytes  4.17 Gbits/sec

Iperf seems a lot slower than netperf, and is reporting less than 1/2
the bandwidth reported by netperf.  Why?

To explore this, I used the FreeBSD ktrace / kdump system call
tracing tools (somewhat similar to strace or truss) to look "under
the hood" and see what the benchmarks are really doing.  As you can
see, netperf sits in a tight loop, doing nothing but writing to the
network in (by default) 32KB (socketbuffer sized) chunks:

  1281 netperf  CALL  sendto(0x4,0x800a29000,0x8000,0,0,0)
  1281 netperf  RET   sendto 32768/0x8000
  1281 netperf  CALL  sendto(0x4,0x800a32000,0x8000,0,0,0)
  1281 netperf  RET   sendto 32768/0x8000
  1281 netperf  CALL  sendto(0x4,0x800a29000,0x8000,0,0,0)
  1281 netperf  RET   sendto 32768/0x8000
  1281 netperf  CALL  sendto(0x4,0x800a32000,0x8000,0,0,0)
  1281 netperf  RET   sendto 32768/0x8000
  1281 netperf  CALL  sendto(0x4,0x800a29000,0x8000,0,0,0)
  1281 netperf  RET   sendto 32768/0x8000
  1281 netperf  CALL  sendto(0x4,0x800a32000,0x8000,0,0,0)

Iperf does a whole bunch of apparently expensive stuff, a minor
fraction of which is actually writing to the network in 8KB chunks:

  1284 iperf    RET   write 8192/0x2000
  1284 iperf    CALL  kse_wakeup(0x800d110a8)
  1284 iperf    RET   kse_wakeup 0
  1284 iperf    RET   kse_release 0
  1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
  1284 iperf    CALL  gettimeofday(0x800f010b8,0)
  1284 iperf    RET   fork 0
  1284 iperf    RET   nanosleep 0
  1284 iperf    CALL  kse_release(0x800d59f20)
  1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
  1284 iperf    RET   kse_release 0
  1284 iperf    RET   nanosleep 0
  1284 iperf    CALL  write(0x3,0x800f03000,0x2000)
  1284 iperf    RET   fork 0
  1284 iperf    CALL  kse_release(0x800d10f20)
  1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
  1284 iperf    RET   write 8192/0x2000
  1284 iperf    RET   kse_release 0
  1284 iperf    CALL  nanosleep(0x7fffff7fcf50,0)
  1284 iperf    RET   nanosleep 0
  1284 iperf    RET   fork 0
  1284 iperf    RET   fork 0
  1284 iperf    CALL  kse_release(0x800d59f20)

Most of this extra work is apparently cheap in Linux (where I assume
iperf was written), but it is quite expensive elsewhere.  For example,
most gettimeofday() calls do not even go into the kernel in recent
Linux distros.  On FreeBSD (and many other *nixes), they are very slow
because they will actually read from the system timecounter hardware,
usually involving a slow hardware (PIO) read operation.  The fact
that iperf uses a tiny 8KB default socket read/write size magnifies the
impact of all these system calls made in the IO loop.
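To put a rough number on the write-size point, the sketch below computes how many write calls per second are needed to sustain a given line rate. The function name `calls_per_second` is hypothetical and is not part of iperf or netperf. At 10 Gbit/s it gives about 152,588 calls per second for 8KB writes versus about 38,147 for 32KB writes, so any fixed per-write overhead is paid four times as often with the smaller size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not iperf or netperf code): number of write()
 * or sendto() calls per second needed to sustain `bits_per_second`
 * when every call transfers exactly `bytes_per_call` bytes.
 */
double calls_per_second(double bits_per_second, size_t bytes_per_call)
{
    assert(bytes_per_call > 0);
    /* Convert bits to bytes, then divide by the size of each write. */
    return bits_per_second / 8.0 / (double)bytes_per_call;
}
```

Each extra system call made per write (timestamps, thread wakeups) is multiplied by this call rate, which is why the overhead matters far more at 10GbE than at lower speeds.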

I took a look at the iperf source in hopes of understanding the
purpose of having 2 threads running, and timestamping every single
socket write, but I confess that I'm not sure I understand the
purpose, at least for TCP.  It *seems* like from what is reported to
the user, simply taking an initial timestamp at the start of the test,
and a final one at the end of the test would suffice.  Am I missing
something important here?
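A minimal sketch of the two-timestamp approach described above, assuming a blocking file descriptor and a fixed number of writes. The name `timed_write_loop` and its parameters are hypothetical; this is not how iperf is structured, and it omits the periodic interval reports that iperf can produce.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not iperf code): write `count` buffers of `len`
 * bytes to `fd`, reading the clock only once before and once after the
 * loop. Stores the number of bytes written in *total and returns the
 * elapsed time in seconds, or -1.0 if a write fails.
 */
double timed_write_loop(int fd, const char *buf, size_t len,
                        unsigned long count, unsigned long long *total)
{
    struct timeval start, end;
    unsigned long i;

    assert(buf != NULL && total != NULL);
    *total = 0;

    gettimeofday(&start, NULL);            /* the only timestamp before the loop */
    for (i = 0; i < count; i++) {
        ssize_t n = write(fd, buf, len);   /* one system call per iteration */
        if (n < 0)
            return -1.0;
        *total += (unsigned long long)n;   /* count short writes correctly */
    }
    gettimeofday(&end, NULL);              /* the only timestamp after the loop */

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1e6;
}
```

Bandwidth in bits per second is then `8.0 * total / elapsed`, computed once at the end of the test.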

Would iperf accept a patch to reduce the non-network overhead on
non-Linux OSes so that it performs as well as netperf?  If so, I'll try
to submit one.

Thanks,

Drew Gallatin


This probably fixes your problem. I sent it earlier to the iperf list.
--------------------------------------------------------------------


The latest version of the Linux kernel aggravates a pre-existing iperf thread library bug. The iperf thread library assumes that calling usleep(0) will cause the thread to yield so that other threads can run. This has never been documented behavior on Linux/Unix. The new high-resolution timer option in the kernel makes usleep(0) a no-op, so the thread keeps running (until its quantum is exhausted).

Without this fix, iperf will get poor performance because the monitoring thread
may hog the cpu, keeping the sender/receiver threads from running.

The fix to iperf is easy: just use sched_yield() instead. The manual page for sched_yield
says to test for _POSIX_PRIORITY_SCHEDULING as a POSIX option.

--- compat/Thread.c.orig	2005-05-03 08:15:51.000000000 -0700
+++ compat/Thread.c	2007-06-04 10:41:11.000000000 -0700
@@ -405,9 +405,13 @@
 void thread_rest ( void ) {
 #if defined( HAVE_THREAD )
 #if defined( HAVE_POSIX_THREAD )
-    // TODO add checks for sched_yield or pthread_yield and call that
-    // if available
+
+#if defined( _POSIX_PRIORITY_SCHEDULING )
+    sched_yield();
+#else
     usleep( 0 );
+#endif
+
 #else // Win32
     SwitchToThread( );
 #endif
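
For trying the same idea outside the iperf tree, here is a standalone sketch. The function name `yield_cpu` is hypothetical. It differs from the patch in one respect, which is an assumption on my part rather than something the patch does: it also checks that the macro is greater than zero, because POSIX allows _POSIX_PRIORITY_SCHEDULING to be defined as -1 on systems that do not support the option.

```c
#include <sched.h>
#include <unistd.h>   /* defines _POSIX_PRIORITY_SCHEDULING where supported */

/*
 * Hypothetical helper (not iperf code): give up the CPU so that another
 * runnable thread can be scheduled. Returns 0 on success, -1 on error.
 */
int yield_cpu(void)
{
#if defined(_POSIX_PRIORITY_SCHEDULING) && (_POSIX_PRIORITY_SCHEDULING > 0)
    return sched_yield();
#else
    /* Fallback mirroring the patch; this may not yield at all. */
    return usleep(0);
#endif
}
```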





