From: Rick Jones
Reply-To: rick.jones2@hp.com
To: Denton Gentry
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] Burst Loss
Date: Fri, 13 May 2011 13:47:48 -0700
Message-ID: <1305319668.8149.673.camel@tardy>

On Fri, 2011-05-13 at 12:32 -0700, Denton Gentry wrote:
> NICs seem to be responding by hashing incoming 5-tuples to
> distribute flows across cores.

When I first kicked netperf out onto the Internet, back when 10 Megabits/second was really fast, people started asking me "Why can't I get link-rate on a single-stream netperf test?" The answer was "Because you don't have enough CPU horsepower, but perhaps the next processor will."

Then when 100BT happened, people asked me "Why can't I get link-rate on a single-stream netperf test?" And the answer was the same.

Then when 1 GbE happened, people asked me "Why can't I get link-rate on a single-stream netperf test?" And the answer was the same, tweaked slightly to suggest they get a NIC with CKO (checksum offload).

Then when 10 GbE happened, people asked me "Why can't I get link-rate on a single-stream netperf test?" And the answer was "Because you don't have enough CPU; try a NIC with TSO and LRO."

Based on the past 20 years, I am quite confident that when 40 and 100 GbE NICs appear for end systems, I will again be asked "Why can't I get link-rate on a single-stream netperf test?"

While the world is indeed not just unidirectional bulk flows (if it were, netperf and its request/response tests would never have come into being to replace ttcp), even after decades it is still something people seem to expect. There must be some value to high-performance unidirectional transfer.

Only now the cores aren't going to have gotten any faster, and spreading incoming 5-tuples across cores isn't going to help a single stream. So the "answer" will likely end up being to add still more complexity - either in the applications, to use multiple streams, or by pushing the full stack into the NIC. Adde parvum parvo magnus acervus erit ("add a little to a little and there will be a great heap"). But, by Metcalfe, we will have preserved the sacrosanct Ethernet maximum frame size.
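As an aside, here is a minimal sketch (plain Python, purely illustrative - real NICs use a Toeplitz hash over the 5-tuple with a programmed key, not anything like this) of the flow spreading Denton describes: hash (src addr, dst addr, protocol, src port, dst port), use the result to pick a receive queue/core, and every packet of a given flow lands on the same core - which is exactly why it does nothing for a single stream.

import hashlib

NUM_RX_QUEUES = 8  # pretend the NIC exposes 8 receive queues, one per core

def rx_queue_for(src_ip, dst_ip, proto, src_port, dst_port):
    # Hash the 5-tuple; any stable hash shows the idea.
    key = "%s|%s|%s|%d|%d" % (src_ip, dst_ip, proto, src_port, dst_port)
    return hashlib.sha1(key.encode()).digest()[0] % NUM_RX_QUEUES

# Every packet of this one flow maps to the same queue/core...
print(rx_queue_for("10.0.0.1", "10.0.0.2", "tcp", 33000, 5001))
# ...while a second flow (different source port) may land elsewhere.
print(rx_queue_for("10.0.0.1", "10.0.0.2", "tcp", 33001, 5001))

Many flows spread nicely across all the cores; one bulk flow still runs on exactly one of them.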
Crossing emails a bit, Kevin wrote about the 6X increase in latency. It is a 6X increase in *potential* latency, *if* someone actually enables the larger MTU. And yes, the "We want to be on the Top 500 list" types do worry about latency, and some, perhaps even many, of them use Ethernet instead of InfiniBand (which does, BTW, offer at least the illusion of a quite large MTU to IP), but a sanctioned way to run a larger MTU over Ethernet does not *force* them to use it if they want to make the explicit latency-vs-overhead trade-off. As it stands, those who do not worry about microseconds or nanoseconds are forced off the standard in the name of preserving something for those who do.

(And with 100 GbE it would be nanosecond differences we would be talking about - the 12 and 72 usec of 1 GbE become 120 and 720 nanoseconds at 100 GbE - the realm of a processor cache miss, because memory latency hasn't gotten much better and likely won't. The arithmetic is sketched in the P.S. below.)

And are transaction or SAN latencies actually measured in microseconds or nanoseconds? If "transactions" are OLTP, those things are measured in milliseconds and even whole seconds (TPC), and spinning rust (though not SSDs) still has latencies measured in milliseconds.

rick jones

> > And while it isn't the strongest point in the world, one might even
> > argue that the need to use TSO/LRO to achieve performance hinders
> > new transport protocol adoption - the presence of NIC offloads for
> > only TCP (or UDP) leaves a new transport protocol (perhaps SCTP) at
> > a disadvantage.
>
> True, and even UDP seems to be often blocked for anything other than
> DNS.
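P.S. For anyone checking the frame-time arithmetic above, a minimal sketch, assuming (as the numbers imply) that the 12 and 72 usec figures are the wire serialization times of a 1500-byte and a 9000-byte frame at 1 GbE:

# Wire time for one frame, ignoring preamble and the inter-frame gap
# (they add a little but don't change the picture).
def frame_time_usec(frame_bytes, link_bits_per_sec):
    return frame_bytes * 8 / link_bits_per_sec * 1e6

for rate_name, rate in (("1 GbE", 1e9), ("100 GbE", 100e9)):
    for frame in (1500, 9000):
        print("%-8s %4d-byte frame: %7.2f usec"
              % (rate_name, frame, frame_time_usec(frame, rate)))

# Prints 12 and 72 usec at 1 GbE, 0.12 and 0.72 usec (120/720 ns) at 100 GbE.

The 6X ratio is the same either way; the link rate only scales both numbers down.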