From: Rick Jones
Reply-To: rick.jones2@hp.com
To: Denton Gentry
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] Burst Loss
Date: Fri, 13 May 2011 13:47:48 -0700
Message-ID: <1305319668.8149.673.camel@tardy>

On Fri, 2011-05-13 at 12:32 -0700, Denton Gentry wrote:
> NICs seem to be responding by hashing incoming 5-tuples to
> distribute flows across cores.

When I first kicked netperf out onto the Internet, back when 10 Megabits/second was really fast, people started asking me "Why can't I get link-rate on a single-stream netperf test?" The answer was "Because you don't have enough CPU horsepower, but perhaps the next processor will."

Then when 100BT happened, people asked me "Why can't I get link-rate on a single-stream netperf test?" And the answer was the same.

Then when 1 GbE happened, people asked me "Why can't I get link-rate on a single-stream netperf test?" And the answer was the same, tweaked slightly to suggest they get a NIC with CKO (checksum offload).

Then when 10 GbE happened, people asked me "Why can't I get link-rate on a single-stream netperf test?" And the answer was "Because you don't have enough CPU; try a NIC with TSO and LRO."

Based on the past 20 years, I am quite confident that when 40 and 100 GbE NICs appear for end systems, I will again be asked "Why can't I get link-rate on a single-stream netperf test?"

While the world is indeed not just unidirectional bulk flows (if it were, netperf and its request/response tests would never have come into being to replace ttcp), even after decades it is still something people seem to expect. There must be some value to high-performance unidirectional transfer.

Only now the cores aren't going to have gotten any faster, and spreading incoming 5-tuples across cores isn't going to help a single stream. So the "answer" will likely end up being to add still more complexity - either in the applications, to use multiple streams, or by pushing the full stack into the NIC. Adde parvum parvo magnus acervus erit ("add a little to a little and there will be a great heap"). But, by Metcalfe, we will have preserved the sacrosanct Ethernet maximum frame size.
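As an aside, here is a minimal sketch (plain Python, purely illustrative - real NICs use a Toeplitz hash over the 5-tuple with a programmed key, not anything like this) of the flow spreading Denton describes: hash (src addr, dst addr, protocol, src port, dst port), use the result to pick a receive queue/core, and every packet of a given flow lands on the same core - which is exactly why it does nothing for a single stream.

import hashlib

NUM_RX_QUEUES = 8  # pretend the NIC exposes 8 receive queues, one per core

def rx_queue_for(src_ip, dst_ip, proto, src_port, dst_port):
    # Hash the 5-tuple; any stable hash shows the idea.
    key = "%s|%s|%s|%d|%d" % (src_ip, dst_ip, proto, src_port, dst_port)
    return hashlib.sha1(key.encode()).digest()[0] % NUM_RX_QUEUES

# Every packet of this one flow maps to the same queue/core...
print(rx_queue_for("10.0.0.1", "10.0.0.2", "tcp", 33000, 5001))
# ...while a second flow (different source port) may land elsewhere.
print(rx_queue_for("10.0.0.1", "10.0.0.2", "tcp", 33001, 5001))

Many flows spread nicely across all the cores; one bulk flow still runs on exactly one of them.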
Crossing emails a bit, Kevin wrote about the 6X increase in latency. It is a 6X increase in *potential* latency, *if* someone actually enables the larger MTU. And yes, the "We want to be on the Top 500 list" types do worry about latency, and some, perhaps even many, of them use Ethernet instead of InfiniBand (which does, BTW, offer at least the illusion of a quite large MTU to IP), but a sanctioned way to run a larger MTU over Ethernet does not *force* them to use it if they want to make the explicit latency-vs-overhead trade-off. As it stands, those who do not worry about microseconds or nanoseconds are forced off the standard in the name of preserving something for those who do.

(And with 100 GbE it would be nanosecond differences we would be talking about - the 12 and 72 usec of 1 GbE become 120 and 720 nanoseconds at 100 GbE - the realm of a processor cache miss, because memory latency hasn't gotten much better and likely won't. The arithmetic is sketched in the P.S. below.)

And are transaction or SAN latencies actually measured in microseconds or nanoseconds? If "transactions" are OLTP, those things are measured in milliseconds and even whole seconds (TPC), and spinning rust (though not SSDs) still has latencies measured in milliseconds.

rick jones

> > And while it isn't the strongest point in the world, one might even
> > argue that the need to use TSO/LRO to achieve performance hinders
> > new transport protocol adoption - the presence of NIC offloads for
> > only TCP (or UDP) leaves a new transport protocol (perhaps SCTP) at
> > a disadvantage.
>
> True, and even UDP seems to be often blocked for anything other than
> DNS.
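P.S. For anyone checking the frame-time arithmetic above, a minimal sketch, assuming (as the numbers imply) that the 12 and 72 usec figures are the wire serialization times of a 1500-byte and a 9000-byte frame at 1 GbE:

# Wire time for one frame, ignoring preamble and the inter-frame gap
# (they add a little but don't change the picture).
def frame_time_usec(frame_bytes, link_bits_per_sec):
    return frame_bytes * 8 / link_bits_per_sec * 1e6

for rate_name, rate in (("1 GbE", 1e9), ("100 GbE", 100e9)):
    for frame in (1500, 9000):
        print("%-8s %4d-byte frame: %7.2f usec"
              % (rate_name, frame, frame_time_usec(frame, rate)))

# Prints 12 and 72 usec at 1 GbE, 0.12 and 0.72 usec (120/720 ns) at 100 GbE.

The 6X ratio is the same either way; the link rate only scales both numbers down.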