Date: Wed, 17 Aug 2011 20:57:24 -0700
From: Stephen Hemminger
To: "Patrick J. LoPresti"
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] Not all the world's a WAN
Message-ID: <20110817205724.4b91e188@nehalam.ftrdhcpuser.net>
Organization: Vyatta

On Wed, 17 Aug 2011 18:26:00 -0700
"Patrick J. LoPresti" wrote:

> Hello, BufferBloat crusaders.
>
> Permit me briefly to describe my application. I have a rack full of
> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
> There are no routers or broader Internet connectivity. (At least,
> none that matters here.) Round trip "ping" times between systems are
> 100 microseconds or so.
>
> Some of the systems are "servers", some are "clients". Any single
> client may decide to slurp data from multiple servers. For example,
> the servers could be serving up a distributed file system, so when a
> client accesses a file striped across multiple servers, it tries to
> pull data from multiple servers simultaneously. (This is not my
> literal application, but it does represent the same access pattern.)
>
> The purpose of my cluster is to process data sets measured in hundreds
> of gigabytes, as fast as possible. So, for my application:
>
> - Speed = Throughput (latency is irrelevant)
> - TCP retransmissions are a disaster, not least because
> - 200ms is an eternity
>
> The problem I have is this: At 10 gigabits/second, it takes very
> little time to overrun even a sizable buffer in a 10GbE switch.
> Although each client does have a 10GbE connection, it is reading
> multiple sockets from multiple servers, so over short intervals the
> switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to
> client). If the servers do not throttle themselves -- say, because
> the TCP windows are large -- packets overrun the switch's buffer and
> get lost.

You need faster switches ;-)

> I have "fixed" this problem by using a switch with a _large_ buffer,
> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
> never gets very large. This ensures that the servers never send so
> much data that they overrun the switch. And it is actually working
> great; I am able to saturate all of my 10GbE links with zero
> retransmissions.
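
An aside on the knob quoted above, for readers who have not met it:
TCP_WINDOW_CLAMP is a Linux-specific, per-socket setsockopt() option that
bounds the receive window the client advertises, so its servers can never
have more than that much data in flight toward it. What follows is a
minimal sketch of the idea, not Patrick's actual code; the helper name and
the suggestion to size the clamp from the switch buffer divided by the
number of servers read concurrently are illustrative assumptions.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Cap the window this client advertises on one connection.
     * Call on each client socket after connect().
     * Assumed sizing: window_bytes ~= switch_buffer_bytes / nr_servers,
     * so the aggregate in-flight data fits in the switch's port buffer. */
    static int clamp_window(int fd, int window_bytes)
    {
        return setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                          &window_bytes, sizeof(window_bytes));
    }
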
You just papered over the problem. If the mean queue length over time
is greater than one, you will lose packets. This may be a case where
Ethernet flow control could help. It does have the problem of
head-of-line blocking when cascading switches, but if the switch is
just a pass-through it might help. (There is a short note on enabling
it at the end of this message.)

> I have not read all of the messages on this list, but I have read
> enough to make me a little nervous. And thus I send this message in
> the hope that, in your quest to slay the "buffer bloat" dragon, you do
> not completely forget applications like mine. I would hate to have to
> switch to Infiniband or whatever just because everyone decided that
> Web browsers are the only TCP/IP application in the world.

My view is that this is all about getting the defaults right for
average users. People with big servers will always end up tuning;
that's what they get paid for. Think of it as the difference between a
Formula 1 car and an average sedan. You want the sedan to just work,
and have all the traction control and rev limiters. For the F1 race
car, the driver knows best.
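
A closing note on the flow-control suggestion, as a sketch rather than
anything from the original thread: the mechanism in question is 802.3x
pause, which has to be enabled on both the NIC and the switch port; on
Linux the NIC side is normally toggled with ethtool's -A/--pause option,
assuming the driver supports it. On the timescale involved: one extra
server transmitting at the full 10 Gb/s adds roughly 1.25 GB of backlog
per second toward the client port, so a 1 MB switch buffer is exhausted
in about 0.8 ms. That is why, without clamped windows or pause frames,
the overrun happens almost instantly.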