From: "Richard Scheffenegger"
To: "Patrick J. LoPresti",
Date: Tue, 23 Aug 2011 09:44:15 +0200
Subject: Re: [Bloat] Not all the world's a WAN

Your problem is called incast, and there is a vast literature on how to
alleviate it to varying degrees.

There are simple approaches with limited benefits, such as reducing the
retransmission timeout (RTO), using high-resolution TCP timers, and
introducing short random delays into the responses. None of these will
give optimal bandwidth, though.

If you have a cheap 10G switch built around the Broadcom chipsets, rather
than the expensive gear from another, more well-known vendor, you can
perhaps deploy DCTCP, yielding up to 98% bandwidth with optimal latency
and near-zero loss, even when the setup is prone to severe incast...

Rgds,
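A minimal sketch of selecting DCTCP per socket on a Linux end host,
assuming a kernel that ships a "dctcp" congestion-control module and a
switch configured to ECN-mark at a shallow queue threshold; both are
assumptions, not details confirmed in this thread:

/* dctcp_socket.c -- hedged sketch: ask for DCTCP on one socket.
 * Assumes the kernel provides a "dctcp" congestion-control module and
 * that ECN is enabled on both hosts and marked by the switch. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_CONGESTION
#define TCP_CONGESTION 13   /* per-socket congestion-control selection */
#endif

int make_dctcp_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    /* Request the DCTCP congestion-control algorithm on this socket.
     * This fails (ENOENT) if the kernel has no "dctcp" module loaded. */
    const char *cc = "dctcp";
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
        perror("setsockopt(TCP_CONGESTION, \"dctcp\")");

    return fd;
}

The more common route is system-wide selection via the
net.ipv4.tcp_congestion_control sysctl; the per-socket option merely
confines the change to the storage traffic.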
----- Original Message -----
From: "Patrick J. LoPresti"
To:
Sent: Thursday, August 18, 2011 3:26 AM
Subject: [Bloat] Not all the world's a WAN

> Hello, BufferBloat crusaders.
>
> Permit me briefly to describe my application. I have a rack full of
> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
> There are no routers or broader Internet connectivity. (At least,
> none that matters here.) Round trip "ping" times between systems are
> 100 microseconds or so.
>
> Some of the systems are "servers", some are "clients". Any single
> client may decide to slurp data from multiple servers. For example,
> the servers could be serving up a distributed file system, so when a
> client accesses a file striped across multiple servers, it tries to
> pull data from multiple servers simultaneously. (This is not my
> literal application, but it does represent the same access pattern.)
>
> The purpose of my cluster is to process data sets measured in hundreds
> of gigabytes, as fast as possible. So, for my application:
>
> - Speed = Throughput (latency is irrelevant)
> - TCP retransmissions are a disaster, not least because
> - 200ms is an eternity
>
> The problem I have is this: At 10 gigabits/second, it takes very
> little time to overrun even a sizable buffer in a 10GbE switch.
> Although each client does have a 10GbE connection, it is reading
> multiple sockets from multiple servers, so over short intervals the
> switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to
> client). If the servers do not throttle themselves -- say, because
> the TCP windows are large -- packets overrun the switch's buffer and
> get lost.
>
> I have "fixed" this problem by using a switch with a _large_ buffer,
> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
> never gets very large. This ensures that the servers never send so
> much data that they overrun the switch. And it is actually working
> great; I am able to saturate all of my 10GbE links with zero
> retransmissions.
>
> I have not read all of the messages on this list, but I have read
> enough to make me a little nervous. And thus I send this message in
> the hope that, in your quest to slay the "buffer bloat" dragon, you do
> not completely forget applications like mine. I would hate to have to
> switch to Infiniband or whatever just because everyone decided that
> Web browsers are the only TCP/IP application in the world.
>
> Thanks for reading.
>
> - Pat
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
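(For reference, the TCP_WINDOW_CLAMP workaround Pat describes above is a
standard per-socket option on Linux, set on the receiving client. A
minimal sketch follows; the 64 KB clamp is purely illustrative and would
have to be sized to the switch's shared buffer and the number of servers
streaming at once.)

/* clamp_rcv_window.c -- hedged sketch of the clamp described above:
 * bound the window the client will ever advertise, so the servers can
 * never put enough data in flight to overrun the switch buffer. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Upper-bound the advertised receive window for this socket.  The
     * value below is illustrative only, not a recommendation. */
    int clamp = 64 * 1024;
    if (setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                   &clamp, sizeof(clamp)) < 0)
        perror("setsockopt(TCP_WINDOW_CLAMP)");

    /* ... connect() and read() as usual ... */
    return 0;
}

Clamping the advertised window caps the per-connection data in flight at
the receiver, which is why it protects the switch even though the
congestion point is the switch's egress port rather than either host.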