From: Neil Davies
Date: Thu, 18 Aug 2011 08:45:29 +0100
To: Stephen Hemminger
Cc: "Patrick J. LoPresti", bloat@lists.bufferbloat.net
Subject: Re: [Bloat] Not all the world's a WAN

Stephen

I disagree with you - Patrick has solved his problem.

As for papering over the cracks - that is just pure provocation - what is wrong with more than *one* packet in the buffer?

Any finite queueing system has two degrees of freedom - there are three variables in play: loading factor (the ratio of arrivals to departures, along with the distribution of each); delay (and its distribution); and loss (and its distribution).

And in that system it is a trade. Patrick's trade is to constrain the arrival distribution/pattern so as to keep the total quality attenuation (delay and loss) at his buffer points within an acceptable bound for his application. He's made a rational choice for his requirements - he's bounded the induced quality attenuation.

To solve the 'general' case you need to solve the general 'induced quality attenuation' problem - there is the nub of the issue.

Neil

On 18 Aug 2011, at 04:57, Stephen Hemminger wrote:

> On Wed, 17 Aug 2011 18:26:00 -0700
> "Patrick J. LoPresti" wrote:
> 
>> Hello, BufferBloat crusaders.
>> 
>> Permit me briefly to describe my application. I have a rack full of
>> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
>> There are no routers or broader Internet connectivity. (At least,
>> none that matters here.) Round-trip "ping" times between systems are
>> 100 microseconds or so.
>> 
>> Some of the systems are "servers", some are "clients". Any single
>> client may decide to slurp data from multiple servers. For example,
>> the servers could be serving up a distributed file system, so when a
>> client accesses a file striped across multiple servers, it tries to
>> pull data from multiple servers simultaneously. (This is not my
>> literal application, but it does represent the same access pattern.)
>> 
>> The purpose of my cluster is to process data sets measured in hundreds
>> of gigabytes, as fast as possible.
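
With the 10 Gb/s links and roughly 100 microsecond round trips described above, the bandwidth-delay product - the largest window a single flow can usefully fill - works out to about 125 KB. A minimal sketch in C of that arithmetic (the 10 Gb/s and 100 us figures are from the message; everything else is illustrative):

/* bdp.c - bandwidth-delay product for the cluster described above.
   Build: cc -o bdp bdp.c */
#include <stdio.h>

int main(void)
{
    double link_bps = 10e9;    /* 10GbE line rate */
    double rtt_s    = 100e-6;  /* ~100 us round-trip "ping" time */

    double bdp_bytes = link_bps * rtt_s / 8.0;
    printf("BDP = %.0f bytes (~%.0f KB)\n", bdp_bytes, bdp_bytes / 1e3);
    /* ~125 KB: a per-flow window much beyond this adds no
       throughput; the excess can only sit in the switch buffer. */
    return 0;
}

Any window beyond that figure buys no speed on this fabric; it only deepens the queue at the congested switch port, which is exactly the overrun described next.
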
>> So, for my application:
>> 
>> - Speed = Throughput (latency is irrelevant)
>> - TCP retransmissions are a disaster, not least because
>> - 200ms is an eternity
>> 
>> The problem I have is this: At 10 gigabits/second, it takes very
>> little time to overrun even a sizable buffer in a 10GbE switch.
>> Although each client does have a 10GbE connection, it is reading
>> multiple sockets from multiple servers, so over short intervals the
>> switch's aggregate incoming bandwidth (multiple 10GbE links from
>> servers) is larger than its outgoing bandwidth (single 10GbE link to
>> client). If the servers do not throttle themselves -- say, because
>> the TCP windows are large -- packets overrun the switch's buffer and
>> get lost.
> 
> You need faster switches ;-)
> 
>> I have "fixed" this problem by using a switch with a _large_ buffer,
>> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
>> never gets very large. This ensures that the servers never send so
>> much data that they overrun the switch. And it is actually working
>> great; I am able to saturate all of my 10GbE links with zero
>> retransmissions.
> 
> You just papered over the problem. If the mean queue length over
> time is greater than one, you will lose packets. This may be a case
> where Ethernet flow control might help. It does have the problem
> of head-of-line blocking when cascading switches, but if the switch
> is just a pass-through it might help.
> 
>> I have not read all of the messages on this list, but I have read
>> enough to make me a little nervous. And thus I send this message in
>> the hope that, in your quest to slay the "buffer bloat" dragon, you do
>> not completely forget applications like mine. I would hate to have to
>> switch to InfiniBand or whatever just because everyone decided that
>> Web browsers are the only TCP/IP application in the world.
> 
> My view is that this is all about getting the defaults right for average
> users. People with big servers will always end up tuning; that's what
> they get paid for. Think of it as the difference between a Formula 1
> car and an average sedan. You want the sedan to just work, and
> have all the traction control and rev limiters. For the F1 race
> car, the driver knows best.
> 
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
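
Neil's "two degrees of freedom" point can be made concrete on the simplest finite queue, M/M/1/K. The sketch below is illustrative only - it assumes Poisson arrivals (a synchronized striped read is burstier than this) and measures delay in units of the mean service time. Holding the loading factor fixed and varying only the buffer size K shows loss and delay moving in opposite directions: you choose where the quality attenuation lands, not whether it exists.

/* mm1k.c - loss/delay trade in an M/M/1/K queue at fixed load.
   Illustrative: Poisson arrivals, exponential service, buffer
   holding K packets. Build: cc -o mm1k mm1k.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double rho = 0.95;                 /* loading factor: arrival/departure */
    for (int K = 2; K <= 64; K *= 2) {
        double denom = 1.0 - pow(rho, K + 1);
        double loss  = (1.0 - rho) * pow(rho, K) / denom;  /* P(buffer full) */
        double inq   = rho / (1.0 - rho)
                     - (K + 1) * pow(rho, K + 1) / denom;  /* mean occupancy */
        double delay = inq / (rho * (1.0 - loss));         /* mean sojourn,
                                                              service times */
        printf("K=%2d  loss=%6.4f  delay=%6.2f\n", K, loss, delay);
    }
    /* Bigger K trades loss for delay; smaller K, the reverse. At a
       fixed loading factor you move the attenuation, never remove it. */
    return 0;
}

Patrick's clamp itself is a single setsockopt on Linux. A minimal sketch, assuming a client-side TCP socket: TCP_WINDOW_CLAMP is the real option name from <netinet/tcp.h>, but the 128 KB value is only illustrative - sized just above the ~125 KB bandwidth-delay product computed earlier - and is not his actual setting.

/* clamp.c - cap the advertised TCP receive window, bounding how
   much data each server can have in flight toward this client. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int clamp = 128 * 1024;   /* bytes; illustrative, roughly the BDP */
    if (setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                   &clamp, sizeof clamp) < 0) {
        perror("setsockopt(TCP_WINDOW_CLAMP)");
        return 1;
    }
    /* connect() and read() as usual; each server now stops sending
       once 'clamp' bytes are unacknowledged, so N servers can queue
       at most about N * clamp bytes at the congested switch port. */
    printf("receive window clamped to %d bytes\n", clamp);
    return 0;
}

With the clamp sized near the bandwidth-delay product, the switch buffer only has to absorb the short-term excess of N clamped windows - which is why a large-buffer switch plus clamped clients can saturate the links with zero retransmissions.
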