Date: Wed, 17 Aug 2011 20:57:24 -0700
From: Stephen Hemminger
To: "Patrick J. LoPresti"
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] Not all the world's a WAN
Message-ID: <20110817205724.4b91e188@nehalam.ftrdhcpuser.net>
Organization: Vyatta

On Wed, 17 Aug 2011 18:26:00 -0700
"Patrick J. LoPresti" wrote:

> Hello, BufferBloat crusaders.
>
> Permit me briefly to describe my application. I have a rack full of
> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
> There are no routers or broader Internet connectivity. (At least,
> none that matters here.) Round trip "ping" times between systems are
> 100 microseconds or so.
>
> Some of the systems are "servers", some are "clients". Any single
> client may decide to slurp data from multiple servers. For example,
> the servers could be serving up a distributed file system, so when a
> client accesses a file striped across multiple servers, it tries to
> pull data from multiple servers simultaneously. (This is not my
> literal application, but it does represent the same access pattern.)
>
> The purpose of my cluster is to process data sets measured in hundreds
> of gigabytes, as fast as possible. So, for my application:
>
> - Speed = Throughput (latency is irrelevant)
> - TCP retransmissions are a disaster, not least because
> - 200ms is an eternity
>
> The problem I have is this: At 10 gigabits/second, it takes very
> little time to overrun even a sizable buffer in a 10GbE switch.
> Although each client does have a 10GbE connection, it is reading
> multiple sockets from multiple servers, so over short intervals the
> switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to
> client). If the servers do not throttle themselves -- say, because
> the TCP windows are large -- packets overrun the switch's buffer and
> get lost.

You need faster switches ;-)

> I have "fixed" this problem by using a switch with a _large_ buffer,
> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
> never gets very large. This ensures that the servers never send so
> much data that they overrun the switch. And it is actually working
> great; I am able to saturate all of my 10GbE links with zero
> retransmissions.
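
An aside on the knob quoted above, for readers who have not met it:
TCP_WINDOW_CLAMP is a Linux-specific, per-socket setsockopt() option that
bounds the receive window the client advertises, so its servers can never
have more than that much data in flight toward it. What follows is a
minimal sketch of the idea, not Patrick's actual code; the helper name and
the suggestion to size the clamp from the switch buffer divided by the
number of servers read concurrently are illustrative assumptions.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Cap the window this client advertises on one connection.
     * Call on each client socket after connect().
     * Assumed sizing: window_bytes ~= switch_buffer_bytes / nr_servers,
     * so the aggregate in-flight data fits in the switch's port buffer. */
    static int clamp_window(int fd, int window_bytes)
    {
        return setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                          &window_bytes, sizeof(window_bytes));
    }
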
You just papered over the problem. If the mean queue length over time
is greater than one, you will lose packets. This may be a case where
Ethernet flow control could help. It does have the problem of
head-of-line blocking when cascading switches, but if the switch is
just a pass-through it might help. (There is a short note on enabling
it at the end of this message.)

> I have not read all of the messages on this list, but I have read
> enough to make me a little nervous. And thus I send this message in
> the hope that, in your quest to slay the "buffer bloat" dragon, you do
> not completely forget applications like mine. I would hate to have to
> switch to Infiniband or whatever just because everyone decided that
> Web browsers are the only TCP/IP application in the world.

My view is that this is all about getting the defaults right for
average users. People with big servers will always end up tuning;
that's what they get paid for. Think of it as the difference between a
Formula 1 car and an average sedan. You want the sedan to just work,
and have all the traction control and rev limiters. For the F1 race
car, the driver knows best.
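
A closing note on the flow-control suggestion, as a sketch rather than
anything from the original thread: the mechanism in question is 802.3x
pause, which has to be enabled on both the NIC and the switch port; on
Linux the NIC side is normally toggled with ethtool's -A/--pause option,
assuming the driver supports it. On the timescale involved: one extra
server transmitting at the full 10 Gb/s adds roughly 1.25 GB of backlog
per second toward the client port, so a 1 MB switch buffer is exhausted
in about 0.8 ms. That is why, without clamped windows or pause frames,
the overrun happens almost instantly.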