[Bloat] Not all the world's a WAN

Richard Scheffenegger rscheff at gmx.at
Tue Aug 23 03:44:15 EDT 2011


Your problem is called incast, and there is a vast literature on that
subject and on how to alleviate it, more or less effectively.

There are simple approaches, with limited benefits, like reducing the
retransmission timeout (RTO), high-resolution TCP timers, and
introducing short random delays for the responses. None of these will
give optimal bandwidth, though; a sketch of the random-delay idea
follows.
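As a minimal sketch of that last idea, in Python: the jitter bound and
the send_response helper are illustrative assumptions, not part of any
particular implementation.

    import random
    import socket
    import time

    MAX_JITTER_S = 0.0005  # assumed bound: up to 500 us of jitter

    def send_response(sock: socket.socket, payload: bytes) -> None:
        # Desynchronize the servers so their reply bursts do not all
        # arrive at the switch's egress port in the same instant.
        time.sleep(random.uniform(0.0, MAX_JITTER_S))
        sock.sendall(payload)

Note that time.sleep() on a stock kernel may round up to a coarser
granularity than requested, which is why high-resolution timers appear
alongside this trick in the incast literature.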

If you have a cheap 10G switch built around the Broadcom chipsets,
rather than the expensive gear from another, better-known vendor, you
can perhaps deploy DCTCP, yielding up to 98% bandwidth utilization with
consistently low latency and near-zero loss, even when the setup is
prone to severe incast...
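On the host side, a kernel that ships the dctcp congestion-control
module lets you select it per socket. A sketch, assuming Linux with the
module available and ECN marking configured on the switch (both are
prerequisites, not givens; Python 3.6+ exposes socket.TCP_CONGESTION):

    import socket

    def open_dctcp_socket() -> socket.socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Linux-only: select the DCTCP congestion-control module for
        # this socket; raises OSError if the kernel does not offer it.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION,
                        b"dctcp")
        return sock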


rgds

----- Original Message ----- 
From: "Patrick J. LoPresti" <lopresti at gmail.com>
To: <bloat at lists.bufferbloat.net>
Sent: Thursday, August 18, 2011 3:26 AM
Subject: [Bloat] Not all the world's a WAN


> Hello, BufferBloat crusaders.
>
> Permit me briefly to describe my application.  I have a rack full of
> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
> There are no routers or broader Internet connectivity.  (At least,
> none that matters here.)  Round trip "ping" times between systems are
> 100 microseconds or so.
>
> Some of the systems are "servers", some are "clients".  Any single
> client may decide to slurp data from multiple servers.  For example,
> the servers could be serving up a distributed file system, so when a
> client accesses a file striped across multiple servers, it tries to
> pull data from multiple servers simultaneously.  (This is not my
> literal application, but it does represent the same access pattern.)
>
> The purpose of my cluster is to process data sets measured in hundreds
> of gigabytes, as fast as possible.  So, for my application:
>
> - Speed = Throughput (latency is irrelevant)
> - TCP retransmissions are a disaster, not least because
> - 200 ms (a typical TCP minimum retransmission timeout) is an eternity
>
>
> The problem I have is this:  At 10 gigabits/second, it takes very
> little time to overrun even a sizable buffer in a 10GbE switch.
> Although each client does have a 10GbE connection, it is reading
> multiple sockets from multiple servers, so over short intervals the
> switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to
> client).  If the servers do not throttle themselves -- say, because
> the TCP windows are large -- packets overrun the switch's buffer and
> get lost.
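A back-of-the-envelope illustration of how fast that overrun happens,
in Python (the four-server fan-in and the 2 MiB shared buffer are
assumed figures, not Pat's actual numbers):

    LINK_BPS = 10e9         # per-port line rate, bits/s
    SERVERS = 4             # assumed fan-in toward one client port
    BUFFER_BYTES = 2 << 20  # assumed 2 MiB of shared switch buffer

    # Arrival beyond what the single egress link can drain, bytes/s.
    excess_bps = (SERVERS - 1) * LINK_BPS / 8
    fill_time_s = BUFFER_BYTES / excess_bps
    print(f"buffer overruns in {fill_time_s * 1e6:.0f} microseconds")
    # -> roughly 560 microseconds with these figures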
>
> I have "fixed" this problem by using a switch with a _large_ buffer,
> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
> never gets very large.  This ensures that the servers never send so
> much data that they overrun the switch.  And it is actually working
> great; I am able to saturate all of my 10GbE links with zero
> retransmissions.
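A minimal sketch of that client-side clamp, assuming Linux (the
TCP_WINDOW_CLAMP socket option is Linux-specific, and the
buffer-sharing arithmetic here is an illustrative assumption):

    import socket

    SWITCH_BUFFER_BYTES = 2 << 20  # assumed shared switch buffer
    SERVERS = 4                    # assumed count of concurrent senders

    # Python exposes the constant on Linux; fall back to the value
    # from <linux/tcp.h> if this build does not define it.
    TCP_WINDOW_CLAMP = getattr(socket, "TCP_WINDOW_CLAMP", 10)

    def open_clamped_socket() -> socket.socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Cap the advertised receive window so that each server can
        # keep at most its share of the switch buffer in flight.
        clamp = SWITCH_BUFFER_BYTES // SERVERS
        sock.setsockopt(socket.IPPROTO_TCP, TCP_WINDOW_CLAMP, clamp)
        return sock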
>
> I have not read all of the messages on this list, but I have read
> enough to make me a little nervous.  And thus I send this message in
> the hope that, in your quest to slay the "buffer bloat" dragon, you do
> not completely forget applications like mine.  I would hate to have to
> switch to InfiniBand or whatever just because everyone decided that
> Web browsers are the only TCP/IP application in the world.
>
> Thanks for reading.
>
> - Pat
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat 



