[Bloat] Not all the world's a WAN
Richard Scheffenegger
rscheff at gmx.at
Tue Aug 23 03:44:15 EDT 2011
Your problem is called incast, and there is a vast literature on that
subject and on ways to alleviate it, more or less successfully.
There are simple approaches - with limited benefits - like reducing the
RTO, high-resolution TCP timers, and introducing short random delays
before sending the responses. None of these will give optimal
bandwidth, though.
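For the random-delay trick, a minimal sketch of what a server could do
before answering (the 0-500 microsecond window is an arbitrary
assumption; tune it to your RTT and response sizes):

#include <stdlib.h>
#include <unistd.h>

/* Stagger this server's response by a small random delay so that
 * several servers answering the same client are less likely to burst
 * into the client's switch port in the same instant.  Call once
 * before the first write() of each response; seed rand() once at
 * startup, e.g. with srand(getpid()). */
static void incast_jitter(void)
{
    usleep(rand() % 500);  /* 0..499 us; the window size is a guess */
}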
If you have a cheap 10G switch built around the Broadcom chipsets,
rather than the expensive gear from another, more well-known vendor,
you can perhaps deploy DCTCP, yielding up to 98% bandwidth utilization
with consistently low latency and near-zero loss, even when the setup
is prone to severe incast...
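Assuming a kernel that actually has a dctcp congestion-control module
available (mainline Linux does not ship one, so assume a patched
kernel), selecting it per socket is a single call. A sketch:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the kernel to run this socket under the dctcp congestion
 * control module.  The call fails if no such module is loaded, so
 * callers should fall back to the system default in that case. */
static int use_dctcp(int fd)
{
    static const char cc[] = "dctcp";
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, sizeof(cc));
}

The switch side matters just as much: DCTCP depends on the switch
ECN-marking packets at a shallow queue threshold, which is exactly
what those Broadcom-based boxes can be configured to do.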
Regards,
----- Original Message -----
From: "Patrick J. LoPresti" <lopresti at gmail.com>
To: <bloat at lists.bufferbloat.net>
Sent: Thursday, August 18, 2011 3:26 AM
Subject: [Bloat] Not all the world's a WAN
> Hello, BufferBloat crusaders.
>
> Permit me briefly to describe my application. I have a rack full of
> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
> There are no routers or broader Internet connectivity. (At least,
> none that matters here.) Round trip "ping" times between systems are
> 100 microseconds or so.
>
> Some of the systems are "servers", some are "clients". Any single
> client may decide to slurp data from multiple servers. For example,
> the servers could be serving up a distributed file system, so when a
> client accesses a file striped across multiple servers, it tries to
> pull data from multiple servers simultaneously. (This is not my
> literal application, but it does represent the same access pattern.)
>
> The purpose of my cluster is to process data sets measured in hundreds
> of gigabytes, as fast as possible. So, for my application:
>
> - Speed = Throughput (latency is irrelevant)
> - TCP retransmissions are a disaster, not least because
> - 200 ms (the stock Linux minimum RTO) is an eternity
>
>
> The problem I have is this: At 10 gigabits/second, it takes very
> little time to overrun even a sizable buffer in a 10GbE switch.
> Although each client does have a 10GbE connection, it is reading
> multiple sockets from multiple servers, so over short intervals the
> switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to
> client). If the servers do not throttle themselves -- say, because
> the TCP windows are large -- packets overrun the switch's buffer and
> get lost.
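>
> To make that concrete with made-up numbers (say four servers bursting
> at line rate into 1 MB of shared switch buffer): the aggregate offered
> load is 40 gigabits/second against 10 gigabits/second of drain, so the
> buffer absorbs the 30 gigabits/second excess, roughly 3.75 gigabytes
> per second, and 1 MB of buffer fills in about 270 microseconds. That
> is only a couple of round trips at our 100 microsecond RTT.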
>
> I have "fixed" this problem by using a switch with a _large_ buffer,
> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
> never gets very large. This ensures that the servers never send so
> much data that they overrun the switch. And it is actually working
> great; I am able to saturate all of my 10GbE links with zero
> retransmissions.
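>
> In case it helps anyone reproduce this, the clamp is just a socket
> option on Linux; a minimal sketch (the 64 KB figure is illustrative,
> size it so that streams times clamp stays below the switch buffer):
>
> #include <netinet/in.h>
> #include <netinet/tcp.h>
> #include <sys/socket.h>
>
> /* Bound the window this client socket will ever advertise, so a
>  * server can never have more than `bytes` in flight toward us.
>  * The 64 KB figure is illustrative; pick it so that the number of
>  * concurrent streams times the clamp fits in the switch buffer. */
> static int clamp_window(int fd)
> {
>     int bytes = 64 * 1024;
>     return setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
>                       &bytes, sizeof(bytes));
> }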
>
> I have not read all of the messages on this list, but I have read
> enough to make me a little nervous. And thus I send this message in
> the hope that, in your quest to slay the "buffer bloat" dragon, you do
> not completely forget applications like mine. I would hate to have to
> switch to InfiniBand or whatever just because everyone decided that
> Web browsers are the only TCP/IP application in the world.
>
> Thanks for reading.
>
> - Pat
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat