[Bloat] Not all the world's a WAN
Stephen Hemminger
shemminger at vyatta.com
Wed Aug 17 23:57:24 EDT 2011
On Wed, 17 Aug 2011 18:26:00 -0700
"Patrick J. LoPresti" <lopresti at gmail.com> wrote:
> Hello, BufferBloat crusaders.
>
> Permit me briefly to describe my application. I have a rack full of
> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
> There are no routers or broader Internet connectivity. (At least,
> none that matters here.) Round trip "ping" times between systems are
> 100 microseconds or so.
>
> Some of the systems are "servers", some are "clients". Any single
> client may decide to slurp data from multiple servers. For example,
> the servers could be serving up a distributed file system, so when a
> client accesses a file striped across multiple servers, it tries to
> pull data from multiple servers simultaneously. (This is not my
> literal application, but it does represent the same access pattern.)
>
> The purpose of my cluster is to process data sets measured in hundreds
> of gigabytes, as fast as possible. So, for my application:
>
> - Speed = Throughput (latency is irrelevant)
> - TCP retransmissions are a disaster, not least because
> - 200ms is an eternity
>
>
> The problem I have is this: At 10 gigabits/second, it takes very
> little time to overrun even a sizable buffer in a 10GbE switch.
> Although each client does have a 10GbE connection, it is reading
> multiple sockets from multiple servers, so over short intervals the
> switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to
> client). If the servers do not throttle themselves -- say, because
> the TCP windows are large -- packets overrun the switch's buffer and
> get lost.
You need faster switches ;-)
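To put rough numbers on the overrun described above, here is a back-of-the-envelope sketch. Only the 10GbE link speed and the ~100 microsecond round trip come from the original message; the 2 MiB buffer and the count of four servers are hypothetical values chosen for illustration, and the model (all senders at line rate, one egress link draining the buffer) is a deliberate simplification.

```python
import math

LINK_BPS = 10e9   # 10GbE line rate (from the post)
RTT_S = 100e-6    # ~100 microsecond round trip (from the post)

def fill_time_seconds(buffer_bytes, n_senders, link_bps=LINK_BPS):
    """Time for n_senders at line rate to fill a buffer drained by one link."""
    excess_bps = (n_senders - 1) * link_bps  # arrival rate minus drain rate
    if excess_bps <= 0:
        return math.inf  # a single sender cannot overrun the egress link
    return buffer_bytes * 8 / excess_bps

def bdp_bytes(link_bps=LINK_BPS, rtt_s=RTT_S):
    """Bandwidth-delay product: bytes the path itself holds in flight."""
    return round(link_bps * rtt_s / 8)

def max_window_per_sender(buffer_bytes, n_senders,
                          link_bps=LINK_BPS, rtt_s=RTT_S):
    """Rough per-connection window so all in-flight data fits in path + buffer."""
    return (buffer_bytes + bdp_bytes(link_bps, rtt_s)) // n_senders

if __name__ == "__main__":
    buf = 2 * 1024 * 1024  # hypothetical 2 MiB of buffer for the client port
    senders = 4            # hypothetical number of servers sending at once
    print("fill time (us):", fill_time_seconds(buf, senders) * 1e6)
    print("BDP (bytes):", bdp_bytes())
    print("window budget per sender (bytes):",
          max_window_per_sender(buf, senders))
```

Under these assumptions the buffer fills in well under a millisecond, which is why bounding each connection's window matters far more here than it would on a WAN path.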
> I have "fixed" this problem by using a switch with a _large_ buffer,
> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
> never gets very large. This ensures that the servers never send so
> much data that they overrun the switch. And it is actually working
> great; I am able to saturate all of my 10GbE links with zero
> retransmissions.
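For readers unfamiliar with the option mentioned above, this is a minimal sketch of setting TCP_WINDOW_CLAMP on a client socket. It assumes Linux; the function name and the 64 KiB clamp value are hypothetical, and the original message does not say what value or language was actually used.

```python
import socket

# Linux-specific option. Fall back to the numeric value from <netinet/tcp.h>
# in case this Python build does not export the constant.
TCP_WINDOW_CLAMP = getattr(socket, "TCP_WINDOW_CLAMP", 10)

def make_clamped_socket(clamp_bytes):
    """Create a TCP socket whose advertised receive window is bounded.

    The clamp is set before connect() so that it is in effect when the
    connection is negotiated.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, TCP_WINDOW_CLAMP, clamp_bytes)
    return s

if __name__ == "__main__":
    sock = make_clamped_socket(64 * 1024)  # hypothetical 64 KiB clamp
    print(sock.getsockopt(socket.IPPROTO_TCP, TCP_WINDOW_CLAMP))
    sock.close()
```

Because the receiver advertises at most the clamped window, each server can have no more than that many bytes in flight toward this client, which is what keeps the switch buffer from overflowing.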
You just papered over the problem. If the mean queue length over
time is greater than one, you will lose packets. This may be a case
where Ethernet flow control might help. It does have the problem
of head-of-line blocking when cascading switches, but if the switch
is just a pass-through it might help.
> I have not read all of the messages on this list, but I have read
> enough to make me a little nervous. And thus I send this message in
> the hope that, in your quest to slay the "buffer bloat" dragon, you do
> not completely forget applications like mine. I would hate to have to
> switch to Infiniband or whatever just because everyone decided that
> Web browsers are the only TCP/IP application in the world.
>
My view is that this is all about getting the defaults right for average
users. People with big servers will always end up tuning; that's what
they get paid for. Think of it as the difference between a Formula 1
car and an average sedan. You want the sedan to just work, and to
have all the traction control and rev limiters. For the F1 race
car, the driver knows best.