General list for discussing Bufferbloat
* [Bloat] Not all the world's a WAN
@ 2011-08-18  1:26 Patrick J. LoPresti
  2011-08-18  3:57 ` Stephen Hemminger
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Patrick J. LoPresti @ 2011-08-18  1:26 UTC (permalink / raw)
  To: bloat

Hello, BufferBloat crusaders.

Permit me briefly to describe my application.  I have a rack full of
Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
There are no routers or broader Internet connectivity.  (At least,
none that matters here.)  Round trip "ping" times between systems are
100 microseconds or so.

Some of the systems are "servers", some are "clients".  Any single
client may decide to slurp data from multiple servers.  For example,
the servers could be serving up a distributed file system, so when a
client accesses a file striped across multiple servers, it tries to
pull data from multiple servers simultaneously.  (This is not my
literal application, but it does represent the same access pattern.)

The purpose of my cluster is to process data sets measured in hundreds
of gigabytes, as fast as possible.  So, for my application:

 - Speed = Throughput (latency is irrelevant)
 - TCP retransmissions are a disaster, not least because
 - 200ms is an eternity


The problem I have is this:  At 10 gigabits/second, it takes very
little time to overrun even a sizable buffer in a 10GbE switch.
Although each client does have a 10GbE connection, it is reading
multiple sockets from multiple servers, so over short intervals the
switch's aggregate incoming bandwidth (multiple 10GbE links from
servers) is larger than its outgoing bandwidth (single 10GbE link to
client).  If the servers do not throttle themselves -- say, because
the TCP windows are large -- packets overrun the switch's buffer and
get lost.

I have "fixed" this problem by using a switch with a _large_ buffer,
plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
never gets very large.  This ensures that the servers never send so
much data that they overrun the switch.  And it is actually working
great; I am able to saturate all of my 10GbE links with zero
retransmissions.
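A minimal sketch of that clamp, assuming a connected client socket "fd"
and a clamp derived from the 10GbE x 100 us bandwidth-delay product
(125 KB); the helper and the per-server split are illustrative, not
Pat's actual code:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Clamp the advertised receive window so that even N servers sending
 * full windows at once cannot overrun the switch egress buffer. */
static int clamp_window(int fd, int nservers)
{
    long long bits_per_sec = 10000000000LL;            /* 10 Gbit/s */
    int bdp = (int)(bits_per_sec / 8 * 100 / 1000000); /* 125000 bytes at 100 us RTT */
    int clamp = bdp / nservers;                        /* headroom for N-to-1 fan-in */

    /* TCP_WINDOW_CLAMP is a ceiling on the receive window, in bytes;
     * the kernel may round it or impose a floor. */
    return setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                      &clamp, sizeof(clamp));
}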

I have not read all of the messages on this list, but I have read
enough to make me a little nervous.  And thus I send this message in
the hope that, in your quest to slay the "buffer bloat" dragon, you do
not completely forget applications like mine.  I would hate to have to
switch to Infiniband or whatever just because everyone decided that
Web browsers are the only TCP/IP application in the world.

Thanks for reading.

 - Pat


* Re: [Bloat] Not all the world's a WAN
  2011-08-18  1:26 [Bloat] Not all the world's a WAN Patrick J. LoPresti
@ 2011-08-18  3:57 ` Stephen Hemminger
  2011-08-18  7:45   ` Neil Davies
  2011-08-18  5:08 ` Steinar H. Gunderson
  2011-08-23  7:44 ` Richard Scheffenegger
  2 siblings, 1 reply; 7+ messages in thread
From: Stephen Hemminger @ 2011-08-18  3:57 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: bloat

On Wed, 17 Aug 2011 18:26:00 -0700
"Patrick J. LoPresti" <lopresti@gmail.com> wrote:

> Hello, BufferBloat crusaders.
> 
> Permit me briefly to describe my application.  I have a rack full of
> Linux systems, all with 10GbE NICs tied together by a 10GbE switch.
> There are no routers or broader Internet connectivity.  (At least,
> none that matters here.)  Round trip "ping" times between systems are
> 100 microseconds or so.
> 
> Some of the systems are "servers", some are "clients".  Any single
> client may decide to slurp data from multiple servers.  For example,
> the servers could be serving up a distributed file system, so when a
> client accesses a file striped across multiple servers, it tries to
> pull data from multiple servers simultaneously.  (This is not my
> literal application, but it does represent the same access pattern.)
> 
> The purpose of my cluster is to process data sets measured in hundreds
> of gigabytes, as fast as possible.  So, for my application:
> 
>  - Speed = Throughput (latency is irrelevant)
>  - TCP retransmissions are a disaster, not least because
>  - 200ms is an eternity
> 
> 
> The problem I have is this:  At 10 gigabits/second, it takes very
> little time to overrun even a sizable buffer in a 10GbE switch.
> Although each client does have a 10GbE connection, it is reading
> multiple sockets from multiple servers, so over short intervals the
> switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to
> client).  If the servers do not throttle themselves -- say, because
> the TCP windows are large -- packets overrun the switch's buffer and
> get lost.

You need faster switches ;-)

> I have "fixed" this problem by using a switch with a _large_ buffer,
> plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window
> never gets very large.  This ensures that the servers never send so
> much data that they overrun the switch.  And it is actually working
> great; I am able to saturate all of my 10GbE links with zero
> retransmissions.

You just papered over the problem. If the mean queue length over
time is greater than one, you will lose packets. This may be a case
where Ethernet flow control might help. It does have the problem
of head-of-line blocking when cascading switches, but if the switch
is just a pass-through it might help.
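For reference, 802.3x pause frames are toggled per NIC, e.g.
"ethtool -A eth0 rx on tx on"; a sketch of the same thing via the
ethtool ioctl, assuming Linux and an interface named "eth0":

#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

/* Enable RX and TX pause-frame flow control on one interface.
 * Error handling trimmed; pause autonegotiation is left disabled. */
static int enable_pause(const char *ifname)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ethtool_pauseparam pp = {
        .cmd = ETHTOOL_SPAUSEPARAM,
        .autoneg = 0,
        .rx_pause = 1,
        .tx_pause = 1,
    };
    struct ifreq ifr;
    int ret;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&pp;    /* ethtool commands ride in ifr_data */

    ret = ioctl(fd, SIOCETHTOOL, &ifr);
    close(fd);
    return ret;
}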

> I have not read all of the messages on this list, but I have read
> enough to make me a little nervous.  And thus I send this message in
> the hope that, in your quest to slay the "buffer bloat" dragon, you do
> not completely forget applications like mine.  I would hate to have to
> switch to Infiniband or whatever just because everyone decided that
> Web browsers are the only TCP/IP application in the world.
> 

My view is that this is all about getting the defaults right for
average users. People with big servers will always end up tuning;
that's what they get paid for. Think of it as the difference between
a Formula 1 car and an average sedan. You want the sedan to just
work, with all the traction control and rev limiters. For the F1
car, the driver knows best.



* Re: [Bloat] Not all the world's a WAN
  2011-08-18  1:26 [Bloat] Not all the world's a WAN Patrick J. LoPresti
  2011-08-18  3:57 ` Stephen Hemminger
@ 2011-08-18  5:08 ` Steinar H. Gunderson
  2011-08-19 14:59   ` BeckW
  2011-08-23  7:44 ` Richard Scheffenegger
  2 siblings, 1 reply; 7+ messages in thread
From: Steinar H. Gunderson @ 2011-08-18  5:08 UTC (permalink / raw)
  To: bloat

On Wed, Aug 17, 2011 at 06:26:00PM -0700, Patrick J. LoPresti wrote:
>  - Speed = Throughput (latency is irrelevant)
>  - TCP retransmissions are a disaster, not least because
>  - 200ms is an eternity

Note that you can adjust this (the 200 ms minimum RTO); if nothing else,
by hacking the kernel. If you only care about intra-cluster traffic anyway,
setting it to something like 2 ms would probably help a lot.
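The 200 ms floor in question is the TCP_RTO_MIN constant in
include/net/tcp.h; a sketch of the kind of one-line change meant here
(the _CLUSTER variant is hypothetical and assumes HZ >= 500; on newer
kernels "ip route ... rto_min 2ms" gets a similar per-route effect
without patching):

/* include/net/tcp.h: the minimum retransmission timeout. */
#define TCP_RTO_MIN ((unsigned)(HZ/5))           /* HZ/5 jiffies == 200 ms */

/* Hypothetical intra-cluster variant, 2 ms: */
#define TCP_RTO_MIN_CLUSTER ((unsigned)(HZ/500))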

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [Bloat] Not all the world's a WAN
  2011-08-18  3:57 ` Stephen Hemminger
@ 2011-08-18  7:45   ` Neil Davies
  0 siblings, 0 replies; 7+ messages in thread
From: Neil Davies @ 2011-08-18  7:45 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Patrick J. LoPresti, bloat

Stephen

I disagree with you - Patrick has solved his problem.

As for papering over the cracks - that is just pure provocation. What is wrong with more than *one* packet in the buffer?

Any finite queueing system has two degrees of freedom; there are three variables in play: the loading factor (the ratio of arrivals to departures, along with their distributions); delay (and its distribution); and loss (and its distribution).

And in that system it is a trade. Patrick's trade is to constrain the arrival distribution/pattern so as to keep the total quality attenuation (delay and loss) at his buffer points within an acceptable bound for his application. He has made a rational choice for his requirements: he has bounded the induced quality attenuation.

To solve the 'general' case you need to solve for the general 'induced quality attenuation' - therein lies the nub of the issue.
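A toy illustration of that trade - not Neil's model, just the textbook
M/M/1/K queue with Poisson arrivals - showing how loss depends on the
loading factor rho and the buffer size K:

#include <math.h>
#include <stdio.h>

/* Loss probability of an M/M/1/K queue:
 * P(loss) = (1-rho)*rho^K / (1-rho^(K+1)), or 1/(K+1) at rho == 1. */
static double mm1k_loss(double rho, int K)
{
    if (rho == 1.0)
        return 1.0 / (K + 1);
    return (1.0 - rho) * pow(rho, K) / (1.0 - pow(rho, K + 1));
}

int main(void)
{
    /* Fixed buffer, rising load: loss stays negligible below
     * saturation and explodes once rho passes 1. */
    for (double rho = 0.5; rho < 1.3; rho += 0.25)
        printf("rho=%.2f K=64  P(loss)=%.3g\n", rho, mm1k_loss(rho, 64));
    return 0;
}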

Neil


* Re: [Bloat] Not all the world's a WAN
  2011-08-18  5:08 ` Steinar H. Gunderson
@ 2011-08-19 14:59   ` BeckW
  2011-08-23  7:37     ` Richard Scheffenegger
  0 siblings, 1 reply; 7+ messages in thread
From: BeckW @ 2011-08-19 14:59 UTC (permalink / raw)
  To: sgunderson, bloat

> [...]so over short intervals the switch's aggregate incoming bandwidth (multiple 10GbE links from
> servers) is larger than its outgoing bandwidth (single 10GbE link to client).

Well, this kind of burstiness is the reason why we have buffers at all.

Restricting the window size limits the TCP speed, and in this way keeps it from overflowing the buffer outside of bursty periods.

ECN marking helps TCP to distinguish between genuinely available bandwidth and 'fake' bandwidth that just fills up a buffer, at least if the feedback loop is fast enough. Do Ethernet switches support ECN marking?
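Concretely, the "marking" is a rewrite of the two ECN bits in the IP
header (RFC 3168):

/* ECN field values, RFC 3168.  A marking switch/router rewrites
 * ECT(0)/ECT(1) to CE instead of dropping the packet; the receiver
 * echoes the mark back so the sender can back off before the buffer
 * actually overflows. */
enum ecn_codepoint {
    ECN_NOT_ECT = 0x0,    /* sender did not negotiate ECN       */
    ECN_ECT_1   = 0x1,    /* ECN-capable transport, codepoint 1 */
    ECN_ECT_0   = 0x2,    /* ECN-capable transport, codepoint 0 */
    ECN_CE      = 0x3,    /* Congestion Experienced (the mark)  */
};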



* Re: [Bloat] Not all the world's a WAN
  2011-08-19 14:59   ` BeckW
@ 2011-08-23  7:37     ` Richard Scheffenegger
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Scheffenegger @ 2011-08-23  7:37 UTC (permalink / raw)
  To: BeckW, sgunderson, bloat

Some do, e.g. L3-capable switches; Broadcom Scorpio (and later) can do ECN
marking based on L2 buffer occupancy.

See the paper from Stanford/Microsoft about DCTCP: ECN marking based on
instantaneous buffer occupancy and a proportional TCP reaction (rather than
halving cwnd) per RTT...
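A sketch of that proportional reaction, following the formulas in the
DCTCP paper (alpha is a moving average of the fraction of ECN-marked
packets per window; the gain g = 1/16 is the paper's suggestion):

/* Once per window of data: update the congestion estimate alpha and,
 * if any packet was marked, shrink cwnd in proportion to alpha rather
 * than halving it as standard TCP would. */
static double alpha;    /* running estimate of the marked fraction */

static double dctcp_window(double cwnd, int acked, int marked)
{
    const double g = 1.0 / 16.0;
    double frac = acked ? (double)marked / acked : 0.0;

    alpha = (1.0 - g) * alpha + g * frac;    /* alpha <- (1-g)*alpha + g*F */
    if (marked)
        cwnd *= 1.0 - alpha / 2.0;           /* cwnd <- cwnd*(1 - alpha/2) */
    return cwnd;
}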

rgds

* Re: [Bloat] Not all the world's a WAN
  2011-08-18  1:26 [Bloat] Not all the world's a WAN Patrick J. LoPresti
  2011-08-18  3:57 ` Stephen Hemminger
  2011-08-18  5:08 ` Steinar H. Gunderson
@ 2011-08-23  7:44 ` Richard Scheffenegger
  2 siblings, 0 replies; 7+ messages in thread
From: Richard Scheffenegger @ 2011-08-23  7:44 UTC (permalink / raw)
  To: Patrick J. LoPresti, bloat

Your problem is called incast, and there is a vast literature on the
subject and how to alleviate it, more or less.

There are simple approaches - with limited benefits - like RTO reduction,
high-resolution TCP timers, and introducing short random delays for the
responses; none of these will give optimal bandwidth, though.
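A sketch of the last of those, with an illustrative (untuned) bound;
the point is only to desynchronize the servers so their bursts do not
hit the client's switch port at the same instant:

#include <stdlib.h>
#include <unistd.h>

/* Jitter each server's response by 0-499 us before sending. */
static void jitter_response(void)
{
    usleep(rand() % 500);
}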

If you have a cheap 10G switch built around the Broadcom chipsets, rather
than the expensive gear from another, more well-known vendor, you can
perhaps deploy DCTCP, yielding up to 98% bandwidth with optimal latency and
near-zero loss, even when the setup is prone to severe incast...


rgds


end of thread

Thread overview: 7+ messages
2011-08-18  1:26 [Bloat] Not all the world's a WAN Patrick J. LoPresti
2011-08-18  3:57 ` Stephen Hemminger
2011-08-18  7:45   ` Neil Davies
2011-08-18  5:08 ` Steinar H. Gunderson
2011-08-19 14:59   ` BeckW
2011-08-23  7:37     ` Richard Scheffenegger
2011-08-23  7:44 ` Richard Scheffenegger
