Development issues regarding the cerowrt test router project
* [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
@ 2016-06-06 15:29 Eric Johansson
  2016-06-06 16:53 ` Toke Høiland-Jørgensen
  2016-06-07 22:31 ` [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat Valdis.Kletnieks
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Johansson @ 2016-06-06 15:29 UTC (permalink / raw)
  To: cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 1572 bytes --]

One of my clients asked Arista about bufferbloat issues in their
switches.  Here was their response.  Is their analysis right?

------

Buffer bloat was relevant on 10/100M switches, not 10Gb switches. At
10Gb we can empty the queue in ~100ms, which is less than the TCP
retransmission timers, therefore no bloat. Buffer bloat can happen at
slower speeds, but not an issue at the speeds we have on our switches. 

There are some articles regarding bufferbloat on the net, but buffer
bloat is not a problem on our switches.  Some of the information
regarding bufferbloat cites Internet routers where packets can be held
in queues of very large buffers for several seconds, up to 10 seconds,
which can cause TCP retransmission problems and lower overall
application performance when going across the public Internet.

The buffers on the Arista 7500 are 128MB of packet buffers per 10GbE
port coupled to a fully arbitrated virtual output queue (VOQ) forwarding
system.  At 10Gbps this is ~100msec of buffer capacity, which is an order
of magnitude below the ‘1 second’ and two orders of magnitude below the
10 seconds worst case identified in buffer bloat documents.

We (Arista) have switching systems with large buffers and high port
count, or small buffers and high port count, running one operating system.
Buffer bloat is real in systems that would have more than 1.25GB of
packet buffer per 10Gb port - none of these systems contribute to the
buffer bloat issue.

We position deep buffering switches where lossless performance is
necessary.  
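
(As a quick sanity check of the drain-time figure quoted above: a minimal
Python sketch, reusing the 128MB-per-port and 10Gbps numbers from the
vendor's text.)

buffer_bytes = 128 * 1024**2          # 128MB of packet buffer per 10GbE port
link_bps     = 10 * 10**9             # 10Gbps line rate
drain_ms     = buffer_bytes * 8 / link_bps * 1e3
print("worst-case drain time: %.0f ms" % drain_ms)   # ~107 ms, the "~100msec" above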


[-- Attachment #2: Type: text/html, Size: 5063 bytes --]


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-06 15:29 [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat Eric Johansson
@ 2016-06-06 16:53 ` Toke Høiland-Jørgensen
  2016-06-06 17:46   ` Jonathan Morton
  2016-06-07 22:31 ` [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat Valdis.Kletnieks
  1 sibling, 1 reply; 26+ messages in thread
From: Toke Høiland-Jørgensen @ 2016-06-06 16:53 UTC (permalink / raw)
  To: Eric Johansson; +Cc: cerowrt-devel

Eric Johansson <esj@eggo.org> writes:

> Buffer bloat was relevant on 10/100M switches, not 10Gb switches. At
> 10Gb we can empty the queue in ~100ms, which is less than the TCP
> retransmission timers, therefore no bloat. Buffer bloat can happen at
> slower speeds, but not an issue at the speeds we have on our switches.

100 ms of buffering at 10 Gbps? Holy cow!

There's no agreed-upon definition of what exactly constitutes 'bloat',
and it really depends on the application. As such, I'm not surprised
that this is the kind of answer you get if you ask "do your switches
suffer from bufferbloat". A better question would be "how much buffer
latency can your switches add to my traffic" - which they do answer here.

If I read the answer right, anytime you have (say) two ingress ports
sending traffic at full speed out one egress port, that traffic will be
queued for 100 ms. I would certainly consider that broken, but well,
YMMV depending on what you need them for...
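
(To put numbers on that scenario: a minimal sketch, assuming the 128MB
per-port figure from the vendor's reply and a sustained 2:1 overload of
one egress port.)

ingress_bps  = 2 * 10 * 10**9      # two ingress ports at line rate
egress_bps   = 10 * 10**9          # one 10Gbps egress port
buffer_bytes = 128 * 1024**2       # per-port buffer quoted by the vendor

fill_s  = buffer_bytes * 8 / (ingress_bps - egress_bps)  # queue grows at 10Gbps
delay_s = buffer_bytes * 8 / egress_bps                  # standing delay once full
print("buffer full after ~%.0f ms; every packet then waits ~%.0f ms"
      % (fill_s * 1e3, delay_s * 1e3))                   # ~107 ms for both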

-Toke


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-06 16:53 ` Toke Høiland-Jørgensen
@ 2016-06-06 17:46   ` Jonathan Morton
  2016-06-06 18:37     ` Mikael Abrahamsson
  0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Morton @ 2016-06-06 17:46 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Eric Johansson, cerowrt-devel


> On 6 Jun, 2016, at 19:53, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> 
>> Buffer bloat was relevant on 10/100M switches, not 10Gb switches. At
>> 10Gb we can empty the queue in ~100ms, which is less than the TCP
>> retransmission timers, therefore no bloat. Buffer bloat can happen at
>> slower speeds, but not an issue at the speeds we have on our switches.
> 
> 100 ms of buffering at 10 Gbps? Holy cow!
> 
> There's no agreed-upon definition of what exactly constitutes 'bloat',
> and it really depends on the application. As such, I'm not surprised
> that this is the kind of answer you get if you ask "do your switches
> suffer from bufferbloat". A better question would be "how much buffer
> latency can your switches add to my traffic" - which they offer here.
> 
> If I read the answer right, anytime you have (say) two ingress ports
> sending traffic at full speed out one egress port, that traffic will be
> queued for 100 ms. I would certainly consider that broken, but well,
> YMMV depending on what you need them for...

In a switch, which I have to assume will be used in a LAN or DC context, I would consider 1ms buffering to be a sane value - regardless of link speed.  At 10Gbps this still works out to roughly 1MB of buffer per port.

At 10Mbps this requirement corresponds to a single packet per port; I would tolerate an increase to 10ms (about 8 full-size packets) in that specific case, purely to reduce packet loss from typical packet-pair transmission characteristics.  The same buffer size should therefore suffice for 10Mbps and 100Mbps Ethernet.
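
(A minimal sketch of that sizing rule, 1ms of buffering regardless of link
speed, assuming full-size 1500-byte packets:)

TARGET_S = 1e-3                                    # 1 ms of buffering per port
MTU = 1500                                         # assumed full-size packet

for rate_bps in (10e6, 100e6, 1e9, 10e9):
    buf = rate_bps * TARGET_S / 8                  # bytes of buffer needed
    print("%6.0f Mbps: %9.0f bytes (~%4.0f full-size packets)"
          % (rate_bps / 1e6, buf, buf / MTU))
# 10 Mbps needs ~1250 bytes (about one packet); 10 Gbps needs ~1.25 MB per port.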

Their reference to TCP retransmission timers betrays both a fundamental misunderstanding of how TCP works and an ignorance of the fact that non-TCP traffic is also important (and is typically more latency sensitive).  Some customers would consider even 1ms to be glacially slow.

At 100ms buffering, their 10Gbps switch is effectively turning any DC it’s installed in into a transcontinental Internet path, as far as peak latency is concerned.  Just because RAM is cheap these days…

For anything above switch class (i.e. with visibility at Layer 3 rather than 2), I would consider AQM mandatory to support a claim of "unbloated".  Even if it's just WRED; it's not considered a *good* AQM by today's standards, but it beats a dumb FIFO hands down.
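
(For reference, the WRED-style drop curve referred to above fits in a few
lines; a simplified sketch with illustrative thresholds, ignoring the EWMA
queue averaging and count correction that real RED implementations use.)

import random

MIN_TH, MAX_TH, MAX_P = 5_000, 15_000, 0.1     # illustrative thresholds (bytes)

def red_should_drop(avg_queue_bytes):
    # No drops below MIN_TH, probabilistic drops between the thresholds,
    # drop (or mark) everything above MAX_TH.
    if avg_queue_bytes < MIN_TH:
        return False
    if avg_queue_bytes >= MAX_TH:
        return True
    p = MAX_P * (avg_queue_bytes - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p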

 - Jonathan Morton



* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-06 17:46   ` Jonathan Morton
@ 2016-06-06 18:37     ` Mikael Abrahamsson
  2016-06-06 21:16       ` Ketan Kulkarni
  0 siblings, 1 reply; 26+ messages in thread
From: Mikael Abrahamsson @ 2016-06-06 18:37 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: cerowrt-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1355 bytes --]

On Mon, 6 Jun 2016, Jonathan Morton wrote:

> At 100ms buffering, their 10Gbps switch is effectively turning any DC 
> it’s installed in into a transcontinental Internet path, as far as peak 
> latency is concerned.  Just because RAM is cheap these days…

Nono, nononononono. I can tell you they're spending serious money on 
inserting this kind of buffering memory into these kinds of devices. 
Buying these devices without deep buffers is a lot lower cost.

These types of switch chips either have on-die memory (usually 16MB or 
less), or they have very expensive (a direct cost of lowered port density) 
off-chip buffering memory.

Typically you do this:

ports ---|-------
ports ---|      |
ports ---| chip |
ports ---|-------

Or you do this

ports ---|------|---buffer
ports ---| chip |---TCAM
          --------

or if you do a multi-linecard-device

ports ---|------|---buffer
          | chip |---TCAM
          --------
             |
         switch fabric

(or any variant of them)

So basically if you want to buffer and if you want large L2-L4 lookup 
tables, you have to sacrifice ports. Sacrifice lots of ports.

So never say these kinds of devices add buffering because RAM is cheap. 
This is most definitely not why they're doing it. Buffer memory for them 
is EXTREMELY EXPENSIVE.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-06 18:37     ` Mikael Abrahamsson
@ 2016-06-06 21:16       ` Ketan Kulkarni
  2016-06-07  2:52         ` dpreed
  0 siblings, 1 reply; 26+ messages in thread
From: Ketan Kulkarni @ 2016-06-06 21:16 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Jonathan Morton, cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 1955 bytes --]

Some time back they had this whitepaper -
"Why Big Data Needs Big Buffer Switches"
http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf

The type of apps they talk about is big data, Hadoop, etc.


On Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <swmike@swm.pp.se>
wrote:

> On Mon, 6 Jun 2016, Jonathan Morton wrote:
>
> At 100ms buffering, their 10Gbps switch is effectively turning any DC it’s
>> installed in into a transcontinental Internet path, as far as peak latency
>> is concerned.  Just because RAM is cheap these days…
>>
>
> Nono, nononononono. I can tell you they're spending serious money on
> inserting this kind of buffering memory into these kinds of devices. Buying
> these devices without deep buffers is a lot lower cost.
>
> These types of switch chips either have on-die memory (usually 16MB or
> less), or they have very expensive (a direct cost of lowered port density)
> off-chip buffering memory.
>
> Typically you do this:
>
> ports ---|-------
> ports ---|      |
> ports ---| chip |
> ports ---|-------
>
> Or you do this
>
> ports ---|------|---buffer
> ports ---| chip |---TCAM
>          --------
>
> or if you do a multi-linecard-device
>
> ports ---|------|---buffer
>          | chip |---TCAM
>          --------
>             |
>         switch fabric
>
> (or any variant of them)
>
> So basically if you want to buffer and if you want large L2-L4 lookup
> tables, you have to sacrifice ports. Sacrifice lots of ports.
>
> So never say these kinds of devices add buffering because RAM is cheap.
> This is most definitely not why they're doing it. Buffer memory for them is
> EXTREMELY EXPENSIVE.
>
> --
> Mikael Abrahamsson    email: swmike@swm.pp.se
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
>

[-- Attachment #2: Type: text/html, Size: 2908 bytes --]


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-06 21:16       ` Ketan Kulkarni
@ 2016-06-07  2:52         ` dpreed
  2016-06-07  2:58           ` dpreed
  2021-07-02 16:42           ` [Cerowrt-devel] Bechtolschiem Dave Taht
  0 siblings, 2 replies; 26+ messages in thread
From: dpreed @ 2016-06-07  2:52 UTC (permalink / raw)
  To: Ketan Kulkarni; +Cc: Mikael Abrahamsson, Jonathan Morton, cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 3957 bytes --]


So did anyone write a response debunking their paper?   Their NS-2 simulation is most likely the erroneous part of their analysis - the white paper would not pass a review by qualified referees because there is no way to check their results and some of what they say beggars belief.
 
Bechtolsheim is one of those guys who can write any damn thing and it becomes "truth" - mostly because he co-founded Sun. But that doesn't mean that he can't make huge errors - any of us can.
 
The so-called TCP/IP Bandwidth Capture effect that he refers to doesn't sound like any capture effect I've ever heard of.  There is an "Ethernet Capture Effect" (which is cited), which is due to properties of CSMA/CD binary exponential backoff, not anything to do with TCP's flow/congestion control.  So it has that "truthiness" that makes glib people sound like they know what they are talking about, but I'd like to see a reference that says this is a property of TCP!
 
What's interesting is that the reference to the Ethernet Capture Effect in that white paper proposes a solution that involves changing the backoff algorithm slightly at the Ethernet level - NOT increasing buffer size!
 
Another thing that would probably improve matters a great deal would be to drop/ECN-mark packets when a contended output port on an Arista switch develops a backlog.  This will throttle TCP sources sharing the path.
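
(A minimal sketch of that idea, in the spirit of DCTCP-style threshold
marking; the threshold is illustrative and pkt.set_ce() stands in for
whatever hook the forwarding path actually provides to set the CE codepoint.)

MARK_THRESHOLD_BYTES = 100_000        # ~80 us of backlog at 10 Gbps (illustrative)

def on_enqueue(pkt, queue_bytes, ect_capable):
    # Called for each packet arriving at a contended output port.
    if queue_bytes > MARK_THRESHOLD_BYTES:
        if ect_capable:
            pkt.set_ce()              # ECN-capable flow: mark it, the sender backs off
        else:
            return False              # non-ECN flow: drop instead of queueing
    return True                       # below threshold (or marked): enqueue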
 
The comments in the white paper saying that ACK contention in TCP in the reverse direction is the problem that causes the "so-called TCP/IP Bandwidth Capture effect" invented by the authors appear to be hogwash of the first order.
 
Debunking Bechtolsheim credibly would get a lot of attention to the bufferbloat cause, I suspect.
 


On Monday, June 6, 2016 5:16pm, "Ketan Kulkarni" <ketkulka@gmail.com> said:



some time back they had this whitepaper -
"Why Big Data Needs Big Buffer Switches"

[ http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf ]( http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf )
the type of apps they talk about is big data, hadoop etc


On Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <[ swmike@swm.pp.se ]( mailto:swmike@swm.pp.se )> wrote:
On Mon, 6 Jun 2016, Jonathan Morton wrote:

At 100ms buffering, their 10Gbps switch is effectively turning any DC it’s installed in into a transcontinental Internet path, as far as peak latency is concerned.  Just because RAM is cheap these days…Nono, nononononono. I can tell you they're spending serious money on inserting this kind of buffering memory into these kinds of devices. Buying these devices without deep buffers is a lot lower cost.

 These types of switch chips either have on-die memory (usually 16MB or less), or they have very expensive (a direct cost of lowered port density) off-chip buffering memory.

 Typically you do this:

 ports ---|-------
 ports ---|      |
 ports ---| chip |
 ports ---|-------

 Or you do this

 ports ---|------|---buffer
 ports ---| chip |---TCAM
          --------

 or if you do a multi-linecard-device

 ports ---|------|---buffer
          | chip |---TCAM
          --------
             |
         switch fabric

 (or any variant of them)

 So basically if you want to buffer and if you want large L2-L4 lookup tables, you have to sacrifice ports. Sacrifice lots of ports.

 So never say these kinds of devices add buffering because RAM is cheap. This is most definitely not why they're doing it. Buffer memory for them is EXTREMELY EXPENSIVE.

 -- 
 Mikael Abrahamsson    email: [ swmike@swm.pp.se ]( mailto:swmike@swm.pp.se )
_______________________________________________
 Cerowrt-devel mailing list
[ Cerowrt-devel@lists.bufferbloat.net ]( mailto:Cerowrt-devel@lists.bufferbloat.net )
[ https://lists.bufferbloat.net/listinfo/cerowrt-devel ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel )


[-- Attachment #2: Type: text/html, Size: 6714 bytes --]


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-07  2:52         ` dpreed
@ 2016-06-07  2:58           ` dpreed
  2016-06-07 10:46             ` Mikael Abrahamsson
  2016-06-07 17:51             ` Eric Johansson
  2021-07-02 16:42           ` [Cerowrt-devel] Bechtolschiem Dave Taht
  1 sibling, 2 replies; 26+ messages in thread
From: dpreed @ 2016-06-07  2:58 UTC (permalink / raw)
  To: dpreed; +Cc: Ketan Kulkarni, Jonathan Morton, cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 4295 bytes --]


Even better, it would be fun to get access to an Arista switch and some high performance TCP sources and sinks, and demonstrate extreme bufferbloat compared to a small-buffer switch.  Just a demo, not a simulation full of assumptions and guesses.
 
RRUL, basically.
 


On Monday, June 6, 2016 10:52pm, dpreed@reed.com said:



So did anyone write a response debunking their paper?   Their NS-2 simulation is most likely the erroneous part of their analysis - the white paper would not pass a review by qualified referees because there is no way to check their results and some of what they say beggars belief.
 
Bechtolsheim is one of those guys who can write any damn thing and it becomes "truth" - mostly because he co-founded Sun. But that doesn't mean that he can't make huge errors - any of us can.
 
The so-called TCP/IP Bandwidth Capture effect that he refers to doesn't sound like any capture effect I've ever heard of.  There is an "Ethernet Capture Effect" (which is cited), which is due to properties of CSMA/CD binary exponential backoff, not anything to do with TCP's flow/congestion control.  So it has that "truthiness" that makes glib people sound like they know what they are talking about, but I'd like to see a reference that says this is a property of TCP!
 
What's interesting is that the reference to the Ethernet Capture Effect in that white paper proposes a solution that involves changing the backoff algorithm slightly at the Ethernet level - NOT increasing buffer size!
 
Another thing that would probably improve matters a great deal would be to drop/ECN-mark packets when a contended output port on an Arista switch develops a backlog.  This will throttle TCP sources sharing the path.
 
The comments in the white paper that say that ACK contention in TCP in the reverse direction are the problem that causes the "so-called TCP/IP Bandwidth Capture effect" that is invented by the authors appears to be hogwash of the first order.
 
Debunking Bechtolsheim credibly would get a lot of attention to the bufferbloat cause, I suspect.
 


On Monday, June 6, 2016 5:16pm, "Ketan Kulkarni" <ketkulka@gmail.com> said:



some time back they had this whitepaper -
"Why Big Data Needs Big Buffer Switches"

[ http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf ]( http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf )
the type of apps they talk about is big data, hadoop etc


On Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <[ swmike@swm.pp.se ]( mailto:swmike@swm.pp.se )> wrote:
On Mon, 6 Jun 2016, Jonathan Morton wrote:

At 100ms buffering, their 10Gbps switch is effectively turning any DC it’s installed in into a transcontinental Internet path, as far as peak latency is concerned.  Just because RAM is cheap these days…Nono, nononononono. I can tell you they're spending serious money on inserting this kind of buffering memory into these kinds of devices. Buying these devices without deep buffers is a lot lower cost.

 These types of switch chips either have on-die memory (usually 16MB or less), or they have very expensive (a direct cost of lowered port density) off-chip buffering memory.

 Typically you do this:

 ports ---|-------
 ports ---|      |
 ports ---| chip |
 ports ---|-------

 Or you do this

 ports ---|------|---buffer
 ports ---| chip |---TCAM
          --------

 or if you do a multi-linecard-device

 ports ---|------|---buffer
          | chip |---TCAM
          --------
             |
         switch fabric

 (or any variant of them)

 So basically if you want to buffer and if you want large L2-L4 lookup tables, you have to sacrifice ports. Sacrifice lots of ports.

 So never say these kinds of devices add buffering because RAM is cheap. This is most definitely not why they're doing it. Buffer memory for them is EXTREMELY EXPENSIVE.

 -- 
 Mikael Abrahamsson    email: [ swmike@swm.pp.se ]( mailto:swmike@swm.pp.se )
_______________________________________________
 Cerowrt-devel mailing list
[ Cerowrt-devel@lists.bufferbloat.net ]( mailto:Cerowrt-devel@lists.bufferbloat.net )
[ https://lists.bufferbloat.net/listinfo/cerowrt-devel ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel )


[-- Attachment #2: Type: text/html, Size: 7997 bytes --]


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-07  2:58           ` dpreed
@ 2016-06-07 10:46             ` Mikael Abrahamsson
  2016-06-07 14:46               ` Dave Taht
  2016-06-07 17:51             ` Eric Johansson
  1 sibling, 1 reply; 26+ messages in thread
From: Mikael Abrahamsson @ 2016-06-07 10:46 UTC (permalink / raw)
  To: dpreed; +Cc: Jonathan Morton, cerowrt-devel

On Mon, 6 Jun 2016, dpreed@reed.com wrote:

> Even better, it would be fun to get access to an Arista switch and some 
> high performance TCP sources and sinks, and demonstrate extreme 
> bufferbloat compared to a small-buffer switch.  Just a demo, not a 
> simulation full of assumptions and guesses.

It can rightfully be argued that we don't need 100ms worth of buffering
(and here it actually is kind of correct to say "RAM is cheap", because
as soon as you go for off-chip RAM, adding more of it is cheap).

So these vendors have two choices:

1. 8-16MB on-chip buffer.
2. External RAM

If you choose the external RAM one, you might as well put a lot of RAM 
there, and give the option to the customer to configure the port buffer 
settings any way they want.

For the on-chip small buffer one, having 80 10GE ports, all sharing 8 
megabytes of buffer (let's say 10 ports are congesting, meaning each 
port gets 800 kilobytes of buffer) and each port doing 1.25 gigabytes/s of 
data, that's 0.64ms worth of buffer per congested port (I hope I got my 
math right). That is just too little unless you control the TCP stacks of 
the clients, and are just doing low-RTT communication.
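
(The same arithmetic spelled out, using the figures above; a minimal sketch.)

chip_buffer_bytes = 8 * 10**6            # 8 MB on-chip buffer shared by all ports
congested_ports   = 10
port_rate_Bps     = 1.25 * 10**9         # 10GE = 1.25 gigabytes/s per port

per_port_bytes = chip_buffer_bytes / congested_ports       # 800 kilobytes
per_port_ms    = per_port_bytes / port_rate_Bps * 1e3       # 0.64 ms
print("%.0f kB and %.2f ms of buffer per congested port"
      % (per_port_bytes / 1e3, per_port_ms))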

So while I'd admit that 100ms worth of FIFO is too much, what needs to 
happen now is to have them configured to do something clever, aiming to 
never have prolonged use of more than a few ms worth of buffer.

It's hard to do AQM with half a millisecond worth of buffer, right?

At least this was shown by a previous generation of datacenter switches 
that had minuscule buffers: ISPs tried to use them, and when there were 
microbursts there was uncontrolled packet loss.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-07 10:46             ` Mikael Abrahamsson
@ 2016-06-07 14:46               ` Dave Taht
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Taht @ 2016-06-07 14:46 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: David Reed, Jonathan Morton, cerowrt-devel

On Tue, Jun 7, 2016 at 3:46 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
> On Mon, 6 Jun 2016, dpreed@reed.com wrote:
>
>> Even better, it would be fun to get access to an Arista switch and some
>> high performance TCP sources and sinks, and demonstrate extreme bufferbloat
>> compared to a small-buffer switch.  Just a demo, not a simulation full of
>> assumptions and guesses.

In terms of doing this at low cost, we can pretty easily set up a Linux
box nowadays that can forward at 10GigE using Mellanox hardware.

In terms of finding a (set of) cheap 10GigE-capable switches, the
needed investment looks to be in the 20k range to buy one. (?) That is
essentially more than the entire cerowrt hardware budget for the past 5
years....

>
> So while it can be rightfully argued that we don't need 100ms worth of
> buffering (here it actually is kind of correct to say "ram is cheap" because
> as soon as you go for offchip RAM, it's now cheap).
>
> So these vendors have two choices:
>
> 1. 8-16MB on-chip buffer.
> 2. External RAM
>
> If you choose the external RAM one, you might as well put a lot of RAM
> there, and give the option to the customer to configure the port buffer
> settings any way they want.
>
> For the on-chip small buffer one, having 80 10GE ports,all sharing 8
> megabyte of buffer (let's say 10 ports are congesting, meaning each port
> gets 800kilobytes of buffer) and each port doing 1.25gigabyte/s of data,
> that's 0.64ms worth of buffer per congested port (I hope I got my math
> right). That is just too little unless you control the TCP stacks of the
> clients, and are just doing low-RTT communication.
>
> So while I'd admit that 100ms worth of FIFO is too much, what needs to
> happen now is to have them configured to do something clever and aiming to
> never have prolonged use of more than a few ms worth of buffer.
>
> It's hard to do AQM with half a millisecond worth of buffer, right?
>
> At least this has been shown by previous generation of datacenter switches
> that had miniscule buffers and ISPs tried to use them and when there were
> microbursts there was uncontrolled packet loss.

Of possible interest: Measurement Lab encountered and thoroughly
debugged a microburst problem across their backbone last year.
This is a good read, although I wish the graphs were more directly
comparable.

https://www.measurementlab.net/publications/SwitchDiscardNotice-Final-20160525.pdf
>
> --
> Mikael Abrahamsson    email: swmike@swm.pp.se
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel



-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-07  2:58           ` dpreed
  2016-06-07 10:46             ` Mikael Abrahamsson
@ 2016-06-07 17:51             ` Eric Johansson
  2016-06-10 21:45               ` dpreed
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Johansson @ 2016-06-07 17:51 UTC (permalink / raw)
  To: cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 731 bytes --]



On 6/6/2016 10:58 PM, dpreed@reed.com wrote:
>
> Even better, it would be fun to get access to an Arista switch and
> some high performance TCP sources and sinks, and demonstrate extreme
> bufferbloat compared to a small-buffer switch.  Just a demo, not a
> simulation full of assumptions and guesses.
>

I'm in the middle of a server room/company move.  I can make available
an XSM4348S NETGEAR M4300-24X24F, and probably an Arista 7050T-52, for a
short time frame as part of my "testing".  Tell me what you need for a
test setup, give me a script I can run, and tell me where I should send
the results.  I really need a cut-and-paste test because I have no time
to think about anything more than the move.

thanks.


[-- Attachment #2: Type: text/html, Size: 1422 bytes --]


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-06 15:29 [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat Eric Johansson
  2016-06-06 16:53 ` Toke Høiland-Jørgensen
@ 2016-06-07 22:31 ` Valdis.Kletnieks
  1 sibling, 0 replies; 26+ messages in thread
From: Valdis.Kletnieks @ 2016-06-07 22:31 UTC (permalink / raw)
  To: Eric Johansson; +Cc: cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 333 bytes --]

On Mon, 06 Jun 2016 11:29:38 -0400, Eric Johansson said:

> Buffer bloat was relevant on 10/100M switches, not 10Gb switches. At
> 10Gb we can empty the queue in ~100ms

The users we've got on 10Gb ports complain when their RTT hits 10ms.

(And we've got plenty of boxes hanging on 40Gb ports. Not routers, servers.
Fun fun fun)



[-- Attachment #2: Type: application/pgp-signature, Size: 848 bytes --]


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-07 17:51             ` Eric Johansson
@ 2016-06-10 21:45               ` dpreed
  2016-06-11  1:36                 ` Jonathan Morton
  2016-06-11  8:25                 ` Sebastian Moeller
  0 siblings, 2 replies; 26+ messages in thread
From: dpreed @ 2016-06-10 21:45 UTC (permalink / raw)
  To: Eric Johansson; +Cc: cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 1619 bytes --]


Just today I found out that a datacenter my company's engineering group is expanding into is putting us on Arista 7050's. And our very preliminary tests of our systems there are showing what seems to be a latency problem under load.  I can't get in the way of the deployment process, but it's interesting/worrying that "big buffers" are there in the middle of our system, which is highly latency sensitive.

I may also need a diagnostic test that would detect the potential occurrence of bufferbloat within a 10 GigE switch, now.  Our software layers are not prepared to self-diagnose at the Ethernet layer very well.

My thought is to use an Ethernet ping while our system is loaded.  (Our protocol is at the Ethernet layer, no IP stack.) Anyone have an idea of the simplest way to do that?

On Tuesday, June 7, 2016 1:51pm, "Eric Johansson" <esj@eggo.org> said:



 

On 6/6/2016 10:58 PM, [ dpreed@reed.com ]( mailto:dpreed@reed.com ) wrote:
Even better, it would be fun to get access to an Arista switch and some high performance TCP sources and sinks, and demonstrate extreme bufferbloat compared to a small-buffer switch.  Just a demo, not a simulation full of assumptions and guesses.
 I'm in the middle of a server room/company move.  I can make a available a  XSM4348S NETGEAR M4300-24X24F, and probably an arista 7050T-52 for a short time frame as part of my "testing".  tell me what you need for a test setup, give me a script I can run and where I should send the results.  I really need a cut and paste test because I have no time to think about anything more than the move.

 thanks.


[-- Attachment #2: Type: text/html, Size: 2555 bytes --]


* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-10 21:45               ` dpreed
@ 2016-06-11  1:36                 ` Jonathan Morton
  2016-06-11  8:25                 ` Sebastian Moeller
  1 sibling, 0 replies; 26+ messages in thread
From: Jonathan Morton @ 2016-06-11  1:36 UTC (permalink / raw)
  To: dpreed; +Cc: Eric Johansson, cerowrt-devel


> On 11 Jun, 2016, at 00:45, dpreed@reed.com wrote:
> 
> My thought is to use an ethernet ping while our system is loaded.  (our protocol is at the Ethernet layer, no IP stack). Anyone have an idea of the simplest way to do that?

It should be possible to write a utility and a daemon which can listen for and produce raw Ethernet frames using some unused protocol number.  It might even turn out to be quite easy.

One quick-and-dirty solution would be to bring up an IP address (or just an ARP daemon) on the target host, then use arping.
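
(As a rough illustration of the first suggestion, the sender side of such a
utility is small on Linux; a minimal sketch, assuming root, an AF_PACKET raw
socket, an interface named "eth0" and the local-experimental EtherType
0x88B5.  The daemon that echoes the frames back, and the receive path that
computes RTT from the embedded timestamp, are left out.)

import socket, struct, time

IFACE = "eth0"                          # assumption: interface under test
ETHERTYPE = 0x88B5                      # IEEE 802 local experimental EtherType
DST = bytes.fromhex("ffffffffffff")     # assumption: broadcast; use the peer MAC

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETHERTYPE))
s.bind((IFACE, 0))
src_mac = s.getsockname()[4]            # this interface's MAC address

seq = 0
while True:
    payload = struct.pack("!Id", seq, time.monotonic())   # seq number + timestamp
    frame = DST + src_mac + struct.pack("!H", ETHERTYPE) + payload
    s.send(frame.ljust(64, b"\x00"))    # pad to the 64-byte minimum frame size
    seq += 1
    time.sleep(0.01)                    # 100 probes/s while the system is loaded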

 - Jonathan Morton



* Re: [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat
  2016-06-10 21:45               ` dpreed
  2016-06-11  1:36                 ` Jonathan Morton
@ 2016-06-11  8:25                 ` Sebastian Moeller
  1 sibling, 0 replies; 26+ messages in thread
From: Sebastian Moeller @ 2016-06-11  8:25 UTC (permalink / raw)
  To: dpreed, Eric Johansson; +Cc: cerowrt-devel

[-- Attachment #1: Type: text/plain, Size: 2313 bytes --]

Hi,

Maybe https://github.com/jwbensley/Etherate/blob/master/README.md could be of use here? I have not used it myself, but it seems to at least partly match your requirements, based on reading the README...

Best Regards
       Sebastian

On June 10, 2016 11:45:30 PM GMT+02:00, dpreed@reed.com wrote:
>
>Just today I found out that a datacenter my company's engineering group
>is expanding into is putting us on Arista 7050's. And our very
>preliminary tests of our systems there is showiung what seems to be a
>latency problem under load.  I can't get in the way of the deployment
>process, but it's interesting/worrying that "big buffers" are there in
>the middle of our system, which is highly latency sensitive.
>
>I may also need a diagnostic test that would detect the  potential
>occurence of bufferbloat within a 10 GigE switch, now.  Our software
>layers are not prepared to self-diagnose at the ethernet layer very
>well.
>
>My thought is to use an ethernet ping while our system is loaded.  (our
>protocol is at the Ethernet layer, no IP stack). Anyone have an idea of
>the simplest way to do that?
>
>On Tuesday, June 7, 2016 1:51pm, "Eric Johansson" <esj@eggo.org> said:
>
>
>
> 
>
>On 6/6/2016 10:58 PM, [ dpreed@reed.com ]( mailto:dpreed@reed.com )
>wrote:
>Even better, it would be fun to get access to an Arista switch and some
>high performance TCP sources and sinks, and demonstrate extreme
>bufferbloat compared to a small-buffer switch.  Just a demo, not a
>simulation full of assumptions and guesses.
>I'm in the middle of a server room/company move.  I can make a
>available a  XSM4348S NETGEAR M4300-24X24F, and probably an arista
>7050T-52 for a short time frame as part of my "testing".  tell me what
>you need for a test setup, give me a script I can run and where I
>should send the results.  I really need a cut and paste test because I
>have no time to think about anything more than the move.
>
> thanks.
>
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Cerowrt-devel mailing list
>Cerowrt-devel@lists.bufferbloat.net
>https://lists.bufferbloat.net/listinfo/cerowrt-devel

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

[-- Attachment #2: Type: text/html, Size: 3620 bytes --]


* [Cerowrt-devel] Bechtolschiem
  2016-06-07  2:52         ` dpreed
  2016-06-07  2:58           ` dpreed
@ 2021-07-02 16:42           ` Dave Taht
  2021-07-02 16:59             ` [Cerowrt-devel] [Bloat] Bechtolschiem Stephen Hemminger
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Taht @ 2021-07-02 16:42 UTC (permalink / raw)
  To: David Reed, bloat; +Cc: Ketan Kulkarni, Jonathan Morton, cerowrt-devel

"Debunking Bechtolsheim credibly would get a lot of attention to the
bufferbloat cause, I suspect." - dpreed

"Why Big Data Needs Big Buffer Switches" -
http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf

..

I think I've just gained access to a few networks with Arista gear in
the bottleneck path.

On Mon, Jun 6, 2016 at 7:52 PM <dpreed@reed.com> wrote:
>
> So did anyone write a response debunking their paper?   Their NS-2 simulation is most likely the erroneous part of their analysis - the white paper would not pass a review by qualified referees because there is no way to check their results and some of what they say beggars belief.
>
>
>
> Bechtolsheim is one of those guys who can write any damn thing and it becomes "truth" - mostly because he co-founded Sun. But that doesn't mean that he can't make huge errors - any of us can.
>
>
>
> The so-called TCP/IP Bandwidth Capture effect that he refers to doesn't sound like any capture effect I've ever heard of.  There is an "Ethernet Capture Effect" (which is cited), which is due to properties of CSMA/CD binary exponential backoff, not anything to do with TCP's flow/congestion control.  So it has that "truthiness" that makes glib people sound like they know what they are talking about, but I'd like to see a reference that says this is a property of TCP!
>
>
>
> What's interesting is that the reference to the Ethernet Capture Effect in that white paper proposes a solution that involves changing the backoff algorithm slightly at the Ethernet level - NOT increasing buffer size!
>
>
>
> Another thing that would probably improve matters a great deal would be to drop/ECN-mark packets when a contended output port on an Arista switch develops a backlog.  This will throttle TCP sources sharing the path.
>
>
>
> The comments in the white paper that say that ACK contention in TCP in the reverse direction are the problem that causes the "so-called TCP/IP Bandwidth Capture effect" that is invented by the authors appears to be hogwash of the first order.
>
>
>
> Debunking Bechtolsheim credibly would get a lot of attention to the bufferbloat cause, I suspect.
>
>
>
>
>
> On Monday, June 6, 2016 5:16pm, "Ketan Kulkarni" <ketkulka@gmail.com> said:
>
> some time back they had this whitepaper -
> "Why Big Data Needs Big Buffer Switches"
> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
> the type of apps they talk about is big data, hadoop etc
>
> On Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
>>
>> On Mon, 6 Jun 2016, Jonathan Morton wrote:
>>
>>> At 100ms buffering, their 10Gbps switch is effectively turning any DC it’s installed in into a transcontinental Internet path, as far as peak latency is concerned.  Just because RAM is cheap these days…
>>
>> Nono, nononononono. I can tell you they're spending serious money on inserting this kind of buffering memory into these kinds of devices. Buying these devices without deep buffers is a lot lower cost.
>>
>> These types of switch chips either have on-die memory (usually 16MB or less), or they have very expensive (a direct cost of lowered port density) off-chip buffering memory.
>>
>> Typically you do this:
>>
>> ports ---|-------
>> ports ---|      |
>> ports ---| chip |
>> ports ---|-------
>>
>> Or you do this
>>
>> ports ---|------|---buffer
>> ports ---| chip |---TCAM
>>          --------
>>
>> or if you do a multi-linecard-device
>>
>> ports ---|------|---buffer
>>          | chip |---TCAM
>>          --------
>>             |
>>         switch fabric
>>
>> (or any variant of them)
>>
>> So basically if you want to buffer and if you want large L2-L4 lookup tables, you have to sacrifice ports. Sacrifice lots of ports.
>>
>> So never say these kinds of devices add buffering because RAM is cheap. This is most definitely not why they're doing it. Buffer memory for them is EXTREMELY EXPENSIVE.
>>
>> --
>> Mikael Abrahamsson    email: swmike@swm.pp.se
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel



-- 
Latest Podcast:
https://www.linkedin.com/feed/update/urn:li:activity:6791014284936785920/

Dave Täht CTO, TekLibre, LLC


* Re: [Cerowrt-devel] [Bloat] Bechtolschiem
  2021-07-02 16:42           ` [Cerowrt-devel] Bechtolschiem Dave Taht
@ 2021-07-02 16:59             ` Stephen Hemminger
  2021-07-02 19:46               ` Matt Mathis
  2021-07-02 20:28               ` [Cerowrt-devel] [Bloat] Bechtolschiem Jonathan Morton
  0 siblings, 2 replies; 26+ messages in thread
From: Stephen Hemminger @ 2021-07-02 16:59 UTC (permalink / raw)
  To: Dave Taht
  Cc: David Reed, bloat, Jonathan Morton, Ketan Kulkarni, cerowrt-devel

On Fri, 2 Jul 2021 09:42:24 -0700
Dave Taht <dave.taht@gmail.com> wrote:

> "Debunking Bechtolsheim credibly would get a lot of attention to the
> bufferbloat cause, I suspect." - dpreed
> 
> "Why Big Data Needs Big Buffer Switches" -
> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
> 

Also, a lot depends on the TCP congestion control algorithm being used.
They are using NewReno, which only researchers use in real life.

Even TCP Cubic has gone through several revisions. In my experience, the
NS-2 models don't correlate well to real-world behavior.

In real world tests, TCP Cubic will consume any buffer it sees at a
congested link. Maybe that is what they mean by capture effect.

There is also a weird oscillation effect with multiple streams, where one
flow will take the buffer, then see a packet loss and back off, the
other flow will take over the buffer until it sees loss.



* Re: [Bloat] Bechtolschiem
  2021-07-02 16:59             ` [Cerowrt-devel] [Bloat] Bechtolschiem Stephen Hemminger
@ 2021-07-02 19:46               ` Matt Mathis
  2021-07-07 22:19                 ` [Cerowrt-devel] Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem) Bless, Roland (TM)
  2021-07-02 20:28               ` [Cerowrt-devel] [Bloat] Bechtolschiem Jonathan Morton
  1 sibling, 1 reply; 26+ messages in thread
From: Matt Mathis @ 2021-07-02 19:46 UTC (permalink / raw)
  To: Dave Taht
  Cc: Jonathan Morton, Stephen Hemminger, David Reed, Ketan Kulkarni,
	cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 2781 bytes --]

The argument is absolutely correct for Reno, CUBIC and all
other self-clocked protocols.  One of the core assumptions in Jacobson88
was that the clock for the entire system comes from packets draining
through the bottleneck queue.  In this world, the clock is intrinsically
brittle if the buffers are too small.  The drain time needs to be a
substantial fraction of the RTT.

However, we have reached the point where we need to discard that
requirement.  One of the side points of BBR is that in many environments it
is cheaper to burn serving CPU to pace into short queue networks than it is
to "right size" the network queues.

The fundamental problem with the old way is that in some contexts the
buffer memory has to beat Moore's law, because to maintain constant drain
time the memory size and BW both have to scale with the link (laser) BW.
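
(A crude illustration of that scaling pressure; a sketch assuming one tries
to hold a fixed 10 ms drain time as link rates grow.)

DRAIN_S = 0.01                                   # assumed constant drain time: 10 ms
for gbps in (10, 40, 100, 400, 800):
    buf_mb = gbps * 1e9 * DRAIN_S / 8 / 1e6      # bytes of buffer per port, in MB
    print("%4d Gbps -> %6.1f MB of buffer per port" % (gbps, buf_mb))
# 10 Gbps -> 12.5 MB ... 800 Gbps -> 1000 MB: buffer size must scale with link BW.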

See the slides I gave at the Stanford Buffer Sizing workshop, December
2019: "Buffer Sizing: Position Paper"
<https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>


Note that we are talking about DC and Internet core.  At the edge, BW is
low enough that memory is relatively cheap.  In some sense bufferbloat came
about because memory is too cheap in these environments.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of
control;
            too weak risks being mistaken for tacit approval.


On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <stephen@networkplumber.org>
wrote:

> On Fri, 2 Jul 2021 09:42:24 -0700
> Dave Taht <dave.taht@gmail.com> wrote:
>
> > "Debunking Bechtolsheim credibly would get a lot of attention to the
> > bufferbloat cause, I suspect." - dpreed
> >
> > "Why Big Data Needs Big Buffer Switches" -
> >
> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
> >
>
> Also, a lot depends on the TCP congestion control algorithm being used.
> They are using NewReno which only researchers use in real life.
>
> Even TCP Cubic has gone through several revisions. In my experience, the
> NS-2 models don't correlate well to real world behavior.
>
> In real world tests, TCP Cubic will consume any buffer it sees at a
> congested link. Maybe that is what they mean by capture effect.
>
> There is also a weird oscillation effect with multiple streams, where one
> flow will take the buffer, then see a packet loss and back off, the
> other flow will take over the buffer until it sees loss.
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>

[-- Attachment #2: Type: text/html, Size: 3954 bytes --]


* Re: [Cerowrt-devel] [Bloat] Bechtolschiem
  2021-07-02 16:59             ` [Cerowrt-devel] [Bloat] Bechtolschiem Stephen Hemminger
  2021-07-02 19:46               ` Matt Mathis
@ 2021-07-02 20:28               ` Jonathan Morton
  1 sibling, 0 replies; 26+ messages in thread
From: Jonathan Morton @ 2021-07-02 20:28 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Dave Taht, David Reed, bloat, Ketan Kulkarni, cerowrt-devel

> On 2 Jul, 2021, at 7:59 pm, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> In real world tests, TCP Cubic will consume any buffer it sees at a
> congested link. Maybe that is what they mean by capture effect.

First, I'll note that what they call "small buffer" corresponds to about a tenth of a millisecond at the port's link rate.  This would be ludicrously small at Internet scale, but is actually reasonable for datacentre conditions where RTTs are often in the microseconds.

Assuming the effect as described is real, it ultimately stems from a burst of traffic from a particular flow arriving at a queue that is *already* full.  Such bursts are expected from ack-clocked flows coming out of application-limited mode (ie. on completion of a disk read), in slow-start, or recovering from earlier losses.  It is also possible for a heavily coalesced ack to abruptly open the receive and congestion windows and trigger a send burst.  These bursts occur much less in paced flows, because the object of pacing is to avoid bursts.

The queue is full because tail drop upon queue overflow is the only congestion signal provided by the switch, and ack-clocked capacity-seeking transports naturally keep the queue as full as they can - especially under high statistical multiplexing conditions where a single multiplicative decrease event does not greatly reduce the total traffic demand. CUBIC arguably spends more time with the queue very close to full than Reno does, due to the plateau designed into it, but at these very short RTTs I would not be surprised if CUBIC is equivalent to Reno in practice.

The solution is to keep some normally-unused space in the queue for bursts of traffic to use occasionally.  This is most naturally done using ECN applied by some AQM algorithm, or the AQM can pre-emptively and selectively drop packets in Not-ECT flows.  And because the AQM is more likely to mark or drop packets from flows that occupy more link time or queue capacity, it has a natural equalising effect between flows.

Applying ECN requires some Layer 3 awareness in the switch, which might not be practical.  A simple alternative is to drop packets instead.  Single packet losses are easily recovered from by retransmission after approximately one RTT.  There are also emerging techniques for applying congestion signals at Layer 2, which can be converted into ECN signals at some convenient point downstream.

However it is achieved, the point is that keeping the *standing* queue down to some fraction of the total queue depth reserves space for accommodating those bursts which are expected occasionally in normal traffic.  Because those bursts are not lost, the flows experiencing them are not disadvantaged and the so-called "capture effect" will not occur.

 - Jonathan Morton


* [Cerowrt-devel] Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem)
  2021-07-02 19:46               ` Matt Mathis
@ 2021-07-07 22:19                 ` Bless, Roland (TM)
  2021-07-07 22:38                   ` Matt Mathis
  0 siblings, 1 reply; 26+ messages in thread
From: Bless, Roland (TM) @ 2021-07-07 22:19 UTC (permalink / raw)
  To: Matt Mathis, Dave Taht; +Cc: cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 4901 bytes --]

Hi Matt,

[sorry for the late reply, overlooked this one]

please, see comments inline.

On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
> The argument is absolutely correct for Reno, CUBIC and all 
> other self-clocked protocols.  One of the core assumptions in 
> Jacobson88, was that the clock for the entire system comes from 
> packets draining through the bottleneck queue.  In this world, the 
> clock is intrinsically brittle if the buffers are too small.  The 
> drain time needs to be a substantial fraction of the RTT.
I'd like to separate the functions here a bit:

1) "automatic pacing" by ACK clocking

2) congestion-window-based operation

I agree that the automatic pacing generated by the ACK clock (function 
1) is increasingly
distorted these days and may consequently cause micro bursts.
This can be mitigated by using paced sending, which I consider very useful.
However, I consider abandoning the (congestion) window-based approaches
with ACK feedback (function 2) as harmful:
a congestion window has an automatic self-stabilizing property, since the
ACK feedback also reflects the queuing delay and the congestion window
limits the amount of inflight data.
In contrast, rate-based senders risk instability: two senders in an
M/D/1 setting, each sending at 50% of the bottleneck rate on average,
both using paced sending at 120% of the average rate, suffice to cause
instability (the queue grows without bound).

IMHO, two approaches seem to be useful:
a) congestion-window-based operation with paced sending
b) rate-based/paced sending with limiting the amount of inflight data
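
(To illustrate the M/D/1 instability example above, a toy discrete-time
sketch with illustrative parameters, not the exact model from the paper:
two rate-based senders, each offered 0.5*C of traffic on average and pacing
their backlog out at 120% of that average, with no window or inflight limit.)

import random
random.seed(1)

C, PACE, STEPS = 1.0, 0.6, 10**6        # bottleneck rate, per-sender pacing, steps
backlog = [0.0, 0.0]                    # per-sender application backlog (packets)
queue = peak = 0.0                      # bottleneck queue and its running maximum

for t in range(1, STEPS + 1):
    for i in range(2):
        backlog[i] += random.expovariate(2.0)   # mean 0.5*C of new data per step
        sent = min(backlog[i], PACE)            # paced at 1.2 * (0.5*C), no window
        backlog[i] -= sent
        queue += sent
    queue = max(0.0, queue - C)                 # bottleneck drains at C per step
    peak = max(peak, queue)
    if t in (10**3, 10**4, 10**5, 10**6):
        print("after %7d steps: peak bottleneck queue ~%5.0f packets" % (t, peak))

# The offered load averages exactly C, so the bottleneck queue performs a
# zero-drift random walk: its excursions keep growing with time instead of
# settling.  A per-sender inflight (window) limit would bound the queue.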

>
> However, we have reached the point where we need to discard that 
> requirement.  One of the side points of BBR is that in many 
> environments it is cheaper to burn serving CPU to pace into short 
> queue networks than it is to "right size" the network queues.
>
> The fundamental problem with the old way is that in some contexts the 
> buffer memory has to beat Moore's law, because to maintain constant 
> drain time the memory size and BW both have to scale with the link 
> (laser) BW.
>
> See the slides I gave at the Stanford Buffer Sizing workshop december 
> 2019: Buffer Sizing: Position Paper 
> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5> 
>
>
Thanks for the pointer. I don't quite get the point that the buffer must
have a certain size to keep the ACK clock stable:
in the case of a non-application-limited sender, a very small buffer
suffices to let the ACK clock run steady. The large buffers were mainly
required for loss-based CCs to let the standing queue build up that keeps
the bottleneck busy during CWnd reduction after packet loss, thereby
keeping the (bottleneck link) utilization high.

Regards,

  Roland


> Note that we are talking about DC and Internet core.  At the edge, BW 
> is low enough where memory is relatively cheap.  In some sense BB came 
> about because memory is too cheap in these environments.
>
> Thanks,
> --MM--
> The best way to predict the future is to create it.  - Alan Kay
>
> We must not tolerate intolerance;
>        however our response must be carefully measured:
>             too strong would be hypocritical and risks spiraling out 
> of control;
>             too weak risks being mistaken for tacit approval.
>
>
> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger 
> <stephen@networkplumber.org <mailto:stephen@networkplumber.org>> wrote:
>
>     On Fri, 2 Jul 2021 09:42:24 -0700
>     Dave Taht <dave.taht@gmail.com <mailto:dave.taht@gmail.com>> wrote:
>
>     > "Debunking Bechtolsheim credibly would get a lot of attention to the
>     > bufferbloat cause, I suspect." - dpreed
>     >
>     > "Why Big Data Needs Big Buffer Switches" -
>     >
>     http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>     <http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf>
>     >
>
>     Also, a lot depends on the TCP congestion control algorithm being
>     used.
>     They are using NewReno which only researchers use in real life.
>
>     Even TCP Cubic has gone through several revisions. In my
>     experience, the
>     NS-2 models don't correlate well to real world behavior.
>
>     In real world tests, TCP Cubic will consume any buffer it sees at a
>     congested link. Maybe that is what they mean by capture effect.
>
>     There is also a weird oscillation effect with multiple streams,
>     where one
>     flow will take the buffer, then see a packet loss and back off, the
>     other flow will take over the buffer until it sees loss.
>
>     _______________________________________________
>
> _______________________________________________


[-- Attachment #2: Type: text/html, Size: 8065 bytes --]


* Re: Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem)
  2021-07-07 22:19                 ` [Cerowrt-devel] Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem) Bless, Roland (TM)
@ 2021-07-07 22:38                   ` Matt Mathis
  2021-07-08 11:24                     ` [Cerowrt-devel] " Bless, Roland (TM)
  0 siblings, 1 reply; 26+ messages in thread
From: Matt Mathis @ 2021-07-07 22:38 UTC (permalink / raw)
  To: Bless, Roland (TM); +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 5352 bytes --]

Actually BBR does have a window-based backup, which normally only comes
into play during load spikes and at very short RTTs.   It defaults to
2*minRTT*maxBW, which is twice the steady-state window in its normal paced
mode.

This is too large for short-queue routers in the Internet core, but it
helps a lot with cross traffic on large-queue edge routers.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of
control;
            too weak risks being mistaken for tacit approval.


On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) <roland.bless@kit.edu>
wrote:

> Hi Matt,
>
> [sorry for the late reply, overlooked this one]
>
> please, see comments inline.
>
> On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>
> The argument is absolutely correct for Reno, CUBIC and all
> other self-clocked protocols.  One of the core assumptions in Jacobson88,
> was that the clock for the entire system comes from packets draining
> through the bottleneck queue.  In this world, the clock is intrinsically
> brittle if the buffers are too small.  The drain time needs to be a
> substantial fraction of the RTT.
>
> I'd like to separate the functions here a bit:
>
> 1) "automatic pacing" by ACK clocking
>
> 2) congestion-window-based operation
>
> I agree that the automatic pacing generated by the ACK clock (function 1)
> is increasingly
> distorted these days and may consequently cause micro bursts.
> This can be mitigated by using paced sending, which I consider very
> useful.
> However, I consider abandoning the (congestion) window-based approaches
> with ACK feedback (function 2) as harmful:
> a congestion window has an automatic self-stabilizing property since the
> ACK feedback reflects
> also the queuing delay and the congestion window limits the amount of
> inflight data.
> In contrast, rate-based senders risk instability: two senders in an M/D/1
> setting, each sender sending with 50%
> bottleneck rate in average, both using paced sending at 120% of the
> average rate, suffice to cause
> instability (queue grows unlimited).
>
> IMHO, two approaches seem to be useful:
> a) congestion-window-based operation with paced sending
> b) rate-based/paced sending with limiting the amount of inflight data
>
>
> However, we have reached the point where we need to discard that
> requirement.  One of the side points of BBR is that in many environments it
> is cheaper to burn serving CPU to pace into short queue networks than it is
> to "right size" the network queues.
>
> The fundamental problem with the old way is that in some contexts the
> buffer memory has to beat Moore's law, because to maintain constant drain
> time the memory size and BW both have to scale with the link (laser) BW.
>
> See the slides I gave at the Stanford Buffer Sizing workshop december
> 2019: Buffer Sizing: Position Paper
> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>
>
> Thanks for the pointer. I don't quite get the point that the buffer must
> have a certain size to keep the ACK clock stable:
> in case of an non application-limited sender, a very small buffer suffices
> to let the ACK clock
> run steady. The large buffers were mainly required for loss-based CCs to
> let the standing queue
> build up that keeps the bottleneck busy during CWnd reduction after packet
> loss, thereby
> keeping the (bottleneck link) utilization high.
>
> Regards,
>
>  Roland
>
>
> Note that we are talking about DC and Internet core.  At the edge, BW is
> low enough where memory is relatively cheap.   In some sense BB came about
> because memory is too cheap in these environments.
>
> Thanks,
> --MM--
> The best way to predict the future is to create it.  - Alan Kay
>
> We must not tolerate intolerance;
>        however our response must be carefully measured:
>             too strong would be hypocritical and risks spiraling out of
> control;
>             too weak risks being mistaken for tacit approval.
>
>
> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <
> stephen@networkplumber.org> wrote:
>
>> On Fri, 2 Jul 2021 09:42:24 -0700
>> Dave Taht <dave.taht@gmail.com> wrote:
>>
>> > "Debunking Bechtolsheim credibly would get a lot of attention to the
>> > bufferbloat cause, I suspect." - dpreed
>> >
>> > "Why Big Data Needs Big Buffer Switches" -
>> >
>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>> >
>>
>> Also, a lot depends on the TCP congestion control algorithm being used.
>> They are using NewReno which only researchers use in real life.
>>
>> Even TCP Cubic has gone through several revisions. In my experience, the
>> NS-2 models don't correlate well to real world behavior.
>>
>> In real world tests, TCP Cubic will consume any buffer it sees at a
>> congested link. Maybe that is what they mean by capture effect.
>>
>> There is also a weird oscillation effect with multiple streams, where one
>> flow will take the buffer, then see a packet loss and back off, the
>> other flow will take over the buffer until it sees loss.
>>
>> _______________________________________________
>
> _______________________________________________
>
>
>

[-- Attachment #2: Type: text/html, Size: 8873 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Cerowrt-devel] Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem)
  2021-07-07 22:38                   ` Matt Mathis
@ 2021-07-08 11:24                     ` Bless, Roland (TM)
  2021-07-08 13:29                       ` Matt Mathis
  2021-07-08 13:29                       ` Neal Cardwell
  0 siblings, 2 replies; 26+ messages in thread
From: Bless, Roland (TM) @ 2021-07-08 11:24 UTC (permalink / raw)
  To: Matt Mathis; +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 6702 bytes --]

Hi Matt,

On 08.07.21 at 00:38 Matt Mathis wrote:
> Actually BBR does have a window based backup, which normally only 
> comes into play during load spikes and at very short RTTs.   It 
> defaults to 2*minRTT*maxBW, which is twice the steady state window in 
> it's normal paced mode.

So yes, BBR follows option b), but I guess that you are referring to
BBRv1 here.
We have shown in [1, Sec. III] that BBRv1 flows will *always* run
(conceptually) toward their above-quoted inflight cap of
2*minRTT*maxBW if more than one BBR flow is present at the bottleneck.
So strictly speaking, "which *normally only* comes
into play during load spikes and at very short RTTs" isn't true for
multiple BBRv1 flows.
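
A back-of-the-envelope sketch of that effect (my simplification in
Python, not the model from [1]): if every flow's maxBW estimate settles
near its fair share and its minRTT estimate stays near RTprop, the
aggregate inflight is pushed toward twice the path BDP, so roughly one
path BDP ends up as a standing queue.

    def standing_queue_in_bdp(n_flows, rtprop_s, bottleneck_bps):
        path_bdp = bottleneck_bps / 8 * rtprop_s
        # Per-flow inflight cap: 2 * est_minRTT * est_maxBW, with optimistic
        # estimates (in practice the RTT estimate inflates and the queue grows).
        per_flow_cap = 2 * (bottleneck_bps / n_flows / 8) * rtprop_s
        total_inflight = n_flows * per_flow_cap   # ~2 * path BDP
        queue = total_inflight - path_bdp         # what does not fit on the wire
        return queue / path_bdp

    print(standing_queue_in_bdp(4, 0.02, 1e9))    # ~1.0 BDP standing queue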

It seems that in BBRv2 there are many more mechanisms present
that try to control the amount of inflight data more tightly, and the
new "cap" is at 1.25 BDP.

> This is too large for short queue routers in the Internet core, but it 
> helps a lot with cross traffic on large queue edge routers.

Best regards,
  Roland

[1] https://ieeexplore.ieee.org/document/8117540

>
> On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) 
> <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>
>     Hi Matt,
>
>     [sorry for the late reply, overlooked this one]
>
>     please, see comments inline.
>
>     On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>     The argument is absolutely correct for Reno, CUBIC and all
>>     other self-clocked protocols.  One of the core assumptions in
>>     Jacobson88, was that the clock for the entire system comes from
>>     packets draining through the bottleneck queue.  In this world,
>>     the clock is intrinsically brittle if the buffers are too small.
>>     The drain time needs to be a substantial fraction of the RTT.
>     I'd like to separate the functions here a bit:
>
>     1) "automatic pacing" by ACK clocking
>
>     2) congestion-window-based operation
>
>     I agree that the automatic pacing generated by the ACK clock
>     (function 1) is increasingly
>     distorted these days and may consequently cause micro bursts.
>     This can be mitigated by using paced sending, which I consider
>     very useful.
>     However, I consider abandoning the (congestion) window-based
>     approaches
>     with ACK feedback (function 2) as harmful:
>     a congestion window has an automatic self-stabilizing property
>     since the ACK feedback reflects
>     also the queuing delay and the congestion window limits the amount
>     of inflight data.
>     In contrast, rate-based senders risk instability: two senders in
>     an M/D/1 setting, each sender sending with 50%
>     bottleneck rate in average, both using paced sending at 120% of
>     the average rate, suffice to cause
>     instability (queue grows unlimited).
>
>     IMHO, two approaches seem to be useful:
>     a) congestion-window-based operation with paced sending
>     b) rate-based/paced sending with limiting the amount of inflight data
>
>>
>>     However, we have reached the point where we need to discard that
>>     requirement.  One of the side points of BBR is that in many
>>     environments it is cheaper to burn serving CPU to pace into short
>>     queue networks than it is to "right size" the network queues.
>>
>>     The fundamental problem with the old way is that in some contexts
>>     the buffer memory has to beat Moore's law, because to maintain
>>     constant drain time the memory size and BW both have to scale
>>     with the link (laser) BW.
>>
>>     See the slides I gave at the Stanford Buffer Sizing workshop
>>     december 2019: Buffer Sizing: Position Paper
>>     <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>
>>
>     Thanks for the pointer. I don't quite get the point that the
>     buffer must have a certain size to keep the ACK clock stable:
>     in case of an non application-limited sender, a very small buffer
>     suffices to let the ACK clock
>     run steady. The large buffers were mainly required for loss-based
>     CCs to let the standing queue
>     build up that keeps the bottleneck busy during CWnd reduction
>     after packet loss, thereby
>     keeping the (bottleneck link) utilization high.
>
>     Regards,
>
>      Roland
>
>
>>     Note that we are talking about DC and Internet core.  At the
>>     edge, BW is low enough where memory is relatively cheap.   In
>>     some sense BB came about because memory is too cheap in these
>>     environments.
>>
>>     Thanks,
>>     --MM--
>>     The best way to predict the future is to create it.  - Alan Kay
>>
>>     We must not tolerate intolerance;
>>            however our response must be carefully measured:
>>                 too strong would be hypocritical and risks spiraling
>>     out of control;
>>                 too weak risks being mistaken for tacit approval.
>>
>>
>>     On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger
>>     <stephen@networkplumber.org <mailto:stephen@networkplumber.org>>
>>     wrote:
>>
>>         On Fri, 2 Jul 2021 09:42:24 -0700
>>         Dave Taht <dave.taht@gmail.com <mailto:dave.taht@gmail.com>>
>>         wrote:
>>
>>         > "Debunking Bechtolsheim credibly would get a lot of
>>         attention to the
>>         > bufferbloat cause, I suspect." - dpreed
>>         >
>>         > "Why Big Data Needs Big Buffer Switches" -
>>         >
>>         http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>         <http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf>
>>         >
>>
>>         Also, a lot depends on the TCP congestion control algorithm
>>         being used.
>>         They are using NewReno which only researchers use in real life.
>>
>>         Even TCP Cubic has gone through several revisions. In my
>>         experience, the
>>         NS-2 models don't correlate well to real world behavior.
>>
>>         In real world tests, TCP Cubic will consume any buffer it
>>         sees at a
>>         congested link. Maybe that is what they mean by capture effect.
>>
>>         There is also a weird oscillation effect with multiple
>>         streams, where one
>>         flow will take the buffer, then see a packet loss and back
>>         off, the
>>         other flow will take over the buffer until it sees loss.
>>
>>         _______________________________________________
>>
>>     _______________________________________________
>


[-- Attachment #2: Type: text/html, Size: 11660 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem)
  2021-07-08 11:24                     ` [Cerowrt-devel] " Bless, Roland (TM)
@ 2021-07-08 13:29                       ` Matt Mathis
  2021-07-08 14:05                         ` [Cerowrt-devel] " Bless, Roland (TM)
  2021-07-08 14:40                         ` [Cerowrt-devel] [Bloat] Abandoning Window-based CC Considered Harmful (was Bechtolschiem) Jonathan Morton
  2021-07-08 13:29                       ` Neal Cardwell
  1 sibling, 2 replies; 26+ messages in thread
From: Matt Mathis @ 2021-07-08 13:29 UTC (permalink / raw)
  To: Bless, Roland (TM); +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 7528 bytes --]

I think there is something missing from your model.  I just scanned your
paper and noticed that you made no mention of rounding errors, nor of some
details around the drain phase timing.  The implementation guarantees that
the actual average rate across the combined BW probe and drain is strictly
less than the measured maxBW, and that the flight size comes back down to
minRTT*maxBW before returning to unity pacing gain.  In some sense these
checks are redundant, but if you don't do them, it is absolutely true that
you are at risk of seeing divergent behaviors.
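
As a rough sketch of those two checks (my own illustration assuming the
simplified BBRv1-style gain cycle, not the actual implementation):

    PROBE_GAIN, DRAIN_GAIN = 1.25, 0.75

    def next_pacing_gain(inflight, min_rtt_s, max_bw_bps, phase):
        """Return (pacing_gain, next_phase) for a simplified probe/drain cycle."""
        bdp = max_bw_bps / 8 * min_rtt_s
        if phase == "probe":
            return DRAIN_GAIN, "drain"      # a probe is always followed by a drain
        if phase == "drain" and inflight > bdp:
            return DRAIN_GAIN, "drain"      # hold drain until inflight <= minRTT*maxBW
        return 1.0, "cruise"                # only then return to unity pacing gain

    # Averaged over probe + drain the pacing rate is at most
    # (1.25 + 0.75) / 2 = 1.0 of maxBW, and extending the drain while
    # inflight exceeds the BDP keeps the average strictly below it, so the
    # cycle by itself cannot ratchet the queue upward.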

That said, it is also true that multi-stream BBR behavior is quite
complicated and needs more queue space than single stream.   This
complicates the story around the traditional workaround of using multiple
streams to compensate for Reno & CUBIC lameness at larger scales (ordinary
scales today).    Multi-stream does not help BBR throughput and raises the
queue occupancy, to the detriment of other users.

And yes, in my presentation, I described the core BBR algorithms as a
framework, which might be extended to incorporate many additional
algorithms if they provide optimal control in some settings.  And yes,
several are present in BBRv2.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of
control;
            too weak risks being mistaken for tacit approval.


On Thu, Jul 8, 2021 at 4:24 AM Bless, Roland (TM) <roland.bless@kit.edu>
wrote:

> Hi Matt,
>
> On 08.07.21 at 00:38 Matt Mathis wrote:
>
> Actually BBR does have a window based backup, which normally only comes
> into play during load spikes and at very short RTTs.   It defaults to
> 2*minRTT*maxBW, which is twice the steady state window in it's normal paced
> mode.
>
> So yes, BBR follows option b), but I guess that you are referring to BBRv1
> here.
> We have shown in [1, Sec.III] that BBRv1 flows will *always* run
> (conceptually) toward their above quoted inflight-cap of
> 2*minRTT*maxBW, if more than one BBR flow is present at the bottleneck. So
> strictly speaking " which *normally only* comes
> into play during load spikes and at very short RTTs" isn't true for
> multiple BBRv1 flows.
>
> It seems that in BBRv2 there are many more mechanisms present
> that try to control the amount of inflight data more tightly and the new
> "cap"
> is at 1.25 BDP.
>
> This is too large for short queue routers in the Internet core, but it
> helps a lot with cross traffic on large queue edge routers.
>
> Best regards,
>  Roland
>
> [1] https://ieeexplore.ieee.org/document/8117540
>
>
> On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) <roland.bless@kit.edu>
> wrote:
>
>> Hi Matt,
>>
>> [sorry for the late reply, overlooked this one]
>>
>> please, see comments inline.
>>
>> On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>
>> The argument is absolutely correct for Reno, CUBIC and all
>> other self-clocked protocols.  One of the core assumptions in Jacobson88,
>> was that the clock for the entire system comes from packets draining
>> through the bottleneck queue.  In this world, the clock is intrinsically
>> brittle if the buffers are too small.  The drain time needs to be a
>> substantial fraction of the RTT.
>>
>> I'd like to separate the functions here a bit:
>>
>> 1) "automatic pacing" by ACK clocking
>>
>> 2) congestion-window-based operation
>>
>> I agree that the automatic pacing generated by the ACK clock (function 1)
>> is increasingly
>> distorted these days and may consequently cause micro bursts.
>> This can be mitigated by using paced sending, which I consider very
>> useful.
>> However, I consider abandoning the (congestion) window-based approaches
>> with ACK feedback (function 2) as harmful:
>> a congestion window has an automatic self-stabilizing property since the
>> ACK feedback reflects
>> also the queuing delay and the congestion window limits the amount of
>> inflight data.
>> In contrast, rate-based senders risk instability: two senders in an M/D/1
>> setting, each sender sending with 50%
>> bottleneck rate in average, both using paced sending at 120% of the
>> average rate, suffice to cause
>> instability (queue grows unlimited).
>>
>> IMHO, two approaches seem to be useful:
>> a) congestion-window-based operation with paced sending
>> b) rate-based/paced sending with limiting the amount of inflight data
>>
>>
>> However, we have reached the point where we need to discard that
>> requirement.  One of the side points of BBR is that in many environments it
>> is cheaper to burn serving CPU to pace into short queue networks than it is
>> to "right size" the network queues.
>>
>> The fundamental problem with the old way is that in some contexts the
>> buffer memory has to beat Moore's law, because to maintain constant drain
>> time the memory size and BW both have to scale with the link (laser) BW.
>>
>> See the slides I gave at the Stanford Buffer Sizing workshop december
>> 2019: Buffer Sizing: Position Paper
>> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>
>>
>> Thanks for the pointer. I don't quite get the point that the buffer must
>> have a certain size to keep the ACK clock stable:
>> in case of an non application-limited sender, a very small buffer
>> suffices to let the ACK clock
>> run steady. The large buffers were mainly required for loss-based CCs to
>> let the standing queue
>> build up that keeps the bottleneck busy during CWnd reduction after
>> packet loss, thereby
>> keeping the (bottleneck link) utilization high.
>>
>> Regards,
>>
>>  Roland
>>
>>
>> Note that we are talking about DC and Internet core.  At the edge, BW is
>> low enough where memory is relatively cheap.   In some sense BB came about
>> because memory is too cheap in these environments.
>>
>> Thanks,
>> --MM--
>> The best way to predict the future is to create it.  - Alan Kay
>>
>> We must not tolerate intolerance;
>>        however our response must be carefully measured:
>>             too strong would be hypocritical and risks spiraling out of
>> control;
>>             too weak risks being mistaken for tacit approval.
>>
>>
>> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <
>> stephen@networkplumber.org> wrote:
>>
>>> On Fri, 2 Jul 2021 09:42:24 -0700
>>> Dave Taht <dave.taht@gmail.com> wrote:
>>>
>>> > "Debunking Bechtolsheim credibly would get a lot of attention to the
>>> > bufferbloat cause, I suspect." - dpreed
>>> >
>>> > "Why Big Data Needs Big Buffer Switches" -
>>> >
>>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>> >
>>>
>>> Also, a lot depends on the TCP congestion control algorithm being used.
>>> They are using NewReno which only researchers use in real life.
>>>
>>> Even TCP Cubic has gone through several revisions. In my experience, the
>>> NS-2 models don't correlate well to real world behavior.
>>>
>>> In real world tests, TCP Cubic will consume any buffer it sees at a
>>> congested link. Maybe that is what they mean by capture effect.
>>>
>>> There is also a weird oscillation effect with multiple streams, where one
>>> flow will take the buffer, then see a packet loss and back off, the
>>> other flow will take over the buffer until it sees loss.
>>>
>>> _______________________________________________
>>
>> _______________________________________________
>>
>>
>>
>

[-- Attachment #2: Type: text/html, Size: 13458 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 11:24                     ` [Cerowrt-devel] " Bless, Roland (TM)
  2021-07-08 13:29                       ` Matt Mathis
@ 2021-07-08 13:29                       ` Neal Cardwell
  1 sibling, 0 replies; 26+ messages in thread
From: Neal Cardwell @ 2021-07-08 13:29 UTC (permalink / raw)
  To: Bless, Roland (TM); +Cc: Matt Mathis, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 6142 bytes --]

On Thu, Jul 8, 2021 at 7:25 AM Bless, Roland (TM) <roland.bless@kit.edu>
wrote:

> It seems that in BBRv2 there are many more mechanisms present
> that try to control the amount of inflight data more tightly and the new
> "cap"
> is at 1.25 BDP.
>
To clarify, the BBRv2 cwnd cap is not 1.25*BDP. If there is no packet loss
or ECN, the BBRv2 cwnd cap is the same as BBRv1. But if there has been
packet loss then conceptually the cwnd cap is the maximum amount of data
delivered in a single round trip since the last packet loss (with a floor
to ensure that the cwnd does not decrease by more than 30% per round trip
with packet loss, similar to CUBIC's 30% reduction in a round trip with
packet loss). (And upon RTO the BBR (v1 or v2) cwnd is reset to 1, and
slow-starts upward from there.)
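
Paraphrasing that loss response in code (my sketch with made-up names,
not the actual BBRv2 sources):

    def bbr2_cwnd_after_loss_round(cwnd, max_delivered_since_loss, mss):
        """End of a round trip that contained packet loss."""
        cap = max(max_delivered_since_loss, 0.7 * cwnd)  # floor: ~30% cut at most
        return max(min(cwnd, cap), mss)                  # never below one segment

    def bbr_cwnd_after_rto(mss):
        """On RTO the cwnd collapses to one segment and slow-starts from there."""
        return mss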

There is an overview of the BBRv2 response to packet loss here:

https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00#page=18

best,
neal



> This is too large for short queue routers in the Internet core, but it
> helps a lot with cross traffic on large queue edge routers.
>
> Best regards,
>  Roland
>
> [1] https://ieeexplore.ieee.org/document/8117540
>
>
> On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) <roland.bless@kit.edu>
> wrote:
>
>> Hi Matt,
>>
>> [sorry for the late reply, overlooked this one]
>>
>> please, see comments inline.
>>
>> On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>
>> The argument is absolutely correct for Reno, CUBIC and all
>> other self-clocked protocols.  One of the core assumptions in Jacobson88,
>> was that the clock for the entire system comes from packets draining
>> through the bottleneck queue.  In this world, the clock is intrinsically
>> brittle if the buffers are too small.  The drain time needs to be a
>> substantial fraction of the RTT.
>>
>> I'd like to separate the functions here a bit:
>>
>> 1) "automatic pacing" by ACK clocking
>>
>> 2) congestion-window-based operation
>>
>> I agree that the automatic pacing generated by the ACK clock (function 1)
>> is increasingly
>> distorted these days and may consequently cause micro bursts.
>> This can be mitigated by using paced sending, which I consider very
>> useful.
>> However, I consider abandoning the (congestion) window-based approaches
>> with ACK feedback (function 2) as harmful:
>> a congestion window has an automatic self-stabilizing property since the
>> ACK feedback reflects
>> also the queuing delay and the congestion window limits the amount of
>> inflight data.
>> In contrast, rate-based senders risk instability: two senders in an M/D/1
>> setting, each sender sending with 50%
>> bottleneck rate in average, both using paced sending at 120% of the
>> average rate, suffice to cause
>> instability (queue grows unlimited).
>>
>> IMHO, two approaches seem to be useful:
>> a) congestion-window-based operation with paced sending
>> b) rate-based/paced sending with limiting the amount of inflight data
>>
>>
>> However, we have reached the point where we need to discard that
>> requirement.  One of the side points of BBR is that in many environments it
>> is cheaper to burn serving CPU to pace into short queue networks than it is
>> to "right size" the network queues.
>>
>> The fundamental problem with the old way is that in some contexts the
>> buffer memory has to beat Moore's law, because to maintain constant drain
>> time the memory size and BW both have to scale with the link (laser) BW.
>>
>> See the slides I gave at the Stanford Buffer Sizing workshop december
>> 2019: Buffer Sizing: Position Paper
>> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>
>>
>> Thanks for the pointer. I don't quite get the point that the buffer must
>> have a certain size to keep the ACK clock stable:
>> in case of an non application-limited sender, a very small buffer
>> suffices to let the ACK clock
>> run steady. The large buffers were mainly required for loss-based CCs to
>> let the standing queue
>> build up that keeps the bottleneck busy during CWnd reduction after
>> packet loss, thereby
>> keeping the (bottleneck link) utilization high.
>>
>> Regards,
>>
>>  Roland
>>
>>
>> Note that we are talking about DC and Internet core.  At the edge, BW is
>> low enough where memory is relatively cheap.   In some sense BB came about
>> because memory is too cheap in these environments.
>>
>> Thanks,
>> --MM--
>> The best way to predict the future is to create it.  - Alan Kay
>>
>> We must not tolerate intolerance;
>>        however our response must be carefully measured:
>>             too strong would be hypocritical and risks spiraling out of
>> control;
>>             too weak risks being mistaken for tacit approval.
>>
>>
>> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <
>> stephen@networkplumber.org> wrote:
>>
>>> On Fri, 2 Jul 2021 09:42:24 -0700
>>> Dave Taht <dave.taht@gmail.com> wrote:
>>>
>>> > "Debunking Bechtolsheim credibly would get a lot of attention to the
>>> > bufferbloat cause, I suspect." - dpreed
>>> >
>>> > "Why Big Data Needs Big Buffer Switches" -
>>> >
>>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>> >
>>>
>>> Also, a lot depends on the TCP congestion control algorithm being used.
>>> They are using NewReno which only researchers use in real life.
>>>
>>> Even TCP Cubic has gone through several revisions. In my experience, the
>>> NS-2 models don't correlate well to real world behavior.
>>>
>>> In real world tests, TCP Cubic will consume any buffer it sees at a
>>> congested link. Maybe that is what they mean by capture effect.
>>>
>>> There is also a weird oscillation effect with multiple streams, where one
>>> flow will take the buffer, then see a packet loss and back off, the
>>> other flow will take over the buffer until it sees loss.
>>>
>>> _______________________________________________
>>
>> _______________________________________________
>>
>>
>>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>

[-- Attachment #2: Type: text/html, Size: 12003 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Cerowrt-devel] Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem)
  2021-07-08 13:29                       ` Matt Mathis
@ 2021-07-08 14:05                         ` Bless, Roland (TM)
  2021-07-08 14:40                         ` [Cerowrt-devel] [Bloat] Abandoning Window-based CC Considered Harmful (was Bechtolschiem) Jonathan Morton
  1 sibling, 0 replies; 26+ messages in thread
From: Bless, Roland (TM) @ 2021-07-08 14:05 UTC (permalink / raw)
  To: Matt Mathis; +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 9727 bytes --]

Hi Matt,

On 08.07.21 at 15:29 Matt Mathis wrote:
> I think there is something missing from your model.    I just scanned 
> your paper and noticed that you made no mention of rounding errors, 
> nor some details around the drain phase timing,   The 
> implementation guarantees that the actual average rate across the 
> combined BW probe and drain is strictly less than the measured maxBW 
> and that the flight size comes back down to minRTT*maxBW before 
> returning to unity pacing gain.  In some sense these checks are 
> redundant, but If you don't do them, it is absolutely true that you 
> are at risk of seeing divergent behaviors.
Sure, most models abstract things away, and ours likewise leaves out
some details, but it describes quite accurately what happens if multiple
BBRv1 flows are present. The model was confirmed not only by our
own measurements, but also by many others who ran BBRv1 experiments.
> That said, it is also true that multi-stream BBR behavior is quite 
> complicated and needs more queue space than single stream.   This
Yes, mostly between 1 BDP and 1.5 BDP of queue space.
> complicates the story around the traditional workaround of using 
> multiple streams to compensate for Reno & CUBIC lameness at larger 
> scales (ordinary scales today). Multi-stream does not help BBR 
> throughput and raises the queue occupancy, to the detriment of other 
> users.
>
> And yes, in my presentation, I described the core BBR algorithms as a 
> framework, which might be extended to incorporate many additional 
> algorithms if they provide optimal control in some settings.  And yes, 
> several are present in BBRv2.

Ok, thanks for clarification.

Regards,
  Roland

> Thanks,
> --MM--
> The best way to predict the future is to create it.  - Alan Kay
>
> We must not tolerate intolerance;
>        however our response must be carefully measured:
>             too strong would be hypocritical and risks spiraling out 
> of control;
>             too weak risks being mistaken for tacit approval.
>
>
> On Thu, Jul 8, 2021 at 4:24 AM Bless, Roland (TM) 
> <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>
>     Hi Matt,
>
>     On 08.07.21 at 00:38 Matt Mathis wrote:
>>     Actually BBR does have a window based backup, which normally only
>>     comes into play during load spikes and at very short RTTs.   It
>>     defaults to 2*minRTT*maxBW, which is twice the steady state
>>     window in it's normal paced mode.
>
>     So yes, BBR follows option b), but I guess that you are referring
>     to BBRv1 here.
>     We have shown in [1, Sec.III] that BBRv1 flows will *always* run
>     (conceptually) toward their above quoted inflight-cap of
>     2*minRTT*maxBW, if more than one BBR flow is present at the
>     bottleneck. So strictly speaking " which *normally only* comes
>     into play during load spikes and at very short RTTs" isn't true
>     for multiple BBRv1 flows.
>
>     It seems that in BBRv2 there are many more mechanisms present
>     that try to control the amount of inflight data more tightly and
>     the new "cap"
>     is at 1.25 BDP.
>
>>     This is too large for short queue routers in the Internet core,
>>     but it helps a lot with cross traffic on large queue edge routers.
>
>     Best regards,
>      Roland
>
>     [1] https://ieeexplore.ieee.org/document/8117540
>     <https://ieeexplore.ieee.org/document/8117540>
>
>>
>>     On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM)
>>     <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>>
>>         Hi Matt,
>>
>>         [sorry for the late reply, overlooked this one]
>>
>>         please, see comments inline.
>>
>>         On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>>         The argument is absolutely correct for Reno, CUBIC and all
>>>         other self-clocked protocols.  One of the core assumptions
>>>         in Jacobson88, was that the clock for the entire system
>>>         comes from packets draining through the bottleneck queue. 
>>>         In this world, the clock is intrinsically brittle if the
>>>         buffers are too small.  The drain time needs to be a
>>>         substantial fraction of the RTT.
>>         I'd like to separate the functions here a bit:
>>
>>         1) "automatic pacing" by ACK clocking
>>
>>         2) congestion-window-based operation
>>
>>         I agree that the automatic pacing generated by the ACK clock
>>         (function 1) is increasingly
>>         distorted these days and may consequently cause micro bursts.
>>         This can be mitigated by using paced sending, which I
>>         consider very useful.
>>         However, I consider abandoning the (congestion) window-based
>>         approaches
>>         with ACK feedback (function 2) as harmful:
>>         a congestion window has an automatic self-stabilizing
>>         property since the ACK feedback reflects
>>         also the queuing delay and the congestion window limits the
>>         amount of inflight data.
>>         In contrast, rate-based senders risk instability: two senders
>>         in an M/D/1 setting, each sender sending with 50%
>>         bottleneck rate in average, both using paced sending at 120%
>>         of the average rate, suffice to cause
>>         instability (queue grows unlimited).
>>
>>         IMHO, two approaches seem to be useful:
>>         a) congestion-window-based operation with paced sending
>>         b) rate-based/paced sending with limiting the amount of
>>         inflight data
>>
>>>
>>>         However, we have reached the point where we need to discard
>>>         that requirement.  One of the side points of BBR is that in
>>>         many environments it is cheaper to burn serving CPU to pace
>>>         into short queue networks than it is to "right size" the
>>>         network queues.
>>>
>>>         The fundamental problem with the old way is that in some
>>>         contexts the buffer memory has to beat Moore's law, because
>>>         to maintain constant drain time the memory size and BW both
>>>         have to scale with the link (laser) BW.
>>>
>>>         See the slides I gave at the Stanford Buffer Sizing workshop
>>>         december 2019: Buffer Sizing: Position Paper
>>>         <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>>
>>>
>>         Thanks for the pointer. I don't quite get the point that the
>>         buffer must have a certain size to keep the ACK clock stable:
>>         in case of an non application-limited sender, a very small
>>         buffer suffices to let the ACK clock
>>         run steady. The large buffers were mainly required for
>>         loss-based CCs to let the standing queue
>>         build up that keeps the bottleneck busy during CWnd reduction
>>         after packet loss, thereby
>>         keeping the (bottleneck link) utilization high.
>>
>>         Regards,
>>
>>          Roland
>>
>>
>>>         Note that we are talking about DC and Internet core.  At the
>>>         edge, BW is low enough where memory is relatively cheap. 
>>>          In some sense BB came about because memory is too cheap in
>>>         these environments.
>>>
>>>         Thanks,
>>>         --MM--
>>>         The best way to predict the future is to create it.  - Alan Kay
>>>
>>>         We must not tolerate intolerance;
>>>                however our response must be carefully measured:
>>>                     too strong would be hypocritical and risks
>>>         spiraling out of control;
>>>                     too weak risks being mistaken for tacit approval.
>>>
>>>
>>>         On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger
>>>         <stephen@networkplumber.org
>>>         <mailto:stephen@networkplumber.org>> wrote:
>>>
>>>             On Fri, 2 Jul 2021 09:42:24 -0700
>>>             Dave Taht <dave.taht@gmail.com
>>>             <mailto:dave.taht@gmail.com>> wrote:
>>>
>>>             > "Debunking Bechtolsheim credibly would get a lot of
>>>             attention to the
>>>             > bufferbloat cause, I suspect." - dpreed
>>>             >
>>>             > "Why Big Data Needs Big Buffer Switches" -
>>>             >
>>>             http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>>             <http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf>
>>>             >
>>>
>>>             Also, a lot depends on the TCP congestion control
>>>             algorithm being used.
>>>             They are using NewReno which only researchers use in
>>>             real life.
>>>
>>>             Even TCP Cubic has gone through several revisions. In my
>>>             experience, the
>>>             NS-2 models don't correlate well to real world behavior.
>>>
>>>             In real world tests, TCP Cubic will consume any buffer
>>>             it sees at a
>>>             congested link. Maybe that is what they mean by capture
>>>             effect.
>>>
>>>             There is also a weird oscillation effect with multiple
>>>             streams, where one
>>>             flow will take the buffer, then see a packet loss and
>>>             back off, the
>>>             other flow will take over the buffer until it sees loss.
>>>
>>>             _______________________________________________
>>>
>>>         _______________________________________________
>>
>


[-- Attachment #2: Type: text/html, Size: 18241 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Cerowrt-devel] [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 13:29                       ` Matt Mathis
  2021-07-08 14:05                         ` [Cerowrt-devel] " Bless, Roland (TM)
@ 2021-07-08 14:40                         ` Jonathan Morton
  2021-07-08 20:14                           ` David P. Reed
  1 sibling, 1 reply; 26+ messages in thread
From: Jonathan Morton @ 2021-07-08 14:40 UTC (permalink / raw)
  To: Matt Mathis; +Cc: Bless, Roland (TM), cerowrt-devel, bloat

> On 8 Jul, 2021, at 4:29 pm, Matt Mathis via Bloat <bloat@lists.bufferbloat.net> wrote:
> 
> That said, it is also true that multi-stream BBR behavior is quite complicated and needs more queue space than single stream.   This complicates the story around the traditional workaround of using multiple streams to compensate for Reno & CUBIC lameness at larger scales (ordinary scales today).    Multi-stream does not help BBR throughput and raises the queue occupancy, to the detriment of other users.

I happen to think that using multiple streams for the sake of maximising throughput is the wrong approach - it is a workaround employed pragmatically by some applications, nothing more.  If BBR can do just as well using a single flow, so much the better.

Another approach to improving the throughput of a single flow is high-fidelity congestion control.  The L4S approach to this, derived rather directly from DCTCP, is fundamentally flawed in that, not being fully backwards compatible with ECN, it cannot safely be deployed on the existing Internet.

An alternative HFCC design using non-ambiguous signalling would be incrementally deployable (thus applicable to Internet scale) and naturally overlaid on existing window-based congestion control.  It's possible to imagine such a flow reaching optimal cwnd by way of slow-start alone, then "cruising" there in a true equilibrium with congestion signals applied by the network.  In fact, we've already shown this occurring under lab conditions; in other cases it still takes one CUBIC cycle to get there.  BBR's periodic probing phases would not be required here.

> IMHO, two approaches seem to be useful:
> a) congestion-window-based operation with paced sending
> b) rate-based/paced sending with limiting the amount of inflight data

So this corresponds to approach a) in Roland's taxonomy.

 - Jonathan Morton

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Cerowrt-devel] [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 14:40                         ` [Cerowrt-devel] [Bloat] Abandoning Window-based CC Considered Harmful (was Bechtolschiem) Jonathan Morton
@ 2021-07-08 20:14                           ` David P. Reed
  0 siblings, 0 replies; 26+ messages in thread
From: David P. Reed @ 2021-07-08 20:14 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: Matt Mathis, Bless, Roland (TM), cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 4635 bytes --]


Keep It Simple, Stupid.
 
That's a classic architectural principle that still applies. Unfortunately, folks who think only in hardware terms want to add features to hardware, but don't study the actual real-world version of the problem.
 
IMO, and it's based on 50 years of experience in network and operating systems performance, latency (response time) is almost always the primary measure users care about. They never care about maximizing "utilization" of resources. After all, in a city, you get maximum utilization of roads when you create a traffic jam. That's not the normal state. In communications, the network should always be at about 10% utilization, because you never want a traffic jam across the whole system to accumulate. Even the old Bell System was engineered to not saturate the links on the worst minute of the worst hour of the worst day of the year (which was often Mother's Day, but could be when a power blackout occurs).
 
Yet, academics become obsessed with achieving constant very high utilization. And sometimes low-level communications folks adopt that value system, until their customers start complaining.
 
Why doesn't this penetrate the Net-Shaped Heads of switch designers and others?
 
What's excellent about what we used to call "best efforts" packet delivery (drop early and often to signal congestion) is that it is robust and puts the onus on the senders of traffic to sort out congestion as quickly as possible. The senders ALL observe congested links quite early if their receivers are paying attention, and they can collaborate *without even knowing who else is congesting the link*. And by picking the heaviest congestors with higher probability to drop, fq_codel pushes back in a "fair" way (probabilistically) when congestion actually crops up.
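
As a toy illustration of that last point (this is not fq_codel itself, which uses per-flow DRR++ scheduling with CoDel on each queue; it is just the intuition that the heaviest sender is the one that gets the signal first):

    from collections import defaultdict

    queues = defaultdict(list)   # per-flow queues sharing one buffer

    def enqueue(flow, pkt_bytes, limit_bytes):
        """Add a packet; on overflow, drop from the head of the fattest flow."""
        queues[flow].append(pkt_bytes)
        total = sum(sum(q) for q in queues.values())
        if total > limit_bytes:
            fattest = max(queues, key=lambda f: sum(queues[f]))
            dropped = queues[fattest].pop(0)   # heaviest congestor sees the loss
            return fattest, dropped
        return None, 0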
 
It isn't the responsibility of routers to get packets through at any cost. It's their responsibility to signal congestion early enough that it doesn't persist very long at all due to source-based rate adaptation.
In other words, a router's job is to route packets and do useful telemetry for the end points using it at the instant.
 
Please stop focusing on what is an irrelevant metric (maximum throughput with maximum utilization in a special situation only).
 
Focus on what routers can do well because they actually observe it (instantaneous congestion events), and keep them simple.
On Thursday, July 8, 2021 10:40am, "Jonathan Morton" <chromatix99@gmail.com> said:



> > On 8 Jul, 2021, at 4:29 pm, Matt Mathis via Bloat
> <bloat@lists.bufferbloat.net> wrote:
> >
> > That said, it is also true that multi-stream BBR behavior is quite
> complicated and needs more queue space than single stream. This complicates the
> story around the traditional workaround of using multiple streams to compensate
> for Reno & CUBIC lameness at larger scales (ordinary scales today). 
> Multi-stream does not help BBR throughput and raises the queue occupancy, to the
> detriment of other users.
> 
> I happen to think that using multiple streams for the sake of maximising
> throughput is the wrong approach - it is a workaround employed pragmatically by
> some applications, nothing more. If BBR can do just as well using a single flow,
> so much the better.
> 
> Another approach to improving the throughput of a single flow is high-fidelity
> congestion control. The L4S approach to this, derived rather directly from DCTCP,
> is fundamentally flawed in that, not being fully backwards compatible with ECN, it
> cannot safely be deployed on the existing Internet.
> 
> An alternative HFCC design using non-ambiguous signalling would be incrementally
> deployable (thus applicable to Internet scale) and naturally overlaid on existing
> window-based congestion control. It's possible to imagine such a flow reaching
> optimal cwnd by way of slow-start alone, then "cruising" there in a true
> equilibrium with congestion signals applied by the network. In fact, we've
> already shown this occurring under lab conditions; in other cases it still takes
> one CUBIC cycle to get there. BBR's periodic probing phases would not be required
> here.
> 
> > IMHO, two approaches seem to be useful:
> > a) congestion-window-based operation with paced sending
> > b) rate-based/paced sending with limiting the amount of inflight data
> 
> So this corresponds to approach a) in Roland's taxonomy.
> 
> - Jonathan Morton
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
> 

[-- Attachment #2: Type: text/html, Size: 7130 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-07-08 20:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-06 15:29 [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat Eric Johansson
2016-06-06 16:53 ` Toke Høiland-Jørgensen
2016-06-06 17:46   ` Jonathan Morton
2016-06-06 18:37     ` Mikael Abrahamsson
2016-06-06 21:16       ` Ketan Kulkarni
2016-06-07  2:52         ` dpreed
2016-06-07  2:58           ` dpreed
2016-06-07 10:46             ` Mikael Abrahamsson
2016-06-07 14:46               ` Dave Taht
2016-06-07 17:51             ` Eric Johansson
2016-06-10 21:45               ` dpreed
2016-06-11  1:36                 ` Jonathan Morton
2016-06-11  8:25                 ` Sebastian Moeller
2021-07-02 16:42           ` [Cerowrt-devel] Bechtolschiem Dave Taht
2021-07-02 16:59             ` [Cerowrt-devel] [Bloat] Bechtolschiem Stephen Hemminger
2021-07-02 19:46               ` Matt Mathis
2021-07-07 22:19                 ` [Cerowrt-devel] Abandoning Window-based CC Considered Harmful (was Re: [Bloat] Bechtolschiem) Bless, Roland (TM)
2021-07-07 22:38                   ` Matt Mathis
2021-07-08 11:24                     ` [Cerowrt-devel] " Bless, Roland (TM)
2021-07-08 13:29                       ` Matt Mathis
2021-07-08 14:05                         ` [Cerowrt-devel] " Bless, Roland (TM)
2021-07-08 14:40                         ` [Cerowrt-devel] [Bloat] Abandoning Window-based CC Considered Harmful (was Bechtolschiem) Jonathan Morton
2021-07-08 20:14                           ` David P. Reed
2021-07-08 13:29                       ` Neal Cardwell
2021-07-02 20:28               ` [Cerowrt-devel] [Bloat] Bechtolschiem Jonathan Morton
2016-06-07 22:31 ` [Cerowrt-devel] trying to make sense of what switch vendors say wrt buffer bloat Valdis.Kletnieks

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox