General list for discussing Bufferbloat
 help / color / mirror / Atom feed
* [Bloat] Bechtolschiem
       [not found]         ` <1465267957.902610235@apps.rackspace.com>
@ 2021-07-02 16:42           ` Dave Taht
  2021-07-02 16:59             ` Stephen Hemminger
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Taht @ 2021-07-02 16:42 UTC (permalink / raw)
  To: David Reed, bloat; +Cc: Ketan Kulkarni, Jonathan Morton, cerowrt-devel

"Debunking Bechtolsheim credibly would get a lot of attention to the
bufferbloat cause, I suspect." - dpreed

"Why Big Data Needs Big Buffer Switches" -
http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf

..

i think i've just gained access to a few networks with arista gear in
the bottleneck path.

On Mon, Jun 6, 2016 at 7:52 PM <dpreed@reed.com> wrote:
>
> So did anyone write a response debunking their paper?   Their NS-2 simulation is most likely the erroneous part of their analysis - the white paper would not pass a review by qualified referees because there is no way to check their results and some of what they say beggars belief.
>
>
>
> Bechtolsheim is one of those guys who can write any damn thing and it becomes "truth" - mostly because he co-founded Sun. But that doesn't mean that he can't make huge errors - any of us can.
>
>
>
> The so-called TCP/IP Bandwidth Capture effect that he refers to doesn't sound like any capture effect I've ever heard of.  There is an "Ethernet Capture Effect" (which is cited), which is due to properties of CSMA/CD binary exponential backoff, not anything to do with TCP's flow/congestion control.  So it has that "truthiness" that makes glib people sound like they know what they are talking about, but I'd like to see a reference that says this is a property of TCP!
>
>
>
> What's interesting is that the reference to the Ethernet Capture Effect in that white paper proposes a solution that involves changing the backoff algorithm slightly at the Ethernet level - NOT increasing buffer size!
>
>
>
> Another thing that would probably improve matters a great deal would be to drop/ECN-mark packets when a contended output port on an Arista switch develops a backlog.  This will throttle TCP sources sharing the path.
>
>
>
> The comments in the white paper that say that ACK contention in TCP in the reverse direction are the problem that causes the "so-called TCP/IP Bandwidth Capture effect" that is invented by the authors appears to be hogwash of the first order.
>
>
>
> Debunking Bechtolsheim credibly would get a lot of attention to the bufferbloat cause, I suspect.
>
>
>
>
>
> On Monday, June 6, 2016 5:16pm, "Ketan Kulkarni" <ketkulka@gmail.com> said:
>
> some time back they had this whitepaper -
> "Why Big Data Needs Big Buffer Switches"
> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
> the type of apps they talk about is big data, hadoop etc
>
> On Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
>>
>> On Mon, 6 Jun 2016, Jonathan Morton wrote:
>>
>>> At 100ms buffering, their 10Gbps switch is effectively turning any DC it’s installed in into a transcontinental Internet path, as far as peak latency is concerned.  Just because RAM is cheap these days…
>>
>> Nono, nononononono. I can tell you they're spending serious money on inserting this kind of buffering memory into these kinds of devices. Buying these devices without deep buffers is a lot lower cost.
>>
>> These types of switch chips either have on-die memory (usually 16MB or less), or they have very expensive (a direct cost of lowered port density) off-chip buffering memory.
>>
>> Typically you do this:
>>
>> ports ---|-------
>> ports ---|      |
>> ports ---| chip |
>> ports ---|-------
>>
>> Or you do this
>>
>> ports ---|------|---buffer
>> ports ---| chip |---TCAM
>>          --------
>>
>> or if you do a multi-linecard-device
>>
>> ports ---|------|---buffer
>>          | chip |---TCAM
>>          --------
>>             |
>>         switch fabric
>>
>> (or any variant of them)
>>
>> So basically if you want to buffer and if you want large L2-L4 lookup tables, you have to sacrifice ports. Sacrifice lots of ports.
>>
>> So never say these kinds of devices add buffering because RAM is cheap. This is most definitely not why they're doing it. Buffer memory for them is EXTREMELY EXPENSIVE.
>>
>> --
>> Mikael Abrahamsson    email: swmike@swm.pp.se
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel



-- 
Latest Podcast:
https://www.linkedin.com/feed/update/urn:li:activity:6791014284936785920/

Dave Täht CTO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Bechtolschiem
  2021-07-02 16:42           ` [Bloat] Bechtolschiem Dave Taht
@ 2021-07-02 16:59             ` Stephen Hemminger
  2021-07-02 17:50               ` Dave Collier-Brown
                                 ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Stephen Hemminger @ 2021-07-02 16:59 UTC (permalink / raw)
  To: Dave Taht
  Cc: David Reed, bloat, Jonathan Morton, Ketan Kulkarni, cerowrt-devel

On Fri, 2 Jul 2021 09:42:24 -0700
Dave Taht <dave.taht@gmail.com> wrote:

> "Debunking Bechtolsheim credibly would get a lot of attention to the
> bufferbloat cause, I suspect." - dpreed
> 
> "Why Big Data Needs Big Buffer Switches" -
> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
> 

Also, a lot depends on the TCP congestion control algorithm being used.
They are using NewReno which only researchers use in real life.

Even TCP Cubic has gone through several revisions. In my experience, the
NS-2 models don't correlate well to real world behavior.

In real world tests, TCP Cubic will consume any buffer it sees at a
congested link. Maybe that is what they mean by capture effect.

There is also a weird oscillation effect with multiple streams, where one
flow will take the buffer, then see a packet loss and back off, the
other flow will take over the buffer until it sees loss.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Bechtolschiem
  2021-07-02 16:59             ` Stephen Hemminger
@ 2021-07-02 17:50               ` Dave Collier-Brown
  2021-07-02 19:46               ` Matt Mathis
  2021-07-02 20:28               ` [Bloat] Bechtolschiem Jonathan Morton
  2 siblings, 0 replies; 16+ messages in thread
From: Dave Collier-Brown @ 2021-07-02 17:50 UTC (permalink / raw)
  To: bloat

[-- Attachment #1: Type: text/plain, Size: 1833 bytes --]

It's written to look like an academic paper, but it's pure marketing.  
"Memory is cheap, we used a lot, so let's select some evidence that 
argues this is a good thing."

As always with the coin-operated, the way to get them to change is to 
offer additional information which

  * captures their attention,

and, more importantly

  * offers them a cheap way to /make more money/.

For example, a software change that make their big buffers not fill up 
with elephants...

--dave

On 2021-07-02 12:59 p.m., Stephen Hemminger wrote:
> On Fri, 2 Jul 2021 09:42:24 -0700
> Dave Taht <dave.taht@gmail.com> wrote:
>
>> "Debunking Bechtolsheim credibly would get a lot of attention to the
>> bufferbloat cause, I suspect." - dpreed
>>
>> "Why Big Data Needs Big Buffer Switches" -
>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>
> Also, a lot depends on the TCP congestion control algorithm being used.
> They are using NewReno which only researchers use in real life.
>
> Even TCP Cubic has gone through several revisions. In my experience, the
> NS-2 models don't correlate well to real world behavior.
>
> In real world tests, TCP Cubic will consume any buffer it sees at a
> congested link. Maybe that is what they mean by capture effect.
>
> There is also a weird oscillation effect with multiple streams, where one
> flow will take the buffer, then see a packet loss and back off, the
> other flow will take over the buffer until it sees loss.
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat

-- 
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
dave.collier-brown@indexexchange.com |              -- Mark Twain


[-- Attachment #2: Type: text/html, Size: 2923 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Bechtolschiem
  2021-07-02 16:59             ` Stephen Hemminger
  2021-07-02 17:50               ` Dave Collier-Brown
@ 2021-07-02 19:46               ` Matt Mathis
  2021-07-07 22:19                 ` [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem) Bless, Roland (TM)
  2021-07-02 20:28               ` [Bloat] Bechtolschiem Jonathan Morton
  2 siblings, 1 reply; 16+ messages in thread
From: Matt Mathis @ 2021-07-02 19:46 UTC (permalink / raw)
  To: Dave Taht
  Cc: Jonathan Morton, Stephen Hemminger, David Reed, Ketan Kulkarni,
	cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 2781 bytes --]

The argument is absolutely correct for Reno, CUBIC and all
other self-clocked protocols.  One of the core assumptions in Jacobson88,
was that the clock for the entire system comes from packets draining
through the bottleneck queue.  In this world, the clock is intrinsically
brittle if the buffers are too small.  The drain time needs to be a
substantial fraction of the RTT.

However, we have reached the point where we need to discard that
requirement.  One of the side points of BBR is that in many environments it
is cheaper to burn serving CPU to pace into short queue networks than it is
to "right size" the network queues.

The fundamental problem with the old way is that in some contexts the
buffer memory has to beat Moore's law, because to maintain constant drain
time the memory size and BW both have to scale with the link (laser) BW.

See the slides I gave at the Stanford Buffer Sizing workshop december
2019: Buffer
Sizing: Position Paper
<https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>


Note that we are talking about DC and Internet core.  At the edge, BW is
low enough where memory is relatively cheap.   In some sense BB came about
because memory is too cheap in these environments.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of
control;
            too weak risks being mistaken for tacit approval.


On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <stephen@networkplumber.org>
wrote:

> On Fri, 2 Jul 2021 09:42:24 -0700
> Dave Taht <dave.taht@gmail.com> wrote:
>
> > "Debunking Bechtolsheim credibly would get a lot of attention to the
> > bufferbloat cause, I suspect." - dpreed
> >
> > "Why Big Data Needs Big Buffer Switches" -
> >
> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
> >
>
> Also, a lot depends on the TCP congestion control algorithm being used.
> They are using NewReno which only researchers use in real life.
>
> Even TCP Cubic has gone through several revisions. In my experience, the
> NS-2 models don't correlate well to real world behavior.
>
> In real world tests, TCP Cubic will consume any buffer it sees at a
> congested link. Maybe that is what they mean by capture effect.
>
> There is also a weird oscillation effect with multiple streams, where one
> flow will take the buffer, then see a packet loss and back off, the
> other flow will take over the buffer until it sees loss.
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>

[-- Attachment #2: Type: text/html, Size: 3954 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Bechtolschiem
  2021-07-02 16:59             ` Stephen Hemminger
  2021-07-02 17:50               ` Dave Collier-Brown
  2021-07-02 19:46               ` Matt Mathis
@ 2021-07-02 20:28               ` Jonathan Morton
  2 siblings, 0 replies; 16+ messages in thread
From: Jonathan Morton @ 2021-07-02 20:28 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Dave Taht, David Reed, bloat, Ketan Kulkarni, cerowrt-devel

> On 2 Jul, 2021, at 7:59 pm, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> In real world tests, TCP Cubic will consume any buffer it sees at a
> congested link. Maybe that is what they mean by capture effect.

First, I'll note that what they call "small buffer" corresponds to about a tenth of a millisecond at the port's link rate.  This would be ludicrously small at Internet scale, but is actually reasonable for datacentre conditions where RTTs are often in the microseconds.

Assuming the effect as described is real, it ultimately stems from a burst of traffic from a particular flow arriving at a queue that is *already* full.  Such bursts are expected from ack-clocked flows coming out of application-limited mode (ie. on completion of a disk read), in slow-start, or recovering from earlier losses.  It is also possible for a heavily coalesced ack to abruptly open the receive and congestion windows and trigger a send burst.  These bursts occur much less in paced flows, because the object of pacing is to avoid bursts.

The queue is full because tail drop upon queue overflow is the only congestion signal provided by the switch, and ack-clocked capacity-seeking transports naturally keep the queue as full as they can - especially under high statistical multiplexing conditions where a single multiplicative decrease event does not greatly reduce the total traffic demand. CUBIC arguably spends more time with the queue very close to full than Reno does, due to the plateau designed into it, but at these very short RTTs I would not be surprised if CUBIC is equivalent to Reno in practice.

The solution is to keep some normally-unused space in the queue for bursts of traffic to use occasionally.  This is most naturally done using ECN applied by some AQM algorithm, or the AQM can pre-emptively and selectively drop packets in Not-ECT flows.  And because the AQM is more likely to mark or drop packets from flows that occupy more link time or queue capacity, it has a natural equalising effect between flows.

Applying ECN requires some Layer 3 awareness in the switch, which might not be practical.  A simple alternative it to drop packets instead.  Single packet losses are easily recovered from by retransmission after approximately one RTT.  There are also emerging techniques for applying congestion signals at Layer 2, which can be converted into ECN signals at some convenient point downstream.

However it is achieved, the point is that keeping the *standing* queue down to some fraction of the total queue depth reserves space for accommodating those bursts which are expected occasionally in normal traffic.  Because those bursts are not lost, the flows experiencing them are not disadvantaged and the so-called "capture effect" will not occur.

 - Jonathan Morton

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-02 19:46               ` Matt Mathis
@ 2021-07-07 22:19                 ` Bless, Roland (TM)
  2021-07-07 22:38                   ` Matt Mathis
  0 siblings, 1 reply; 16+ messages in thread
From: Bless, Roland (TM) @ 2021-07-07 22:19 UTC (permalink / raw)
  To: Matt Mathis, Dave Taht; +Cc: cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 4901 bytes --]

Hi Matt,

[sorry for the late reply, overlooked this one]

please, see comments inline.

On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
> The argument is absolutely correct for Reno, CUBIC and all 
> other self-clocked protocols.  One of the core assumptions in 
> Jacobson88, was that the clock for the entire system comes from 
> packets draining through the bottleneck queue.  In this world, the 
> clock is intrinsically brittle if the buffers are too small.  The 
> drain time needs to be a substantial fraction of the RTT.
I'd like to separate the functions here a bit:

1) "automatic pacing" by ACK clocking

2) congestion-window-based operation

I agree that the automatic pacing generated by the ACK clock (function 
1) is increasingly
distorted these days and may consequently cause micro bursts.
This can be mitigated by using paced sending, which I consider very useful.
However, I consider abandoning the (congestion) window-based approaches
with ACK feedback (function 2) as harmful:
a congestion window has an automatic self-stabilizing property since the 
ACK feedback reflects
also the queuing delay and the congestion window limits the amount of 
inflight data.
In contrast, rate-based senders risk instability: two senders in an 
M/D/1 setting, each sender sending with 50%
bottleneck rate in average, both using paced sending at 120% of the 
average rate, suffice to cause
instability (queue grows unlimited).

IMHO, two approaches seem to be useful:
a) congestion-window-based operation with paced sending
b) rate-based/paced sending with limiting the amount of inflight data

>
> However, we have reached the point where we need to discard that 
> requirement.  One of the side points of BBR is that in many 
> environments it is cheaper to burn serving CPU to pace into short 
> queue networks than it is to "right size" the network queues.
>
> The fundamental problem with the old way is that in some contexts the 
> buffer memory has to beat Moore's law, because to maintain constant 
> drain time the memory size and BW both have to scale with the link 
> (laser) BW.
>
> See the slides I gave at the Stanford Buffer Sizing workshop december 
> 2019: Buffer Sizing: Position Paper 
> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5> 
>
>
Thanks for the pointer. I don't quite get the point that the buffer must 
have a certain size to keep the ACK clock stable:
in case of an non application-limited sender, a very small buffer 
suffices to let the ACK clock
run steady. The large buffers were mainly required for loss-based CCs to 
let the standing queue
build up that keeps the bottleneck busy during CWnd reduction after 
packet loss, thereby
keeping the (bottleneck link) utilization high.

Regards,

  Roland


> Note that we are talking about DC and Internet core.  At the edge, BW 
> is low enough where memory is relatively cheap.  In some sense BB came 
> about because memory is too cheap in these environments.
>
> Thanks,
> --MM--
> The best way to predict the future is to create it.  - Alan Kay
>
> We must not tolerate intolerance;
>        however our response must be carefully measured:
>             too strong would be hypocritical and risks spiraling out 
> of control;
>             too weak risks being mistaken for tacit approval.
>
>
> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger 
> <stephen@networkplumber.org <mailto:stephen@networkplumber.org>> wrote:
>
>     On Fri, 2 Jul 2021 09:42:24 -0700
>     Dave Taht <dave.taht@gmail.com <mailto:dave.taht@gmail.com>> wrote:
>
>     > "Debunking Bechtolsheim credibly would get a lot of attention to the
>     > bufferbloat cause, I suspect." - dpreed
>     >
>     > "Why Big Data Needs Big Buffer Switches" -
>     >
>     http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>     <http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf>
>     >
>
>     Also, a lot depends on the TCP congestion control algorithm being
>     used.
>     They are using NewReno which only researchers use in real life.
>
>     Even TCP Cubic has gone through several revisions. In my
>     experience, the
>     NS-2 models don't correlate well to real world behavior.
>
>     In real world tests, TCP Cubic will consume any buffer it sees at a
>     congested link. Maybe that is what they mean by capture effect.
>
>     There is also a weird oscillation effect with multiple streams,
>     where one
>     flow will take the buffer, then see a packet loss and back off, the
>     other flow will take over the buffer until it sees loss.
>
>     _______________________________________________
>
> _______________________________________________


[-- Attachment #2: Type: text/html, Size: 8065 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-07 22:19                 ` [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem) Bless, Roland (TM)
@ 2021-07-07 22:38                   ` Matt Mathis
  2021-07-08 11:24                     ` Bless, Roland (TM)
  0 siblings, 1 reply; 16+ messages in thread
From: Matt Mathis @ 2021-07-07 22:38 UTC (permalink / raw)
  To: Bless, Roland (TM); +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 5352 bytes --]

Actually BBR does have a window based backup, which normally only comes
into play during load spikes and at very short RTTs.   It defaults to
2*minRTT*maxBW, which is twice the steady state window in it's normal paced
mode.

This is too large for short queue routers in the Internet core, but it
helps a lot with cross traffic on large queue edge routers.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of
control;
            too weak risks being mistaken for tacit approval.


On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) <roland.bless@kit.edu>
wrote:

> Hi Matt,
>
> [sorry for the late reply, overlooked this one]
>
> please, see comments inline.
>
> On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>
> The argument is absolutely correct for Reno, CUBIC and all
> other self-clocked protocols.  One of the core assumptions in Jacobson88,
> was that the clock for the entire system comes from packets draining
> through the bottleneck queue.  In this world, the clock is intrinsically
> brittle if the buffers are too small.  The drain time needs to be a
> substantial fraction of the RTT.
>
> I'd like to separate the functions here a bit:
>
> 1) "automatic pacing" by ACK clocking
>
> 2) congestion-window-based operation
>
> I agree that the automatic pacing generated by the ACK clock (function 1)
> is increasingly
> distorted these days and may consequently cause micro bursts.
> This can be mitigated by using paced sending, which I consider very
> useful.
> However, I consider abandoning the (congestion) window-based approaches
> with ACK feedback (function 2) as harmful:
> a congestion window has an automatic self-stabilizing property since the
> ACK feedback reflects
> also the queuing delay and the congestion window limits the amount of
> inflight data.
> In contrast, rate-based senders risk instability: two senders in an M/D/1
> setting, each sender sending with 50%
> bottleneck rate in average, both using paced sending at 120% of the
> average rate, suffice to cause
> instability (queue grows unlimited).
>
> IMHO, two approaches seem to be useful:
> a) congestion-window-based operation with paced sending
> b) rate-based/paced sending with limiting the amount of inflight data
>
>
> However, we have reached the point where we need to discard that
> requirement.  One of the side points of BBR is that in many environments it
> is cheaper to burn serving CPU to pace into short queue networks than it is
> to "right size" the network queues.
>
> The fundamental problem with the old way is that in some contexts the
> buffer memory has to beat Moore's law, because to maintain constant drain
> time the memory size and BW both have to scale with the link (laser) BW.
>
> See the slides I gave at the Stanford Buffer Sizing workshop december
> 2019: Buffer Sizing: Position Paper
> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>
>
> Thanks for the pointer. I don't quite get the point that the buffer must
> have a certain size to keep the ACK clock stable:
> in case of an non application-limited sender, a very small buffer suffices
> to let the ACK clock
> run steady. The large buffers were mainly required for loss-based CCs to
> let the standing queue
> build up that keeps the bottleneck busy during CWnd reduction after packet
> loss, thereby
> keeping the (bottleneck link) utilization high.
>
> Regards,
>
>  Roland
>
>
> Note that we are talking about DC and Internet core.  At the edge, BW is
> low enough where memory is relatively cheap.   In some sense BB came about
> because memory is too cheap in these environments.
>
> Thanks,
> --MM--
> The best way to predict the future is to create it.  - Alan Kay
>
> We must not tolerate intolerance;
>        however our response must be carefully measured:
>             too strong would be hypocritical and risks spiraling out of
> control;
>             too weak risks being mistaken for tacit approval.
>
>
> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <
> stephen@networkplumber.org> wrote:
>
>> On Fri, 2 Jul 2021 09:42:24 -0700
>> Dave Taht <dave.taht@gmail.com> wrote:
>>
>> > "Debunking Bechtolsheim credibly would get a lot of attention to the
>> > bufferbloat cause, I suspect." - dpreed
>> >
>> > "Why Big Data Needs Big Buffer Switches" -
>> >
>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>> >
>>
>> Also, a lot depends on the TCP congestion control algorithm being used.
>> They are using NewReno which only researchers use in real life.
>>
>> Even TCP Cubic has gone through several revisions. In my experience, the
>> NS-2 models don't correlate well to real world behavior.
>>
>> In real world tests, TCP Cubic will consume any buffer it sees at a
>> congested link. Maybe that is what they mean by capture effect.
>>
>> There is also a weird oscillation effect with multiple streams, where one
>> flow will take the buffer, then see a packet loss and back off, the
>> other flow will take over the buffer until it sees loss.
>>
>> _______________________________________________
>
> _______________________________________________
>
>
>

[-- Attachment #2: Type: text/html, Size: 8873 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-07 22:38                   ` Matt Mathis
@ 2021-07-08 11:24                     ` Bless, Roland (TM)
  2021-07-08 13:29                       ` Matt Mathis
  2021-07-08 13:29                       ` [Bloat] " Neal Cardwell
  0 siblings, 2 replies; 16+ messages in thread
From: Bless, Roland (TM) @ 2021-07-08 11:24 UTC (permalink / raw)
  To: Matt Mathis; +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 6702 bytes --]

Hi Matt,

On 08.07.21 at 00:38 Matt Mathis wrote:
> Actually BBR does have a window based backup, which normally only 
> comes into play during load spikes and at very short RTTs.   It 
> defaults to 2*minRTT*maxBW, which is twice the steady state window in 
> it's normal paced mode.

So yes, BBR follows option b), but I guess that you are referring to 
BBRv1 here.
We have shown in [1, Sec.III] that BBRv1 flows will *always* run 
(conceptually) toward their above quoted inflight-cap of
2*minRTT*maxBW, if more than one BBR flow is present at the bottleneck. 
So strictly speaking " which *normally only* comes
into play during load spikes and at very short RTTs" isn't true for 
multiple BBRv1 flows.

It seems that in BBRv2 there are many more mechanisms present
that try to control the amount of inflight data more tightly and the new 
"cap"
is at 1.25 BDP.

> This is too large for short queue routers in the Internet core, but it 
> helps a lot with cross traffic on large queue edge routers.

Best regards,
  Roland

[1] https://ieeexplore.ieee.org/document/8117540

>
> On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) 
> <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>
>     Hi Matt,
>
>     [sorry for the late reply, overlooked this one]
>
>     please, see comments inline.
>
>     On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>     The argument is absolutely correct for Reno, CUBIC and all
>>     other self-clocked protocols.  One of the core assumptions in
>>     Jacobson88, was that the clock for the entire system comes from
>>     packets draining through the bottleneck queue.  In this world,
>>     the clock is intrinsically brittle if the buffers are too small.
>>     The drain time needs to be a substantial fraction of the RTT.
>     I'd like to separate the functions here a bit:
>
>     1) "automatic pacing" by ACK clocking
>
>     2) congestion-window-based operation
>
>     I agree that the automatic pacing generated by the ACK clock
>     (function 1) is increasingly
>     distorted these days and may consequently cause micro bursts.
>     This can be mitigated by using paced sending, which I consider
>     very useful.
>     However, I consider abandoning the (congestion) window-based
>     approaches
>     with ACK feedback (function 2) as harmful:
>     a congestion window has an automatic self-stabilizing property
>     since the ACK feedback reflects
>     also the queuing delay and the congestion window limits the amount
>     of inflight data.
>     In contrast, rate-based senders risk instability: two senders in
>     an M/D/1 setting, each sender sending with 50%
>     bottleneck rate in average, both using paced sending at 120% of
>     the average rate, suffice to cause
>     instability (queue grows unlimited).
>
>     IMHO, two approaches seem to be useful:
>     a) congestion-window-based operation with paced sending
>     b) rate-based/paced sending with limiting the amount of inflight data
>
>>
>>     However, we have reached the point where we need to discard that
>>     requirement.  One of the side points of BBR is that in many
>>     environments it is cheaper to burn serving CPU to pace into short
>>     queue networks than it is to "right size" the network queues.
>>
>>     The fundamental problem with the old way is that in some contexts
>>     the buffer memory has to beat Moore's law, because to maintain
>>     constant drain time the memory size and BW both have to scale
>>     with the link (laser) BW.
>>
>>     See the slides I gave at the Stanford Buffer Sizing workshop
>>     december 2019: Buffer Sizing: Position Paper
>>     <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>
>>
>     Thanks for the pointer. I don't quite get the point that the
>     buffer must have a certain size to keep the ACK clock stable:
>     in case of an non application-limited sender, a very small buffer
>     suffices to let the ACK clock
>     run steady. The large buffers were mainly required for loss-based
>     CCs to let the standing queue
>     build up that keeps the bottleneck busy during CWnd reduction
>     after packet loss, thereby
>     keeping the (bottleneck link) utilization high.
>
>     Regards,
>
>      Roland
>
>
>>     Note that we are talking about DC and Internet core.  At the
>>     edge, BW is low enough where memory is relatively cheap.   In
>>     some sense BB came about because memory is too cheap in these
>>     environments.
>>
>>     Thanks,
>>     --MM--
>>     The best way to predict the future is to create it.  - Alan Kay
>>
>>     We must not tolerate intolerance;
>>            however our response must be carefully measured:
>>                 too strong would be hypocritical and risks spiraling
>>     out of control;
>>                 too weak risks being mistaken for tacit approval.
>>
>>
>>     On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger
>>     <stephen@networkplumber.org <mailto:stephen@networkplumber.org>>
>>     wrote:
>>
>>         On Fri, 2 Jul 2021 09:42:24 -0700
>>         Dave Taht <dave.taht@gmail.com <mailto:dave.taht@gmail.com>>
>>         wrote:
>>
>>         > "Debunking Bechtolsheim credibly would get a lot of
>>         attention to the
>>         > bufferbloat cause, I suspect." - dpreed
>>         >
>>         > "Why Big Data Needs Big Buffer Switches" -
>>         >
>>         http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>         <http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf>
>>         >
>>
>>         Also, a lot depends on the TCP congestion control algorithm
>>         being used.
>>         They are using NewReno which only researchers use in real life.
>>
>>         Even TCP Cubic has gone through several revisions. In my
>>         experience, the
>>         NS-2 models don't correlate well to real world behavior.
>>
>>         In real world tests, TCP Cubic will consume any buffer it
>>         sees at a
>>         congested link. Maybe that is what they mean by capture effect.
>>
>>         There is also a weird oscillation effect with multiple
>>         streams, where one
>>         flow will take the buffer, then see a packet loss and back
>>         off, the
>>         other flow will take over the buffer until it sees loss.
>>
>>         _______________________________________________
>>
>>     _______________________________________________
>


[-- Attachment #2: Type: text/html, Size: 11660 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 11:24                     ` Bless, Roland (TM)
@ 2021-07-08 13:29                       ` Matt Mathis
  2021-07-08 14:05                         ` Bless, Roland (TM)
  2021-07-08 14:40                         ` Jonathan Morton
  2021-07-08 13:29                       ` [Bloat] " Neal Cardwell
  1 sibling, 2 replies; 16+ messages in thread
From: Matt Mathis @ 2021-07-08 13:29 UTC (permalink / raw)
  To: Bless, Roland (TM); +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 7528 bytes --]

I think there is something missing from your model.    I just scanned your
paper and noticed that you made no mention of rounding errors, nor some
details around the drain phase timing,   The implementation guarantees that
the actual average rate across the combined BW probe and drain is strictly
less than the measured maxBW and that the flight size comes back down to
minRTT*maxBW before returning to unity pacing gain.  In some sense these
checks are redundant, but If you don't do them, it is absolutely true that
you are at risk of seeing divergent behaviors.

That said, it is also true that multi-stream BBR behavior is quite
complicated and needs more queue space than single stream.   This
complicates the story around the traditional workaround of using multiple
streams to compensate for Reno & CUBIC lameness at larger scales (ordinary
scales today).    Multi-stream does not help BBR throughput and raises the
queue occupancy, to the detriment of other users.

And yes, in my presentation, I described the core BBR algorithms as a
framework, which might be extended to incorporate many additional
algorithms if they provide optimal control in some settings.  And yes,
several are present in BBRv2.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of
control;
            too weak risks being mistaken for tacit approval.


On Thu, Jul 8, 2021 at 4:24 AM Bless, Roland (TM) <roland.bless@kit.edu>
wrote:

> Hi Matt,
>
> On 08.07.21 at 00:38 Matt Mathis wrote:
>
> Actually BBR does have a window based backup, which normally only comes
> into play during load spikes and at very short RTTs.   It defaults to
> 2*minRTT*maxBW, which is twice the steady state window in it's normal paced
> mode.
>
> So yes, BBR follows option b), but I guess that you are referring to BBRv1
> here.
> We have shown in [1, Sec.III] that BBRv1 flows will *always* run
> (conceptually) toward their above quoted inflight-cap of
> 2*minRTT*maxBW, if more than one BBR flow is present at the bottleneck. So
> strictly speaking " which *normally only* comes
> into play during load spikes and at very short RTTs" isn't true for
> multiple BBRv1 flows.
>
> It seems that in BBRv2 there are many more mechanisms present
> that try to control the amount of inflight data more tightly and the new
> "cap"
> is at 1.25 BDP.
>
> This is too large for short queue routers in the Internet core, but it
> helps a lot with cross traffic on large queue edge routers.
>
> Best regards,
>  Roland
>
> [1] https://ieeexplore.ieee.org/document/8117540
>
>
> On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) <roland.bless@kit.edu>
> wrote:
>
>> Hi Matt,
>>
>> [sorry for the late reply, overlooked this one]
>>
>> please, see comments inline.
>>
>> On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>
>> The argument is absolutely correct for Reno, CUBIC and all
>> other self-clocked protocols.  One of the core assumptions in Jacobson88,
>> was that the clock for the entire system comes from packets draining
>> through the bottleneck queue.  In this world, the clock is intrinsically
>> brittle if the buffers are too small.  The drain time needs to be a
>> substantial fraction of the RTT.
>>
>> I'd like to separate the functions here a bit:
>>
>> 1) "automatic pacing" by ACK clocking
>>
>> 2) congestion-window-based operation
>>
>> I agree that the automatic pacing generated by the ACK clock (function 1)
>> is increasingly
>> distorted these days and may consequently cause micro bursts.
>> This can be mitigated by using paced sending, which I consider very
>> useful.
>> However, I consider abandoning the (congestion) window-based approaches
>> with ACK feedback (function 2) as harmful:
>> a congestion window has an automatic self-stabilizing property since the
>> ACK feedback reflects
>> also the queuing delay and the congestion window limits the amount of
>> inflight data.
>> In contrast, rate-based senders risk instability: two senders in an M/D/1
>> setting, each sender sending with 50%
>> bottleneck rate in average, both using paced sending at 120% of the
>> average rate, suffice to cause
>> instability (queue grows unlimited).
>>
>> IMHO, two approaches seem to be useful:
>> a) congestion-window-based operation with paced sending
>> b) rate-based/paced sending with limiting the amount of inflight data
>>
>>
>> However, we have reached the point where we need to discard that
>> requirement.  One of the side points of BBR is that in many environments it
>> is cheaper to burn serving CPU to pace into short queue networks than it is
>> to "right size" the network queues.
>>
>> The fundamental problem with the old way is that in some contexts the
>> buffer memory has to beat Moore's law, because to maintain constant drain
>> time the memory size and BW both have to scale with the link (laser) BW.
>>
>> See the slides I gave at the Stanford Buffer Sizing workshop december
>> 2019: Buffer Sizing: Position Paper
>> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>
>>
>> Thanks for the pointer. I don't quite get the point that the buffer must
>> have a certain size to keep the ACK clock stable:
>> in case of an non application-limited sender, a very small buffer
>> suffices to let the ACK clock
>> run steady. The large buffers were mainly required for loss-based CCs to
>> let the standing queue
>> build up that keeps the bottleneck busy during CWnd reduction after
>> packet loss, thereby
>> keeping the (bottleneck link) utilization high.
>>
>> Regards,
>>
>>  Roland
>>
>>
>> Note that we are talking about DC and Internet core.  At the edge, BW is
>> low enough where memory is relatively cheap.   In some sense BB came about
>> because memory is too cheap in these environments.
>>
>> Thanks,
>> --MM--
>> The best way to predict the future is to create it.  - Alan Kay
>>
>> We must not tolerate intolerance;
>>        however our response must be carefully measured:
>>             too strong would be hypocritical and risks spiraling out of
>> control;
>>             too weak risks being mistaken for tacit approval.
>>
>>
>> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <
>> stephen@networkplumber.org> wrote:
>>
>>> On Fri, 2 Jul 2021 09:42:24 -0700
>>> Dave Taht <dave.taht@gmail.com> wrote:
>>>
>>> > "Debunking Bechtolsheim credibly would get a lot of attention to the
>>> > bufferbloat cause, I suspect." - dpreed
>>> >
>>> > "Why Big Data Needs Big Buffer Switches" -
>>> >
>>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>> >
>>>
>>> Also, a lot depends on the TCP congestion control algorithm being used.
>>> They are using NewReno which only researchers use in real life.
>>>
>>> Even TCP Cubic has gone through several revisions. In my experience, the
>>> NS-2 models don't correlate well to real world behavior.
>>>
>>> In real world tests, TCP Cubic will consume any buffer it sees at a
>>> congested link. Maybe that is what they mean by capture effect.
>>>
>>> There is also a weird oscillation effect with multiple streams, where one
>>> flow will take the buffer, then see a packet loss and back off, the
>>> other flow will take over the buffer until it sees loss.
>>>
>>> _______________________________________________
>>
>> _______________________________________________
>>
>>
>>
>

[-- Attachment #2: Type: text/html, Size: 13458 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 11:24                     ` Bless, Roland (TM)
  2021-07-08 13:29                       ` Matt Mathis
@ 2021-07-08 13:29                       ` Neal Cardwell
  2021-07-08 14:28                         ` Bless, Roland (TM)
  1 sibling, 1 reply; 16+ messages in thread
From: Neal Cardwell @ 2021-07-08 13:29 UTC (permalink / raw)
  To: Bless, Roland (TM); +Cc: Matt Mathis, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 6142 bytes --]

On Thu, Jul 8, 2021 at 7:25 AM Bless, Roland (TM) <roland.bless@kit.edu>
wrote:

> It seems that in BBRv2 there are many more mechanisms present
> that try to control the amount of inflight data more tightly and the new
> "cap"
> is at 1.25 BDP.
>
To clarify, the BBRv2 cwnd cap is not 1.25*BDP. If there is no packet loss
or ECN, the BBRv2 cwnd cap is the same as BBRv1. But if there has been
packet loss then conceptually the cwnd cap is the maximum amount of data
delivered in a single round trip since the last packet loss (with a floor
to ensure that the cwnd does not decrease by more than 30% per round trip
with packet loss, similar to CUBIC's 30% reduction in a round trip with
packet loss). (And upon RTO the BBR (v1 or v2) cwnd is reset to 1, and
slow-starts upward from there.)

There is an overview of the BBRv2 response to packet loss here:

https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00#page=18

best,
neal



> This is too large for short queue routers in the Internet core, but it
> helps a lot with cross traffic on large queue edge routers.
>
> Best regards,
>  Roland
>
> [1] https://ieeexplore.ieee.org/document/8117540
>
>
> On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) <roland.bless@kit.edu>
> wrote:
>
>> Hi Matt,
>>
>> [sorry for the late reply, overlooked this one]
>>
>> please, see comments inline.
>>
>> On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>
>> The argument is absolutely correct for Reno, CUBIC and all
>> other self-clocked protocols.  One of the core assumptions in Jacobson88,
>> was that the clock for the entire system comes from packets draining
>> through the bottleneck queue.  In this world, the clock is intrinsically
>> brittle if the buffers are too small.  The drain time needs to be a
>> substantial fraction of the RTT.
>>
>> I'd like to separate the functions here a bit:
>>
>> 1) "automatic pacing" by ACK clocking
>>
>> 2) congestion-window-based operation
>>
>> I agree that the automatic pacing generated by the ACK clock (function 1)
>> is increasingly
>> distorted these days and may consequently cause micro bursts.
>> This can be mitigated by using paced sending, which I consider very
>> useful.
>> However, I consider abandoning the (congestion) window-based approaches
>> with ACK feedback (function 2) as harmful:
>> a congestion window has an automatic self-stabilizing property since the
>> ACK feedback reflects
>> also the queuing delay and the congestion window limits the amount of
>> inflight data.
>> In contrast, rate-based senders risk instability: two senders in an M/D/1
>> setting, each sender sending with 50%
>> bottleneck rate in average, both using paced sending at 120% of the
>> average rate, suffice to cause
>> instability (queue grows unlimited).
>>
>> IMHO, two approaches seem to be useful:
>> a) congestion-window-based operation with paced sending
>> b) rate-based/paced sending with limiting the amount of inflight data
>>
>>
>> However, we have reached the point where we need to discard that
>> requirement.  One of the side points of BBR is that in many environments it
>> is cheaper to burn serving CPU to pace into short queue networks than it is
>> to "right size" the network queues.
>>
>> The fundamental problem with the old way is that in some contexts the
>> buffer memory has to beat Moore's law, because to maintain constant drain
>> time the memory size and BW both have to scale with the link (laser) BW.
>>
>> See the slides I gave at the Stanford Buffer Sizing workshop december
>> 2019: Buffer Sizing: Position Paper
>> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>
>>
>> Thanks for the pointer. I don't quite get the point that the buffer must
>> have a certain size to keep the ACK clock stable:
>> in case of an non application-limited sender, a very small buffer
>> suffices to let the ACK clock
>> run steady. The large buffers were mainly required for loss-based CCs to
>> let the standing queue
>> build up that keeps the bottleneck busy during CWnd reduction after
>> packet loss, thereby
>> keeping the (bottleneck link) utilization high.
>>
>> Regards,
>>
>>  Roland
>>
>>
>> Note that we are talking about DC and Internet core.  At the edge, BW is
>> low enough where memory is relatively cheap.   In some sense BB came about
>> because memory is too cheap in these environments.
>>
>> Thanks,
>> --MM--
>> The best way to predict the future is to create it.  - Alan Kay
>>
>> We must not tolerate intolerance;
>>        however our response must be carefully measured:
>>             too strong would be hypocritical and risks spiraling out of
>> control;
>>             too weak risks being mistaken for tacit approval.
>>
>>
>> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <
>> stephen@networkplumber.org> wrote:
>>
>>> On Fri, 2 Jul 2021 09:42:24 -0700
>>> Dave Taht <dave.taht@gmail.com> wrote:
>>>
>>> > "Debunking Bechtolsheim credibly would get a lot of attention to the
>>> > bufferbloat cause, I suspect." - dpreed
>>> >
>>> > "Why Big Data Needs Big Buffer Switches" -
>>> >
>>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>> >
>>>
>>> Also, a lot depends on the TCP congestion control algorithm being used.
>>> They are using NewReno which only researchers use in real life.
>>>
>>> Even TCP Cubic has gone through several revisions. In my experience, the
>>> NS-2 models don't correlate well to real world behavior.
>>>
>>> In real world tests, TCP Cubic will consume any buffer it sees at a
>>> congested link. Maybe that is what they mean by capture effect.
>>>
>>> There is also a weird oscillation effect with multiple streams, where one
>>> flow will take the buffer, then see a packet loss and back off, the
>>> other flow will take over the buffer until it sees loss.
>>>
>>> _______________________________________________
>>
>> _______________________________________________
>>
>>
>>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>

[-- Attachment #2: Type: text/html, Size: 12003 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 13:29                       ` Matt Mathis
@ 2021-07-08 14:05                         ` Bless, Roland (TM)
  2021-07-08 14:40                         ` Jonathan Morton
  1 sibling, 0 replies; 16+ messages in thread
From: Bless, Roland (TM) @ 2021-07-08 14:05 UTC (permalink / raw)
  To: Matt Mathis; +Cc: Dave Taht, cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 9727 bytes --]

Hi Matt,

On 08.07.21 at 15:29 Matt Mathis wrote:
> I think there is something missing from your model.    I just scanned 
> your paper and noticed that you made no mention of rounding errors, 
> nor some details around the drain phase timing,   The 
> implementation guarantees that the actual average rate across the 
> combined BW probe and drain is strictly less than the measured maxBW 
> and that the flight size comes back down to minRTT*maxBW before 
> returning to unity pacing gain.  In some sense these checks are 
> redundant, but If you don't do them, it is absolutely true that you 
> are at risk of seeing divergent behaviors.
Sure, most models abstract things away and so does our model leave out
some details, but it describes quite accurately what happens if multiple
BBRv1 flows are present. So the model was not only confirmed by our
own measurements, but also by many others who did BBRv1 experiments.
> That said, it is also true that multi-stream BBR behavior is quite 
> complicated and needs more queue space than single stream.   This
Yes, mostly between 1bdp and 1.5bdp of queue space.
> complicates the story around the traditional workaround of using 
> multiple streams to compensate for Reno & CUBIC lameness at larger 
> scales (ordinary scales today). Multi-stream does not help BBR 
> throughput and raises the queue occupancy, to the detriment of other 
> users.
>
> And yes, in my presentation, I described the core BBR algorithms as a 
> framework, which might be extended to incorporate many additional 
> algorithms if they provide optimal control in some settings.  And yes, 
> several are present in BBRv2.

Ok, thanks for clarification.

Regards,
  Roland

> Thanks,
> --MM--
> The best way to predict the future is to create it.  - Alan Kay
>
> We must not tolerate intolerance;
>        however our response must be carefully measured:
>             too strong would be hypocritical and risks spiraling out 
> of control;
>             too weak risks being mistaken for tacit approval.
>
>
> On Thu, Jul 8, 2021 at 4:24 AM Bless, Roland (TM) 
> <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>
>     Hi Matt,
>
>     On 08.07.21 at 00:38 Matt Mathis wrote:
>>     Actually BBR does have a window based backup, which normally only
>>     comes into play during load spikes and at very short RTTs.   It
>>     defaults to 2*minRTT*maxBW, which is twice the steady state
>>     window in it's normal paced mode.
>
>     So yes, BBR follows option b), but I guess that you are referring
>     to BBRv1 here.
>     We have shown in [1, Sec.III] that BBRv1 flows will *always* run
>     (conceptually) toward their above quoted inflight-cap of
>     2*minRTT*maxBW, if more than one BBR flow is present at the
>     bottleneck. So strictly speaking " which *normally only* comes
>     into play during load spikes and at very short RTTs" isn't true
>     for multiple BBRv1 flows.
>
>     It seems that in BBRv2 there are many more mechanisms present
>     that try to control the amount of inflight data more tightly and
>     the new "cap"
>     is at 1.25 BDP.
>
>>     This is too large for short queue routers in the Internet core,
>>     but it helps a lot with cross traffic on large queue edge routers.
>
>     Best regards,
>      Roland
>
>     [1] https://ieeexplore.ieee.org/document/8117540
>     <https://ieeexplore.ieee.org/document/8117540>
>
>>
>>     On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM)
>>     <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>>
>>         Hi Matt,
>>
>>         [sorry for the late reply, overlooked this one]
>>
>>         please, see comments inline.
>>
>>         On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>>         The argument is absolutely correct for Reno, CUBIC and all
>>>         other self-clocked protocols.  One of the core assumptions
>>>         in Jacobson88, was that the clock for the entire system
>>>         comes from packets draining through the bottleneck queue. 
>>>         In this world, the clock is intrinsically brittle if the
>>>         buffers are too small.  The drain time needs to be a
>>>         substantial fraction of the RTT.
>>         I'd like to separate the functions here a bit:
>>
>>         1) "automatic pacing" by ACK clocking
>>
>>         2) congestion-window-based operation
>>
>>         I agree that the automatic pacing generated by the ACK clock
>>         (function 1) is increasingly
>>         distorted these days and may consequently cause micro bursts.
>>         This can be mitigated by using paced sending, which I
>>         consider very useful.
>>         However, I consider abandoning the (congestion) window-based
>>         approaches
>>         with ACK feedback (function 2) as harmful:
>>         a congestion window has an automatic self-stabilizing
>>         property since the ACK feedback reflects
>>         also the queuing delay and the congestion window limits the
>>         amount of inflight data.
>>         In contrast, rate-based senders risk instability: two senders
>>         in an M/D/1 setting, each sender sending with 50%
>>         bottleneck rate in average, both using paced sending at 120%
>>         of the average rate, suffice to cause
>>         instability (queue grows unlimited).
>>
>>         IMHO, two approaches seem to be useful:
>>         a) congestion-window-based operation with paced sending
>>         b) rate-based/paced sending with limiting the amount of
>>         inflight data
>>
>>>
>>>         However, we have reached the point where we need to discard
>>>         that requirement.  One of the side points of BBR is that in
>>>         many environments it is cheaper to burn serving CPU to pace
>>>         into short queue networks than it is to "right size" the
>>>         network queues.
>>>
>>>         The fundamental problem with the old way is that in some
>>>         contexts the buffer memory has to beat Moore's law, because
>>>         to maintain constant drain time the memory size and BW both
>>>         have to scale with the link (laser) BW.
>>>
>>>         See the slides I gave at the Stanford Buffer Sizing workshop
>>>         december 2019: Buffer Sizing: Position Paper
>>>         <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>>
>>>
>>         Thanks for the pointer. I don't quite get the point that the
>>         buffer must have a certain size to keep the ACK clock stable:
>>         in case of an non application-limited sender, a very small
>>         buffer suffices to let the ACK clock
>>         run steady. The large buffers were mainly required for
>>         loss-based CCs to let the standing queue
>>         build up that keeps the bottleneck busy during CWnd reduction
>>         after packet loss, thereby
>>         keeping the (bottleneck link) utilization high.
>>
>>         Regards,
>>
>>          Roland
>>
>>
>>>         Note that we are talking about DC and Internet core.  At the
>>>         edge, BW is low enough where memory is relatively cheap. 
>>>          In some sense BB came about because memory is too cheap in
>>>         these environments.
>>>
>>>         Thanks,
>>>         --MM--
>>>         The best way to predict the future is to create it.  - Alan Kay
>>>
>>>         We must not tolerate intolerance;
>>>                however our response must be carefully measured:
>>>                     too strong would be hypocritical and risks
>>>         spiraling out of control;
>>>                     too weak risks being mistaken for tacit approval.
>>>
>>>
>>>         On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger
>>>         <stephen@networkplumber.org
>>>         <mailto:stephen@networkplumber.org>> wrote:
>>>
>>>             On Fri, 2 Jul 2021 09:42:24 -0700
>>>             Dave Taht <dave.taht@gmail.com
>>>             <mailto:dave.taht@gmail.com>> wrote:
>>>
>>>             > "Debunking Bechtolsheim credibly would get a lot of
>>>             attention to the
>>>             > bufferbloat cause, I suspect." - dpreed
>>>             >
>>>             > "Why Big Data Needs Big Buffer Switches" -
>>>             >
>>>             http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>>             <http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf>
>>>             >
>>>
>>>             Also, a lot depends on the TCP congestion control
>>>             algorithm being used.
>>>             They are using NewReno which only researchers use in
>>>             real life.
>>>
>>>             Even TCP Cubic has gone through several revisions. In my
>>>             experience, the
>>>             NS-2 models don't correlate well to real world behavior.
>>>
>>>             In real world tests, TCP Cubic will consume any buffer
>>>             it sees at a
>>>             congested link. Maybe that is what they mean by capture
>>>             effect.
>>>
>>>             There is also a weird oscillation effect with multiple
>>>             streams, where one
>>>             flow will take the buffer, then see a packet loss and
>>>             back off, the
>>>             other flow will take over the buffer until it sees loss.
>>>
>>>             _______________________________________________
>>>
>>>         _______________________________________________
>>
>


[-- Attachment #2: Type: text/html, Size: 18241 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 13:29                       ` [Bloat] " Neal Cardwell
@ 2021-07-08 14:28                         ` Bless, Roland (TM)
  2021-07-08 15:47                           ` Neal Cardwell
  0 siblings, 1 reply; 16+ messages in thread
From: Bless, Roland (TM) @ 2021-07-08 14:28 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Matt Mathis, bloat

[-- Attachment #1: Type: text/plain, Size: 8324 bytes --]

Hi Neal,

On 08.07.21 at 15:29 Neal Cardwell wrote:
> On Thu, Jul 8, 2021 at 7:25 AM Bless, Roland (TM) 
> <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>
>     It seems that in BBRv2 there are many more mechanisms present
>     that try to control the amount of inflight data more tightly and
>     the new "cap"
>     is at 1.25 BDP.
>
> To clarify, the BBRv2 cwnd cap is not 1.25*BDP. If there is no packet 
> loss or ECN, the BBRv2 cwnd cap is the same as BBRv1. But if there has 
> been packet loss then conceptually the cwnd cap is the maximum amount 
> of data delivered in a single round trip since the last packet loss 
> (with a floor to ensure that the cwnd does not decrease by more than 
> 30% per round trip with packet loss, similar to CUBIC's 30% reduction 
> in a round trip with packet loss). (And upon RTO the BBR (v1 or v2) 
> cwnd is reset to 1, and slow-starts upward from there.)
Thanks for the clarification. I'm patiently waiting to see the BBRv2 
mechanisms coherently written up
in that new BBR Internet-Draft version ;-) Getting this together from 
the "diffs" on the IETF slides or the source code
is somewhat tedious, so I'll be very grateful for having that single 
write up.
> There is an overview of the BBRv2 response to packet loss here:
> https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00#page=18 
> <https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00#page=18>
My assumption came from slide 25 of this slide set:
the probing is terminated if inflight > 1.25 estimated_bdp (or "hard 
ceiling" seen).
So without experiencing more than 2% packet loss this may end up beyond 
1.25 estimated_bdp,
but would it often end at 2estimated_bdp?

Best regards,

  Roland

>
>>     This is too large for short queue routers in the Internet core,
>>     but it helps a lot with cross traffic on large queue edge routers.
>
>     Best regards,
>      Roland
>
>     [1] https://ieeexplore.ieee.org/document/8117540
>     <https://ieeexplore.ieee.org/document/8117540>
>
>>
>>     On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM)
>>     <roland.bless@kit.edu <mailto:roland.bless@kit.edu>> wrote:
>>
>>         Hi Matt,
>>
>>         [sorry for the late reply, overlooked this one]
>>
>>         please, see comments inline.
>>
>>         On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>>         The argument is absolutely correct for Reno, CUBIC and all
>>>         other self-clocked protocols.  One of the core assumptions
>>>         in Jacobson88, was that the clock for the entire system
>>>         comes from packets draining through the bottleneck queue. 
>>>         In this world, the clock is intrinsically brittle if the
>>>         buffers are too small.  The drain time needs to be a
>>>         substantial fraction of the RTT.
>>         I'd like to separate the functions here a bit:
>>
>>         1) "automatic pacing" by ACK clocking
>>
>>         2) congestion-window-based operation
>>
>>         I agree that the automatic pacing generated by the ACK clock
>>         (function 1) is increasingly
>>         distorted these days and may consequently cause micro bursts.
>>         This can be mitigated by using paced sending, which I
>>         consider very useful.
>>         However, I consider abandoning the (congestion) window-based
>>         approaches
>>         with ACK feedback (function 2) as harmful:
>>         a congestion window has an automatic self-stabilizing
>>         property since the ACK feedback reflects
>>         also the queuing delay and the congestion window limits the
>>         amount of inflight data.
>>         In contrast, rate-based senders risk instability: two senders
>>         in an M/D/1 setting, each sender sending with 50%
>>         bottleneck rate in average, both using paced sending at 120%
>>         of the average rate, suffice to cause
>>         instability (queue grows unlimited).
>>
>>         IMHO, two approaches seem to be useful:
>>         a) congestion-window-based operation with paced sending
>>         b) rate-based/paced sending with limiting the amount of
>>         inflight data
>>
>>>
>>>         However, we have reached the point where we need to discard
>>>         that requirement. One of the side points of BBR is that in
>>>         many environments it is cheaper to burn serving CPU to pace
>>>         into short queue networks than it is to "right size" the
>>>         network queues.
>>>
>>>         The fundamental problem with the old way is that in some
>>>         contexts the buffer memory has to beat Moore's law, because
>>>         to maintain constant drain time the memory size and BW both
>>>         have to scale with the link (laser) BW.
>>>
>>>         See the slides I gave at the Stanford Buffer Sizing workshop
>>>         december 2019: Buffer Sizing: Position Paper
>>>         <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>>
>>>
>>         Thanks for the pointer. I don't quite get the point that the
>>         buffer must have a certain size to keep the ACK clock stable:
>>         in case of an non application-limited sender, a very small
>>         buffer suffices to let the ACK clock
>>         run steady. The large buffers were mainly required for
>>         loss-based CCs to let the standing queue
>>         build up that keeps the bottleneck busy during CWnd reduction
>>         after packet loss, thereby
>>         keeping the (bottleneck link) utilization high.
>>
>>         Regards,
>>
>>          Roland
>>
>>
>>>         Note that we are talking about DC and Internet core.  At the
>>>         edge, BW is low enough where memory is relatively cheap.  In
>>>         some sense BB came about because memory is too cheap in
>>>         these environments.
>>>
>>>         Thanks,
>>>         --MM--
>>>         The best way to predict the future is to create it.  - Alan Kay
>>>
>>>         We must not tolerate intolerance;
>>>                however our response must be carefully measured:
>>>                     too strong would be hypocritical and risks
>>>         spiraling out of control;
>>>                     too weak risks being mistaken for tacit approval.
>>>
>>>
>>>         On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger
>>>         <stephen@networkplumber.org
>>>         <mailto:stephen@networkplumber.org>> wrote:
>>>
>>>             On Fri, 2 Jul 2021 09:42:24 -0700
>>>             Dave Taht <dave.taht@gmail.com
>>>             <mailto:dave.taht@gmail.com>> wrote:
>>>
>>>             > "Debunking Bechtolsheim credibly would get a lot of
>>>             attention to the
>>>             > bufferbloat cause, I suspect." - dpreed
>>>             >
>>>             > "Why Big Data Needs Big Buffer Switches" -
>>>             >
>>>             http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>>             <http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf>
>>>             >
>>>
>>>             Also, a lot depends on the TCP congestion control
>>>             algorithm being used.
>>>             They are using NewReno which only researchers use in
>>>             real life.
>>>
>>>             Even TCP Cubic has gone through several revisions. In my
>>>             experience, the
>>>             NS-2 models don't correlate well to real world behavior.
>>>
>>>             In real world tests, TCP Cubic will consume any buffer
>>>             it sees at a
>>>             congested link. Maybe that is what they mean by capture
>>>             effect.
>>>
>>>             There is also a weird oscillation effect with multiple
>>>             streams, where one
>>>             flow will take the buffer, then see a packet loss and
>>>             back off, the
>>>             other flow will take over the buffer until it sees loss.
>>>
>>>             _______________________________________________
>>>
>>>         _______________________________________________
>>


[-- Attachment #2: Type: text/html, Size: 16474 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 13:29                       ` Matt Mathis
  2021-07-08 14:05                         ` Bless, Roland (TM)
@ 2021-07-08 14:40                         ` Jonathan Morton
  2021-07-08 20:14                           ` [Bloat] [Cerowrt-devel] " David P. Reed
  1 sibling, 1 reply; 16+ messages in thread
From: Jonathan Morton @ 2021-07-08 14:40 UTC (permalink / raw)
  To: Matt Mathis; +Cc: Bless, Roland (TM), cerowrt-devel, bloat

> On 8 Jul, 2021, at 4:29 pm, Matt Mathis via Bloat <bloat@lists.bufferbloat.net> wrote:
> 
> That said, it is also true that multi-stream BBR behavior is quite complicated and needs more queue space than single stream.   This complicates the story around the traditional workaround of using multiple streams to compensate for Reno & CUBIC lameness at larger scales (ordinary scales today).    Multi-stream does not help BBR throughput and raises the queue occupancy, to the detriment of other users.

I happen to think that using multiple streams for the sake of maximising throughput is the wrong approach - it is a workaround employed pragmatically by some applications, nothing more.  If BBR can do just as well using a single flow, so much the better.

Another approach to improving the throughput of a single flow is high-fidelity congestion control.  The L4S approach to this, derived rather directly from DCTCP, is fundamentally flawed in that, not being fully backwards compatible with ECN, it cannot safely be deployed on the existing Internet.

An alternative HFCC design using non-ambiguous signalling would be incrementally deployable (thus applicable to Internet scale) and naturally overlaid on existing window-based congestion control.  It's possible to imagine such a flow reaching optimal cwnd by way of slow-start alone, then "cruising" there in a true equilibrium with congestion signals applied by the network.  In fact, we've already shown this occurring under lab conditions; in other cases it still takes one CUBIC cycle to get there.  BBR's periodic probing phases would not be required here.

> IMHO, two approaches seem to be useful:
> a) congestion-window-based operation with paced sending
> b) rate-based/paced sending with limiting the amount of inflight data

So this corresponds to approach a) in Roland's taxonomy.

 - Jonathan Morton

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 14:28                         ` Bless, Roland (TM)
@ 2021-07-08 15:47                           ` Neal Cardwell
  0 siblings, 0 replies; 16+ messages in thread
From: Neal Cardwell @ 2021-07-08 15:47 UTC (permalink / raw)
  To: Bless, Roland (TM); +Cc: Matt Mathis, bloat

[-- Attachment #1: Type: text/plain, Size: 7203 bytes --]

On Thu, Jul 8, 2021 at 10:28 AM Bless, Roland (TM) <roland.bless@kit.edu>
wrote:

> Hi Neal,
>
> On 08.07.21 at 15:29 Neal Cardwell wrote:
>
> On Thu, Jul 8, 2021 at 7:25 AM Bless, Roland (TM) <roland.bless@kit.edu>
> wrote:
>
>> It seems that in BBRv2 there are many more mechanisms present
>> that try to control the amount of inflight data more tightly and the new
>> "cap"
>> is at 1.25 BDP.
>>
> To clarify, the BBRv2 cwnd cap is not 1.25*BDP. If there is no packet loss
> or ECN, the BBRv2 cwnd cap is the same as BBRv1. But if there has been
> packet loss then conceptually the cwnd cap is the maximum amount of data
> delivered in a single round trip since the last packet loss (with a floor
> to ensure that the cwnd does not decrease by more than 30% per round trip
> with packet loss, similar to CUBIC's 30% reduction in a round trip with
> packet loss). (And upon RTO the BBR (v1 or v2) cwnd is reset to 1, and
> slow-starts upward from there.)
>
> Thanks for the clarification. I'm patiently waiting to see the BBRv2
> mechanisms coherently written up
> in that new BBR Internet-Draft version ;-) Getting this together from the
> "diffs" on the IETF slides or the source code
> is somewhat tedious, so I'll be very grateful for having that single write
> up.
>
> There is an overview of the BBRv2 response to packet loss here:
>
> https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00#page=18
>
> My assumption came from slide 25 of this slide set:
> the probing is terminated if inflight > 1.25 estimated_bdp (or "hard
> ceiling" seen).
> So without experiencing more than 2% packet loss this may end up beyond
> 1.25 estimated_bdp,
>

Yes, that can be the behavior when BBRv2 is probing for bandwidth, but is
not the average or steady-state behavior.


> but would it often end at 2estimated_bdp?
>

That depends on the details of the bottleneck buffer depth, number of
competing flows and what congestion control algorithm they are using, etc.

neal



> Best regards,
>
>  Roland
>
>
>
>
>> This is too large for short queue routers in the Internet core, but it
>> helps a lot with cross traffic on large queue edge routers.
>>
>> Best regards,
>>  Roland
>>
>> [1] https://ieeexplore.ieee.org/document/8117540
>>
>>
>> On Wed, Jul 7, 2021 at 3:19 PM Bless, Roland (TM) <roland.bless@kit.edu>
>> wrote:
>>
>>> Hi Matt,
>>>
>>> [sorry for the late reply, overlooked this one]
>>>
>>> please, see comments inline.
>>>
>>> On 02.07.21 at 21:46 Matt Mathis via Bloat wrote:
>>>
>>> The argument is absolutely correct for Reno, CUBIC and all
>>> other self-clocked protocols.  One of the core assumptions in Jacobson88,
>>> was that the clock for the entire system comes from packets draining
>>> through the bottleneck queue.  In this world, the clock is intrinsically
>>> brittle if the buffers are too small.  The drain time needs to be a
>>> substantial fraction of the RTT.
>>>
>>> I'd like to separate the functions here a bit:
>>>
>>> 1) "automatic pacing" by ACK clocking
>>>
>>> 2) congestion-window-based operation
>>>
>>> I agree that the automatic pacing generated by the ACK clock (function
>>> 1) is increasingly
>>> distorted these days and may consequently cause micro bursts.
>>> This can be mitigated by using paced sending, which I consider very
>>> useful.
>>> However, I consider abandoning the (congestion) window-based approaches
>>> with ACK feedback (function 2) as harmful:
>>> a congestion window has an automatic self-stabilizing property since the
>>> ACK feedback reflects
>>> also the queuing delay and the congestion window limits the amount of
>>> inflight data.
>>> In contrast, rate-based senders risk instability: two senders in an
>>> M/D/1 setting, each sender sending with 50%
>>> bottleneck rate in average, both using paced sending at 120% of the
>>> average rate, suffice to cause
>>> instability (queue grows unlimited).
>>>
>>> IMHO, two approaches seem to be useful:
>>> a) congestion-window-based operation with paced sending
>>> b) rate-based/paced sending with limiting the amount of inflight data
>>>
>>>
>>> However, we have reached the point where we need to discard that
>>> requirement.  One of the side points of BBR is that in many environments it
>>> is cheaper to burn serving CPU to pace into short queue networks than it is
>>> to "right size" the network queues.
>>>
>>> The fundamental problem with the old way is that in some contexts the
>>> buffer memory has to beat Moore's law, because to maintain constant drain
>>> time the memory size and BW both have to scale with the link (laser) BW.
>>>
>>> See the slides I gave at the Stanford Buffer Sizing workshop december
>>> 2019: Buffer Sizing: Position Paper
>>> <https://docs.google.com/presentation/d/1VyBlYQJqWvPuGnQpxW4S46asHMmiA-OeMbewxo_r3Cc/edit#slide=id.g791555f04c_0_5>
>>>
>>>
>>> Thanks for the pointer. I don't quite get the point that the buffer must
>>> have a certain size to keep the ACK clock stable:
>>> in case of an non application-limited sender, a very small buffer
>>> suffices to let the ACK clock
>>> run steady. The large buffers were mainly required for loss-based CCs to
>>> let the standing queue
>>> build up that keeps the bottleneck busy during CWnd reduction after
>>> packet loss, thereby
>>> keeping the (bottleneck link) utilization high.
>>>
>>> Regards,
>>>
>>>  Roland
>>>
>>>
>>> Note that we are talking about DC and Internet core.  At the edge, BW is
>>> low enough where memory is relatively cheap.   In some sense BB came about
>>> because memory is too cheap in these environments.
>>>
>>> Thanks,
>>> --MM--
>>> The best way to predict the future is to create it.  - Alan Kay
>>>
>>> We must not tolerate intolerance;
>>>        however our response must be carefully measured:
>>>             too strong would be hypocritical and risks spiraling out of
>>> control;
>>>             too weak risks being mistaken for tacit approval.
>>>
>>>
>>> On Fri, Jul 2, 2021 at 9:59 AM Stephen Hemminger <
>>> stephen@networkplumber.org> wrote:
>>>
>>>> On Fri, 2 Jul 2021 09:42:24 -0700
>>>> Dave Taht <dave.taht@gmail.com> wrote:
>>>>
>>>> > "Debunking Bechtolsheim credibly would get a lot of attention to the
>>>> > bufferbloat cause, I suspect." - dpreed
>>>> >
>>>> > "Why Big Data Needs Big Buffer Switches" -
>>>> >
>>>> http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
>>>> >
>>>>
>>>> Also, a lot depends on the TCP congestion control algorithm being used.
>>>> They are using NewReno which only researchers use in real life.
>>>>
>>>> Even TCP Cubic has gone through several revisions. In my experience, the
>>>> NS-2 models don't correlate well to real world behavior.
>>>>
>>>> In real world tests, TCP Cubic will consume any buffer it sees at a
>>>> congested link. Maybe that is what they mean by capture effect.
>>>>
>>>> There is also a weird oscillation effect with multiple streams, where
>>>> one
>>>> flow will take the buffer, then see a packet loss and back off, the
>>>> other flow will take over the buffer until it sees loss.
>>>>
>>>> _______________________________________________
>>>
>>> _______________________________________________
>>>
>>>
>>>
>

[-- Attachment #2: Type: text/html, Size: 16818 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] [Cerowrt-devel]  Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 14:40                         ` Jonathan Morton
@ 2021-07-08 20:14                           ` David P. Reed
  2021-07-09  7:10                             ` Erik Auerswald
  0 siblings, 1 reply; 16+ messages in thread
From: David P. Reed @ 2021-07-08 20:14 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: Matt Mathis, Bless, Roland (TM), cerowrt-devel, bloat

[-- Attachment #1: Type: text/plain, Size: 4635 bytes --]


Keep It Simple, Stupid.
 
That's a classic architectural principle that still applies. Unfortunately folks who only think hardware want to add features to hardware, but don't study the actual real world version of the problem.
 
IMO, and it's based on 50 years of experience in network and operating systems performance, latency (response time) is almost always the primary measure users care about. They never care about maximizing "utilization" of resources. After all, in a city, you get maximum utilization of roads when you create a traffic jam. That's not the normal state. In communications, the network should always be at about 10% utilization, because you never want a traffic jam across the whole system to accumulate. Even the old Bell System was engineered to not saturate the links on the worst minute of the worst hour of the worst day of the year (which was often Mother's Day, but could be when a power blackout occurs).
 
Yet, academics become obsessed with achieving constant very high utilization. And sometimes low leve communications folks adopt that value system, until their customers start complaining.
 
Why doesn't this penetrate the Net-Shaped Heads of switch designers and others?
 
What's excellent about what we used to call "best efforts" packet delivery (drop early and often to signal congestion) is that it is robust and puts the onus on the senders of traffic to sort out congestion as quickly as possible. The senders ALL observe congested links quite early if their receivers are paying attention, and they can collaborate *without even knowing who the others congesting the link are*. And by picking the heaviest congestors with higher probability to drop, fq_codel pushes back in a "fair" way when congestion actually crops up. (probabilistically).
 
It isn't the responsibility of routers to get packets through at any cost. It's their responsibility to signal congestion early enough that it doesn't persist very long at all due to source based rate adaptation.
In other words, a router's job is to route packets and do useful telemetry for the end points using it at the instant.
 
Please stop focusing on what is an irrelevant metric (maximum throughput with maximum utilization in a special situation only).
 
Focus on what routers can do well because they actually observe it (instantaneous congestion events) and keep them simple.
.
On Thursday, July 8, 2021 10:40am, "Jonathan Morton" <chromatix99@gmail.com> said:



> > On 8 Jul, 2021, at 4:29 pm, Matt Mathis via Bloat
> <bloat@lists.bufferbloat.net> wrote:
> >
> > That said, it is also true that multi-stream BBR behavior is quite
> complicated and needs more queue space than single stream. This complicates the
> story around the traditional workaround of using multiple streams to compensate
> for Reno & CUBIC lameness at larger scales (ordinary scales today). 
> Multi-stream does not help BBR throughput and raises the queue occupancy, to the
> detriment of other users.
> 
> I happen to think that using multiple streams for the sake of maximising
> throughput is the wrong approach - it is a workaround employed pragmatically by
> some applications, nothing more. If BBR can do just as well using a single flow,
> so much the better.
> 
> Another approach to improving the throughput of a single flow is high-fidelity
> congestion control. The L4S approach to this, derived rather directly from DCTCP,
> is fundamentally flawed in that, not being fully backwards compatible with ECN, it
> cannot safely be deployed on the existing Internet.
> 
> An alternative HFCC design using non-ambiguous signalling would be incrementally
> deployable (thus applicable to Internet scale) and naturally overlaid on existing
> window-based congestion control. It's possible to imagine such a flow reaching
> optimal cwnd by way of slow-start alone, then "cruising" there in a true
> equilibrium with congestion signals applied by the network. In fact, we've
> already shown this occurring under lab conditions; in other cases it still takes
> one CUBIC cycle to get there. BBR's periodic probing phases would not be required
> here.
> 
> > IMHO, two approaches seem to be useful:
> > a) congestion-window-based operation with paced sending
> > b) rate-based/paced sending with limiting the amount of inflight data
> 
> So this corresponds to approach a) in Roland's taxonomy.
> 
> - Jonathan Morton
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
> 

[-- Attachment #2: Type: text/html, Size: 7130 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Bloat] [Cerowrt-devel] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem)
  2021-07-08 20:14                           ` [Bloat] [Cerowrt-devel] " David P. Reed
@ 2021-07-09  7:10                             ` Erik Auerswald
  0 siblings, 0 replies; 16+ messages in thread
From: Erik Auerswald @ 2021-07-09  7:10 UTC (permalink / raw)
  To: bloat; +Cc: cerowrt-devel

Thank you, David!


On Thu, Jul 08, 2021 at 04:14:00PM -0400, David P. Reed wrote:
> 
> Keep It Simple, Stupid.
>  
> That's a classic architectural principle that still applies. Unfortunately folks who only think hardware want to add features to hardware, but don't study the actual real world version of the problem.
>  
> IMO, and it's based on 50 years of experience in network and operating systems performance, latency (response time) is almost always the primary measure users care about. They never care about maximizing "utilization" of resources. After all, in a city, you get maximum utilization of roads when you create a traffic jam. That's not the normal state. In communications, the network should always be at about 10% utilization, because you never want a traffic jam across the whole system to accumulate. Even the old Bell System was engineered to not saturate the links on the worst minute of the worst hour of the worst day of the year (which was often Mother's Day, but could be when a power blackout occurs).
>  
> Yet, academics become obsessed with achieving constant very high utilization. And sometimes low leve communications folks adopt that value system, until their customers start complaining.
>  
> Why doesn't this penetrate the Net-Shaped Heads of switch designers and others?
>  
> What's excellent about what we used to call "best efforts" packet delivery (drop early and often to signal congestion) is that it is robust and puts the onus on the senders of traffic to sort out congestion as quickly as possible. The senders ALL observe congested links quite early if their receivers are paying attention, and they can collaborate *without even knowing who the others congesting the link are*. And by picking the heaviest congestors with higher probability to drop, fq_codel pushes back in a "fair" way when congestion actually crops up. (probabilistically).
>  
> It isn't the responsibility of routers to get packets through at any cost. It's their responsibility to signal congestion early enough that it doesn't persist very long at all due to source based rate adaptation.
> In other words, a router's job is to route packets and do useful telemetry for the end points using it at the instant.
>  
> Please stop focusing on what is an irrelevant metric (maximum throughput with maximum utilization in a special situation only).
>  
> Focus on what routers can do well because they actually observe it (instantaneous congestion events) and keep them simple.
> .
> On Thursday, July 8, 2021 10:40am, "Jonathan Morton" <chromatix99@gmail.com> said:
> 
> 
> 
> > > On 8 Jul, 2021, at 4:29 pm, Matt Mathis via Bloat
> > <bloat@lists.bufferbloat.net> wrote:
> > >
> > > That said, it is also true that multi-stream BBR behavior is quite
> > complicated and needs more queue space than single stream. This complicates the
> > story around the traditional workaround of using multiple streams to compensate
> > for Reno & CUBIC lameness at larger scales (ordinary scales today). 
> > Multi-stream does not help BBR throughput and raises the queue occupancy, to the
> > detriment of other users.
> > 
> > I happen to think that using multiple streams for the sake of maximising
> > throughput is the wrong approach - it is a workaround employed pragmatically by
> > some applications, nothing more. If BBR can do just as well using a single flow,
> > so much the better.
> > 
> > Another approach to improving the throughput of a single flow is high-fidelity
> > congestion control. The L4S approach to this, derived rather directly from DCTCP,
> > is fundamentally flawed in that, not being fully backwards compatible with ECN, it
> > cannot safely be deployed on the existing Internet.
> > 
> > An alternative HFCC design using non-ambiguous signalling would be incrementally
> > deployable (thus applicable to Internet scale) and naturally overlaid on existing
> > window-based congestion control. It's possible to imagine such a flow reaching
> > optimal cwnd by way of slow-start alone, then "cruising" there in a true
> > equilibrium with congestion signals applied by the network. In fact, we've
> > already shown this occurring under lab conditions; in other cases it still takes
> > one CUBIC cycle to get there. BBR's periodic probing phases would not be required
> > here.
> > 
> > > IMHO, two approaches seem to be useful:
> > > a) congestion-window-based operation with paced sending
> > > b) rate-based/paced sending with limiting the amount of inflight data
> > 
> > So this corresponds to approach a) in Roland's taxonomy.
> > 
> > - Jonathan Morton

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-07-09  7:10 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <55fdf513-9c54-bea9-1f53-fe2c5229d7ba@eggo.org>
     [not found] ` <871t4as1h9.fsf@toke.dk>
     [not found]   ` <3D32F19B-5DEA-48AD-97E7-D043C4EAEC51@gmail.com>
     [not found]     ` <alpine.DEB.2.02.1606062029380.28955@uplift.swm.pp.se>
     [not found]       ` <CAD6NSj6vA=bjHt3Txyw8VuV9tqg-A7wvLd6ovJG4Jxabvvjw4g@mail.gmail.com>
     [not found]         ` <1465267957.902610235@apps.rackspace.com>
2021-07-02 16:42           ` [Bloat] Bechtolschiem Dave Taht
2021-07-02 16:59             ` Stephen Hemminger
2021-07-02 17:50               ` Dave Collier-Brown
2021-07-02 19:46               ` Matt Mathis
2021-07-07 22:19                 ` [Bloat] Abandoning Window-based CC Considered Harmful (was Re: Bechtolschiem) Bless, Roland (TM)
2021-07-07 22:38                   ` Matt Mathis
2021-07-08 11:24                     ` Bless, Roland (TM)
2021-07-08 13:29                       ` Matt Mathis
2021-07-08 14:05                         ` Bless, Roland (TM)
2021-07-08 14:40                         ` Jonathan Morton
2021-07-08 20:14                           ` [Bloat] [Cerowrt-devel] " David P. Reed
2021-07-09  7:10                             ` Erik Auerswald
2021-07-08 13:29                       ` [Bloat] " Neal Cardwell
2021-07-08 14:28                         ` Bless, Roland (TM)
2021-07-08 15:47                           ` Neal Cardwell
2021-07-02 20:28               ` [Bloat] Bechtolschiem Jonathan Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox