From: "David P. Reed" <dpreed@deepplum.com>
To: "Bob Briscoe"
Cc: "Dave Taht", "Mohit P. Tahiliani", "Asad Sajjad Ahmed", "ECN-Sane" <ecn-sane@lists.bufferbloat.net>
Date: Tue, 28 Sep 2021 18:15:55 -0400 (EDT)
Subject: Re: [Ecn-sane] paper idea: praising smaller packets

Upon thinking about this, here's a radical idea:

The expected time until a bottleneck link clears - that is, until 0 packets are in the queue to be sent on it - must be < t, where t is an Internet-wide constant corresponding to the time it takes light to circle the earth.

This is a local constraint, one that is required of a router. It can be achieved in any of a variety of ways (for example, choosing to route different flows on different paths that don't include the bottleneck link).

It need not be true at all times - but when I say "expected time", I mean that the queue's behavior is monitored so that this situation is quite rare over any interval of ten minutes or more.
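As a minimal sketch of what monitoring that constraint could look like (illustrative Python, not router code - the ~134 ms value of t assumes "light circling the earth" means roughly 40,000 km at c, and the "quite rare" threshold is left as a tunable fraction):

from collections import deque
import time

T_LIMIT = 0.134   # seconds; assumed value of t (~40,000 km / c)
WINDOW = 600.0    # seconds; "any interval of ten minutes or more"

class QueueClearanceMonitor:
    """Tracks how often the expected time for a bottleneck queue to clear
    exceeds T_LIMIT over the last WINDOW seconds."""

    def __init__(self, link_rate_bytes_per_sec):
        self.rate = link_rate_bytes_per_sec
        self.samples = deque()            # (timestamp, violated) pairs

    def sample(self, queued_bytes, now=None):
        now = time.monotonic() if now is None else now
        expected_clear = queued_bytes / self.rate   # crude drain-time estimate
        self.samples.append((now, expected_clear > T_LIMIT))
        while self.samples and now - self.samples[0][0] > WINDOW:
            self.samples.popleft()                  # keep only the last ten minutes
        return expected_clear

    def violation_fraction(self):
        """Fraction of recent samples violating the constraint; what counts
        as "quite rare" is not quantified above, so pick your own bound."""
        if not self.samples:
            return 0.0
        return sum(v for _, v in self.samples) / len(self.samples)

A router (or a box watching its queue) would call sample() on every queue-length observation and raise an alarm whenever violation_fraction() stops being "quite rare".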
If a bottleneck link is continuously full for more than the time it takes for packets on a fiber (< light speed) to circle the earth, it is in REALLY bad shape. That must never happen.

Why is this important?

It's a matter of control theory - if the control loop delay gets longer than its minimum, instability tends to take over no matter what control discipline is used to manage the system.

Now, it is important as hell to avoid bullshit research programs that try to "optimize" utilization of link capacity at 100%. Those research programs focus on the absolutely wrong measure - a proxy for "network capital cost" that is in fact the wrong measure of any real network operator's cost structure. The cost of media (wires, airtime, ...) is a tiny fraction of most network operations' cost in any real business or institution. We don't optimize highways by maximizing the number of cars on every stretch of highway, for obvious reasons, but also for non-obvious ones.

Latency and lack of flexibility or reconfigurability impose real costs on a system that are far more significant to end-user value than the cost of the media.

Sustained congestion of a bottleneck link is not a feature, but a very serious operational engineering error. People should be fired if they don't prevent that from ever happening, or allow it to persist.

This is why telcos, for example, design networks to handle the expected maximum traffic with some excess capacity, and why networks are constantly being upgraded as load increases, *before* overloads occur.

It's an incredibly dangerous and arrogant assumption that operation in a congested mode is acceptable.

That's the rationale for the "radical proposal".

Sadly, academic thinkers (even ones who have worked in industry research labs on minor aspects) get drawn into solving the wrong problem - optimizing the case that should never happen.

Sure, that's helpful - but only in the same sense that, when designing systems where accidents need fallbacks, one needs to design the fallback system to work.

Operating in a fully congested state - or designing TCP to essentially come close to DDoS behavior on a bottleneck to get a publishable paper - is missing the point.


On Monday, September 27, 2021 10:50am, "Bob Briscoe" <research@bobbriscoe.net> said:

> Dave,
> 
> On 26/09/2021 21:08, Dave Taht wrote:
> > ... an exploration of smaller mss sizes in response to persistent congestion
> >
> > This is in response to two declarative statements in here that I've
> > long disagreed with,
> > involving NOT shrinking the mss, and not trying to do pacing...
> 
> I would still avoid shrinking the MSS, 'cos you don't know if the
> congestion constraint is the CPU, in which case you'll make congestion
> worse. But we'll have to differ on that if you disagree.
> 
> I don't think that paper said don't do pacing. In fact, it says "...pace
> the segments at less than one per round trip..."
> 
> Whatever, that paper was the problem statement, with just some ideas on
> how we were going to solve it.
> After that, Asad (added to the distro) did his whole Masters thesis on
> this - I suggest you look at his thesis and code (pointers below).
> 
> Also soon after he'd finished, changes to BBRv2 were introduced to
> reduce queuing delay with large numbers of flows. You might want to take
> a look at that too:
> https://datatracker.ietf.org/meeting/106/materials/slides-106-iccrg-update-on-bbrv2#page=10
> 
> >
> > https://www.bobbriscoe.net/projects/latency/sub-mss-w.pdf
> >
> > Otherwise, for a change, I largely agree with bob.
> >
> > "No amount of AQM twiddling can fix this. The solution has to fix TCP."
> >
> > "nearly all TCP implementations cannot operate at less than two packets per
> > RTT"
> 
> Back to Asad's Master's thesis, we found that just pacing out the
> packets wasn't enough.
> There's a very brief summary of the 4 things we
> found we had to do in 4 bullets in this section of our write-up for netdev:
> https://bobbriscoe.net/projects/latency/tcp-prague-netdev0x13.pdf#subsubsection.3.1.6
> And I've highlighted a couple of unexpected things that cropped up below.
> 
> Asad's full thesis:
>     Ahmed, A., "Extending TCP for Low Round Trip Delay",
>     Masters Thesis, Uni Oslo, August 2019,
>     <https://www.duo.uio.no/handle/10852/70966>.
> Asad's thesis presentation:
>     https://bobbriscoe.net/presents/1909submss/present_asadsa.pdf
> 
> Code:
>     https://bitbucket.org/asadsa/kernel420/src/submss/
> Despite significant changes to basic TCP design principles, the diffs
> were not that great.
> 
> A number of tricky problems came up.
> 
> * For instance, simple pacing when <1 ACK per RTT wasn't that simple.
> Whenever there were bursts from cross-traffic, the consequent burst in
> your own flow kept repeating in subsequent rounds. We realized this was
> because you never have a real ACK clock (you always set the next send
> time based on previous send times). So we set up the next send time
> but then re-adjusted it if/when the next ACK did actually arrive.
> 
> * The additive increase of one segment was the other main problem. When
> you have such a small window, multiplicative decrease scales fine, but
> an additive increase of 1 segment is a huge jump in comparison, when
> cwnd is a fraction of a segment. "Logarithmically scaled additive
> increase" was our solution to that (basically, every time you set
> ssthresh, alter the additive increase constant using a formula that
> scales logarithmically with ssthresh, so it's still roughly 1 for the
> current Internet scale).
> 
> What became of Asad's work?
> Altho the code finally worked pretty well {1}, we decided not to pursue
> it further 'cos a minimum cwnd actually gives a trickle of throughput
> protection against unresponsive flows (with the downside that it
> increases queuing delay). That's not to say this isn't worth working on
> further, but there was more to do to make it bullet proof, and we were
> in two minds how important it was, so it worked its way down our
> priority list.
> 
> {Note 1: From memory, there was an outstanding problem with one flow
> remaining dominant if you had step-ECN marking, which we worked out was
> due to the logarithmically scaled additive increase, but we didn't work
> on it further to fix it.}
> 
> 
> Bob
> 
> --
> ________________________________________________________________
> Bob Briscoe                               http://bobbriscoe.net/
> 
> _______________________________________________
> Ecn-sane mailing list
> Ecn-sane@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/ecn-sane
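To make the "logarithmically scaled additive increase" above concrete, here is a rough, purely illustrative sketch - the 64-segment reference window and the exact formula are guesses at what "a formula that scales logarithmically with ssthresh" might look like, not the code from Asad's thesis:

import math

REFERENCE_WINDOW = 64.0   # segments; stand-in for "current Internet scale" (assumption)

def additive_increase(ssthresh_segments):
    """Segments added to cwnd per RTT, recomputed whenever ssthresh is set."""
    if ssthresh_segments >= REFERENCE_WINDOW:
        return 1.0                      # classic one-segment-per-RTT increase
    # ~1 near the reference window, far less than 1 when ssthresh is a
    # fraction of a segment, so the increase stays in proportion to cwnd
    return math.log2(1.0 + ssthresh_segments) / math.log2(1.0 + REFERENCE_WINDOW)

# e.g. additive_increase(64) == 1.0, additive_increase(0.25) ~= 0.05

With numbers like these, a sub-MSS cwnd grows by only a few hundredths of a segment per RTT, keeping the additive step comparable to the multiplicative decrease instead of dwarfing it.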