From: Vint Cerf
Date: Wed, 29 Sep 2021 05:26:53 -0400
To: "David P. Reed"
Cc: Bob Briscoe, "Mohit P. Tahiliani", ECN-Sane, Asad Sajjad Ahmed
Subject: Re: [Ecn-sane] paper idea: praising smaller packets

thanks David - I really like your clear distinction between avoidance and
optimized congestion.

v

On Tue, Sep 28, 2021 at 6:15 PM David P. Reed wrote:

> Upon thinking about this, here's a radical idea:
>
> the expected time until a bottleneck link clears, that is, 0 packets are
> in the queue to be sent on it, must be < t, where t is an Internet-wide
> constant corresponding to the time it takes light to circle the earth.
>
> This is a local constraint, one that is required of a router. It can be
> achieved in any of a variety of ways (for example, choosing to route
> different flows on different paths that don't include the bottleneck link).
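
A quick back-of-the-envelope check of that constant t and of the drain-time
constraint it implies - a sketch with round-number assumptions; the helper
and figures below are illustrative, not from the mail:

    # Rough value of Reed's constant t: the time a signal takes to circle
    # the earth in fiber. All figures are round-number assumptions.
    EARTH_CIRCUMFERENCE_M = 40_075_000    # equatorial circumference, meters
    FIBER_SPEED_M_PER_S = 2.0e8           # ~2/3 of c in glass

    T_SECONDS = EARTH_CIRCUMFERENCE_M / FIBER_SPEED_M_PER_S  # ~0.2 s
    # (~0.134 s at vacuum light speed; either way, on the order of 100-200 ms)

    def expected_drain_time_s(queue_bytes: int, link_rate_bps: float) -> float:
        """Expected time until the bottleneck queue empties at line rate."""
        return queue_bytes * 8 / link_rate_bps

    # The proposed local constraint on a router's bottleneck queue:
    # e.g. 1.25 MB queued on a 100 Mb/s link drains in 0.1 s, which is < t.
    assert expected_drain_time_s(1_250_000, 100e6) < T_SECONDS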
> It need not be true at all times - but when I say "expected time", I mean
> that the queue's behavior is monitored so that this situation is quite rare
> over any interval of ten minutes or more.
>
> If a bottleneck link is continuously full for more than the time it takes
> for packets on a fiber (< light speed) to circle the earth, it is in REALLY
> bad shape. That must never happen.
>
> Why is this important?
>
> It's a matter of control theory - if the control loop delay gets longer
> than its minimum, instability tends to take over no matter what control
> discipline is used to manage the system.
>
> Now, it is important as hell to avoid bullshit research programs that try
> to "optimize" utilization of link capacity at 100%. Those research
> programs focus on the absolute wrong measure - a proxy for "network capital
> cost" that is in fact the wrong measure of any real network operator's cost
> structure. The cost of media (wires, airtime, ...) is a tiny fraction of
> most network operations' cost in any real business or institution. We don't
> optimize highways by maximizing the number of cars on every stretch of
> highway, for obvious reasons, but also for non-obvious reasons.
>
> Latency and lack of flexibility or reconfigurability impose real costs on
> a system that are far more significant to end-user value than the cost of
> the media.
>
> Sustained congestion of a bottleneck link is not a feature, but a very
> serious operational engineering error. People should be fired if they don't
> prevent that from ever happening, or allow it to persist.
>
> This is why telcos, for example, design networks to handle the expected
> maximum traffic with some excess capacity. This is why networks are
> constantly being upgraded as load increases, *before* overloads occur.
>
> It's an incredibly dangerous and arrogant assumption that operation in a
> congested mode is acceptable.
>
> That's the rationale for the "radical proposal".
>
> Sadly, academic thinkers (even ones who have worked in industry research
> labs on minor aspects) get drawn into solving the wrong problem -
> optimizing the case that should never happen.
>
> Sure that's helpful - but only in the same sense that, when designing
> systems where accidents need fallbacks, one needs to design the fallback
> system to work.
>
> Operating at a fully congested state - or designing TCP to essentially come
> close to DDoS behavior on a bottleneck to get a publishable paper - is
> missing the point.
>
> On Monday, September 27, 2021 10:50am, "Bob Briscoe"
> <research@bobbriscoe.net> said:
>
> > Dave,
> >
> > On 26/09/2021 21:08, Dave Taht wrote:
> > > ... an exploration of smaller mss sizes in response to persistent
> > > congestion
> > >
> > > This is in response to two declarative statements in here that I've
> > > long disagreed with,
> > > involving NOT shrinking the mss, and not trying to do pacing...
> >
> > I would still avoid shrinking the MSS, 'cos you don't know if the
> > congestion constraint is the CPU, in which case you'll make congestion
> > worse. But we'll have to differ on that if you disagree.
> >
> > I don't think that paper said don't do pacing. In fact, it says "...pace
> > the segments at less than one per round trip..."
> >
> > Whatever, that paper was the problem statement, with just some ideas on
> > how we were going to solve it.
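
To make "less than one per round trip" concrete: a paced sender spaces
segments RTT/cwnd apart, so once cwnd drops below one segment the gap
between sends exceeds a full RTT. A minimal sketch (names and example
figures are illustrative, not from the paper or from Asad's code):

    # Pacing gap for a (possibly fractional) congestion window.
    # The classic paced rate is cwnd/RTT segments per second, so
    # consecutive segments are RTT/cwnd apart.
    def pacing_gap_s(cwnd_segments: float, rtt_s: float) -> float:
        return rtt_s / cwnd_segments

    pacing_gap_s(10.0, 0.02)   # 0.002 s gap: ordinary flow, 10 segments/RTT
    pacing_gap_s(0.25, 0.02)   # 0.080 s gap: one segment every 4 RTTs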
> > after that, Asad (added to the distro) did his whole Masters thesis on
> > this - I suggest you look at his thesis and code (pointers below).
> >
> > Also, soon after he'd finished, changes to BBRv2 were introduced to
> > reduce queuing delay with large numbers of flows. You might want to take
> > a look at that too:
> > https://datatracker.ietf.org/meeting/106/materials/slides-106-iccrg-update-on-bbrv2#page=10
> >
> > > https://www.bobbriscoe.net/projects/latency/sub-mss-w.pdf
> > >
> > > Otherwise, for a change, I largely agree with Bob.
> > >
> > > "No amount of AQM twiddling can fix this. The solution has to fix TCP."
> > >
> > > "nearly all TCP implementations cannot operate at less than two
> > > packets per RTT"
> >
> > Back to Asad's Master's thesis, we found that just pacing out the
> > packets wasn't enough. There's a very brief summary of the 4 things we
> > found we had to do in 4 bullets in this section of our write-up for netdev:
> > https://bobbriscoe.net/projects/latency/tcp-prague-netdev0x13.pdf#subsubsection.3.1.6
> > And I've highlighted a couple of unexpected things that cropped up below.
> >
> > Asad's full thesis:
> >     Ahmed, A., "Extending TCP for Low Round Trip Delay",
> >     Masters Thesis, Uni Oslo, August 2019,
> >     <https://www.duo.uio.no/handle/10852/70966>.
> > Asad's thesis presentation:
> >     https://bobbriscoe.net/presents/1909submss/present_asadsa.pdf
> >
> > Code:
> >     https://bitbucket.org/asadsa/kernel420/src/submss/
> > Despite significant changes to basic TCP design principles, the diffs
> > were not that great.
> >
> > A number of tricky problems came up.
> >
> > * For instance, simple pacing when <1 ACK per RTT wasn't that simple.
> > Whenever there were bursts from cross-traffic, the consequent burst in
> > your own flow kept repeating in subsequent rounds. We realized this was
> > because you never have a real ACK clock (you always set the next send
> > time based on previous send times). So we set up the next send time
> > but then re-adjusted it if/when the next ACK did actually arrive.
> >
> > * The additive increase of one segment was the other main problem. When
> > you have such a small window, multiplicative decrease scales fine, but
> > an additive increase of 1 segment is a huge jump in comparison, when
> > cwnd is a fraction of a segment. "Logarithmically scaled additive
> > increase" was our solution to that (basically, every time you set
> > ssthresh, alter the additive increase constant using a formula that
> > scales logarithmically with ssthresh, so it's still roughly 1 for the
> > current Internet scale).
> >
> > What became of Asad's work?
> > Altho the code finally worked pretty well {1}, we decided not to pursue
> > it further 'cos a minimum cwnd actually gives a trickle of throughput
> > protection against unresponsive flows (with the downside that it
> > increases queuing delay). That's not to say this isn't worth working on
> > further, but there was more to do to make it bulletproof, and we were
> > in two minds how important it was, so it worked its way down our
> > priority list.
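
For concreteness, one way that "logarithmically scaled additive increase"
could look. Bob describes only its shape (recomputed whenever ssthresh is
set, logarithmic in ssthresh, roughly 1 segment at today's window sizes),
so the exact formula below is an illustrative assumption, not Asad's code:

    import math

    def additive_increase_segments(ssthresh_segments: float) -> float:
        """Per-RTT additive-increase step, in segments (illustrative).

        Scales logarithmically with ssthresh: the step stays ~1 segment
        at ordinary Internet window sizes, but shrinks in proportion
        when ssthresh falls to a fraction of a segment.
        """
        return min(1.0, math.log2(1.0 + ssthresh_segments))

    additive_increase_segments(100.0)  # 1.0: classic Reno step at normal scale
    additive_increase_segments(0.25)   # ~0.32: gentler step for sub-MSS cwnd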
> > {Note 1: From memory, there was an outstanding problem with one flow
> > remaining dominant if you had step-ECN marking, which we worked out was
> > due to the logarithmically scaled additive increase, but we didn't work
> > on it further to fix it.}
> >
> > Bob
> >
> > --
> > ________________________________________________________________
> > Bob Briscoe                               http://bobbriscoe.net/

--
Please send any postal/overnight deliveries to:
Vint Cerf
1435 Woodhurst Blvd
McLean, VA 22102
703-448-0965

until further notice