From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp91.iad3a.emailsrvr.com (smtp91.iad3a.emailsrvr.com
[173.203.187.91])
(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
(No client certificate requested)
by lists.bufferbloat.net (Postfix) with ESMTPS id 2F2613B29D
for ; Tue, 28 Sep 2021 18:15:56 -0400 (EDT)
Received: from app50.wa-webapps.iad3a (relay-webapps.rsapps.net
[172.27.255.140])
by smtp4.relay.iad3a.emailsrvr.com (SMTP Server) with ESMTP id 8E60D547F;
Tue, 28 Sep 2021 18:15:55 -0400 (EDT)
Received: from deepplum.com (localhost.localdomain [127.0.0.1])
by app50.wa-webapps.iad3a (Postfix) with ESMTP id 7A8E1600BC;
Tue, 28 Sep 2021 18:15:55 -0400 (EDT)
Received: by apps.rackspace.com
(Authenticated sender: dpreed@deepplum.com, from: dpreed@deepplum.com)
with HTTP; Tue, 28 Sep 2021 18:15:55 -0400 (EDT)
X-Auth-ID: dpreed@deepplum.com
Date: Tue, 28 Sep 2021 18:15:55 -0400 (EDT)
From: "David P. Reed"
To: "Bob Briscoe"
Cc: "Dave Taht" ,
"Mohit P. Tahiliani" ,
"Asad Sajjad Ahmed" ,
"ECN-Sane"
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_20210928181555000000_87796"
Importance: Normal
X-Priority: 3 (Normal)
X-Type: html
In-Reply-To:
References:
X-Client-IP: 209.6.168.128
Message-ID: <1632867355.4986972@apps.rackspace.com>
X-Mailer: webmail/19.0.13-RC
X-Classification-ID: 9b73869d-a9ac-4937-ad95-d6a0359fffb0-1-1
Subject: Re: [Ecn-sane] paper idea: praising smaller packets
X-BeenThere: ecn-sane@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion of explicit congestion notification's impact on the
Internet
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 28 Sep 2021 22:15:56 -0000
------=_20210928181555000000_87796
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Upon thinking about this, here's a radical idea:

the expected time until a bottleneck link clears, that is, 0 packets are in the queue to be sent on it, must be < t, where t is an Internet-wide constant corresponding to the time it takes light to circle the earth.

This is a local constraint, one that is required of a router. It can be achieved in any of a variety of ways (for example choosing to route different flows on different paths that don't include the bottleneck link).

It need not be true at all times - but when I say "expected time", I mean that the queue's behavior is monitored so that this situation is quite rare over any interval of ten minutes or more.
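
To make that concrete, here is a minimal sketch (mine, purely illustrative - the class name, the 1% "quite rare" threshold, and the sampling scheme are arbitrary choices, not a proposal) of how a router could check the constraint locally:

EARTH_CIRCUMFERENCE_M = 40_075_000                  # metres
SPEED_OF_LIGHT_M_S = 299_792_458                    # metres per second
T_BOUND_S = EARTH_CIRCUMFERENCE_M / SPEED_OF_LIGHT_M_S   # ~0.134 s

WINDOW_S = 600.0       # "any interval of ten minutes or more"
RARE_FRACTION = 0.01   # hypothetical definition of "quite rare"

class DrainTimeMonitor:
    def __init__(self):
        self.samples = []   # (timestamp, drain_time_exceeded_bound)

    def observe(self, now_s, queue_bytes, link_rate_bps):
        """Record whether the queue would currently take longer than t to clear."""
        drain_time_s = (queue_bytes * 8) / link_rate_bps
        self.samples.append((now_s, drain_time_s > T_BOUND_S))
        # Keep only the last ten minutes of samples.
        cutoff = now_s - WINDOW_S
        self.samples = [(ts, v) for ts, v in self.samples if ts >= cutoff]

    def violates_constraint(self):
        """True if exceeding t was not 'quite rare' over the last window."""
        if not self.samples:
            return False
        exceed = sum(1 for _, v in self.samples if v)
        return exceed / len(self.samples) > RARE_FRACTION

# Example: a 1 Gbit/s link holding a 20 MB standing queue needs ~0.16 s to
# drain, already above the ~0.134 s bound.
mon = DrainTimeMonitor()
mon.observe(now_s=0.0, queue_bytes=20_000_000, link_rate_bps=1_000_000_000)
print(mon.samples[-1])   # (0.0, True)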
If a bottleneck link is continuously full for more than the time it takes for packets on a fiber (< light speed) to circle the earth, it is in REALLY bad shape. That must never happen.

Why is this important?

It's a matter of control theory - if the control loop delay gets longer than its minimum, instability tends to take over no matter what control discipline is used to manage the system.
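
A toy numerical illustration of that control-theory point (a generic delayed-feedback loop, not a model of any particular queue or AQM; the gain and delay values are arbitrary): the same gain that settles quickly with little delay starts oscillating and growing once the loop delay is stretched.

def simulate(loop_delay, gain=0.8, steps=60, x0=1.0):
    # Plant: x[k+1] = x[k] + u[k]; the controller only sees a measurement
    # that is loop_delay steps old: u[k] = -gain * x[k - loop_delay].
    x = [x0]
    for k in range(steps):
        measured = x[k - loop_delay] if k >= loop_delay else x0
        x.append(x[k] - gain * measured)
    return x

for d in (0, 1, 3):
    trace = simulate(loop_delay=d)
    print("delay", d, "-> final |x| =", round(abs(trace[-1]), 3))
# With the same gain, delays 0 and 1 settle toward 0, while delay 3
# oscillates with growing amplitude instead of settling.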
Now, it is important as hell to avoid bullshit research programs that try to "optimize" utilization of link capacity at 100%. Those research programs focus on the absolute wrong measure - a proxy for "network capital cost" that is in fact the wrong measure of any real network operator's cost structure. The cost of media (wires, airtime, ...) is a tiny fraction of most network operations' cost in any real business or institution. We don't optimize highways by maximizing the number of cars on every stretch of highway, for obvious reasons, but also for non-obvious reasons.

Latency and lack of flexibility or reconfigurability impose real costs on a system that are far more significant to end-user value than the cost of the media.

Sustained congestion of a bottleneck link is not a feature, but a very serious operational engineering error. People should be fired if they don't prevent that from ever happening, or allow it to persist.

This is why telcos, for example, design networks to handle the expected maximum traffic with some excess capacity. This is why networks are constantly being upgraded as load increases, *before* overloads occur.

It's an incredibly dangerous and arrogant assumption that operation in a congested mode is acceptable.

That's the rationale for the "radical proposal".

Sadly, academic thinkers (even ones who have worked in industry research labs on minor aspects) get drawn into solving the wrong problem - optimizing the case that should never happen.

Sure that's helpful - but only in the same sense that when designing systems where accidents need to have fallbacks one needs to design the fallback system to work.

Operating at a fully congested state - or designing TCP to essentially come close to DDoS behavior on a bottleneck to get a publishable paper - is missing the point.
On Monday, September 27, 2021 10:50am, "Bob Briscoe" <research@bobbriscoe.net> said:

> Dave,
>
> On 26/09/2021 21:08, Dave Taht wrote:
> > ... an exploration of smaller mss sizes in response to persistent congestion
> >
> > This is in response to two declarative statements in here that I've
> > long disagreed with,
> > involving NOT shrinking the mss, and not trying to do pacing...
>
> I would still avoid shrinking the MSS, 'cos you don't know if the
> congestion constraint is the CPU, in which case you'll make congestion
> worse. But we'll have to differ on that if you disagree.
>
> I don't think that paper said don't do pacing. In fact, it says "...pace
> the segments at less than one per round trip..."
>
> Whatever, that paper was the problem statement, with just some ideas on
> how we were going to solve it.
> After that, Asad (added to the distro) did his whole Masters thesis on
> this - I suggest you look at his thesis and code (pointers below).
>
> Also soon after he'd finished, changes to BBRv2 were introduced to
> reduce queuing delay with large numbers of flows. You might want to take
> a look at that too:
> https://datatracker.ietf.org/meeting/106/materials/slides-106-iccrg-update-on-bbrv2#page=10
>
> >
> > https://www.bobbriscoe.net/projects/latency/sub-mss-w.pdf
> >
> > Otherwise, for a change, I largely agree with Bob.
> >
> > "No amount of AQM twiddling can fix this. The solution has to fix TCP."
> >
> > "nearly all TCP implementations cannot operate at less than two packets per
> RTT"
>
> Back to Asad's Master's thesis, we found that just pacing out the
> packets wasn't enough. There's a very brief summary of the 4 things we
> found we had to do in 4 bullets in this section of our write-up for netdev:
> https://bobbriscoe.net/projects/latency/tcp-prague-netdev0x13.pdf#subsubsection.3.1.6
> And I've highlighted a couple of unexpected things that cropped up below.
>
> Asad's full thesis:
> Ahmed, A., "Extending TCP for Low Round Trip Delay",
> Masters Thesis, Uni Oslo, August 2019,
> <https://www.duo.uio.no/handle/10852/70966>.
> Asad's thesis presentation:
> https://bobbriscoe.net/presents/1909submss/present_asadsa.pdf
>
> Code:
> https://bitbucket.org/asadsa/kernel420/src/submss/
> Despite significant changes to basic TCP design principles, the diffs
> were not that great.
>
> A number of tricky problems came up.
>
> * For instance, simple pacing when <1 ACK per RTT wasn't that simple.
> Whenever there were bursts from cross-traffic, the consequent burst in
> your own flow kept repeating in subsequent rounds. We realized this was
> because you never have a real ACK clock (you always set the next send
> time based on previous send times). So we set up the next send time
> but then re-adjusted it if/when the next ACK did actually arrive.
>
> * The additive increase of one segment was the other main problem. When
> you have such a small window, multiplicative decrease scales fine, but
> an additive increase of 1 segment is a huge jump in comparison, when
> cwnd is a fraction of a segment. "Logarithmically scaled additive
> increase" was our solution to that (basically, every time you set
> ssthresh, alter the additive increase constant using a formula that
> scales logarithmically with ssthresh, so it's still roughly 1 for the
> current Internet scale).
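
Purely to make those two bullets concrete, here is a rough sketch of what such mechanisms might look like - this is not the code from Asad's thesis or the kernel420 branch, and the function names, constants, and the exact logarithmic formula are my guesses for illustration only:

import math

def next_send_time(prev_send_time, interval, ack_time=None):
    # Pace from previous send times, but re-clock if an ACK actually arrives.
    # With cwnd < 1 segment there is no real ACK clock, so the next send is
    # provisionally scheduled from the previous send; if the expected ACK
    # shows up, the schedule is re-based on it so a burst caused by
    # cross-traffic doesn't repeat in every subsequent round.
    provisional = prev_send_time + interval
    if ack_time is not None:
        return ack_time + interval   # re-adjust to the real feedback event
    return provisional

def additive_increase_per_rtt(ssthresh_segments, internet_scale_segments=64):
    # One plausible reading of "logarithmically scaled additive increase":
    # shrink the per-RTT increase logarithmically with ssthresh so a
    # sub-segment cwnd doesn't jump by a whole segment, while staying
    # roughly 1 segment/RTT at today's typical window sizes.
    return min(1.0,
               math.log2(ssthresh_segments + 1) /
               math.log2(internet_scale_segments + 1))

# e.g. ssthresh of 64 segments -> increase of ~1 segment/RTT;
#      ssthresh of 0.5 segments -> increase of ~0.1 segment/RTT.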
>
> What became of Asad's work?
> Altho the code finally worked pretty well {1}, we decided not to pursue
> it further 'cos a minimum cwnd actually gives a trickle of throughput
> protection against unresponsive flows (with the downside that it
> increases queuing delay). That's not to say this isn't worth working on
> further, but there was more to do to make it bullet proof, and we were
> in two minds how important it was, so it worked its way down our
> priority list.
>
> {Note 1: From memory, there was an outstanding problem with one flow
> remaining dominant if you had step-ECN marking, which we worked out was
> due to the logarithmically scaled additive increase, but we didn't work
> on it further to fix it.}
>
>
> Bob
>
> --
> ________________________________________________________________
> Bob Briscoe                               http://bobbriscoe.net/
>
> _______________________________________________
> Ecn-sane mailing list
> Ecn-sane@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/ecn-sane
>
------=_20210928181555000000_87796--