From: dpreed@reed.com
To: Mikael Abrahamsson <swmike@swm.pp.se>
Cc: cerowrt-devel@lists.bufferbloat.net
Date: Mon, 26 May 2014 00:49:00 -0400 (EDT)
Subject: Re: [Cerowrt-devel] Ubiquiti QOS

Len Kleinrock and his student proved that the "optimal" state for throughput in the Internet is the 1-2 buffer case. It's easy to think this through...

A simple intuition is that each node that qualifies as a "bottleneck" (meaning that the input rate exceeds the service rate of the outbound queue) works optimally when it is in a double-buffering state - that is, a complete packet arrives for the outbound link during the time the current packet goes out.

That's topology independent. It's equivalent to saying that the number of packets in flight along a path in the optimal state between two endpoints equals the path's share of the bottleneck link's capacity, times the physical minimum RTT for an MTU packet - the amount of "pipelining" that can be achieved along that path.
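As a back-of-the-envelope sketch of that in-flight budget (the rate, RTT, and MTU below are illustrative assumptions, not measurements of any real path):

    def packets_in_flight(share_bps, rtt_s, mtu_bytes=1500):
        """The path's share of bottleneck capacity times its physical
        minimum RTT, expressed in MTU-sized packets (the
        bandwidth-delay product)."""
        return (share_bps * rtt_s) / (mtu_bytes * 8)

    # A flow with a 100 Mbit/s share and a 10 ms physical RTT:
    print(packets_in_flight(100e6, 10e-3))  # ~83.3 packets in flight

Anything beyond that budget has nowhere to go but a queue.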
Having more buffering can only make the throughput lower, or at best the same. In other words, you might get the same throughput with more buffering, but more likely the extra buffering will make things worse. (The rare special case is the "hot rod" scenario of maximizing end-to-end throughput with no competing flows.)

The real congestion management issue, which I described, is the unnecessary "bunching" of packets in a flow. The bunching can be ameliorated at the source endpoint, or controlled by the receive endpoint transmitting an ack only when it receives a packet in the optimal state, but responding immediately so as to tighten the control loop. This is analogous to "impedance matching" in complex networks of transmission lines: bunching corresponds to the standing waves that reduce power transfer when impedances are not approximately matched. Maximum power transfer does not happen if some intermediate point has a bad impedance match, storing energy that interferes with future energy transfer.

Bunching has many causes, but it's better to remove it at the entry to the network than to let it clog up the latency of competing flows.

I'm deliberately not using queueing-theory descriptions, because the queueing theory and control theory associated with networks that can drop packets and have finite buffering with end-to-end feedback congestion control are quite complex, especially for non-Poisson traffic - far beyond what is taught in elementary queueing theory.

But if you want, I can dig that up for you.

The problem with understanding the network congestion phenomenon as a whole is that one cannot carry over intuitions from a single, multi-hop, linear network of nodes to the global network congestion control problem.

[The reason a CSMA/CD (wired) or CSMA/CA (wireless) Ethernet has collision-driven exponential random backoff is the same rationale - it's important to de-bunch the various flows competing for the Ethernet segment. The right level of randomness creates local de-bunching (or pacing) almost as effectively as a perfect, zero-latency admission control that knows the rates of all incoming queues. That is, when a packet ends, all senders with a packet ready to transmit do so. They all collide and back off for different times, de-bunching the packet arrivals the next time around. This may not achieve maximal throughput, however, because there's unnecessary blocking of newly arrived packets on the "backed-off" NICs - but fixing that is a different story, especially when the Ethernet is an internal path in the Internet as a whole; there you need some kind of buffer limiting on each NIC, and ideally to treat each "flow" as a distinct "back-off" entity.]
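A deliberately crude sketch of that de-bunching effect - not a faithful 802.3 model; the sender count and slot arithmetic are illustrative assumptions:

    import random

    def backoff_slots(attempt, max_exp=10):
        # Truncated binary exponential backoff: draw a uniform slot
        # count in [0, 2**min(attempt, max_exp) - 1].
        return random.randrange(2 ** min(attempt, max_exp))

    # Five NICs all have a packet ready the instant the segment goes
    # idle. They collide, redraw, and retry; once one of them draws a
    # uniquely smallest slot it transmits alone, and the rest arrive
    # de-bunched behind it.
    senders = range(5)
    for attempt in range(1, 16):
        draws = {s: backoff_slots(attempt) for s in senders}
        first = min(draws.values())
        if list(draws.values()).count(first) == 1:
            winner = min(draws, key=draws.get)
            print(f"attempt {attempt}: sender {winner} transmits alone")
            break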
The same basic idea - using collision-based backoff to force competing flows to de-bunch themselves, and keeping the end-to-end feedback loops very, very short by avoiding any more than the optimal buffering - leads to a network that can operate at near-optimal throughput *and* near-optimal latency.

This is what I've been calling, in my own terminology, a "ballistic state" of the network - analogous to, but not the same as, a gaseous rather than a liquid or solid phase of matter: the least congested state, with the most fluidity, and therefore the highest throughput of individual molecules (unlike a liquid, which transmits pressure very well but does not transmit individual tagged molecules very fast at all).

That's what Kleinrock and his student showed, counterintuitive though it may seem. (It doesn't seem counterintuitive to me at all, but many, many network engineers are taught, and continue to think, that you need lots of buffering to maximize throughput.)

I conjecture that this is an achievable operating mode of the Internet based solely on end-to-end congestion-control algorithms, probably not very different from TCP, running over a network where each switch signals congestion to all flows passing through it. It's probably the most desirable operating mode, because it maximizes throughput and minimizes latency simultaneously. There's no inherent tradeoff between the two, except when the control algorithms don't deal with the key issues of bunching and unnecessarily slow feedback about the "collision rate" among paths.

It would be instructive to try to prove, in a rigorous way, that this phase of network operation cannot be achieved with an end-to-end control algorithm. Neither the proof of achievability nor the contrary proof of non-achievability has been produced. But there is no doubt that this is an achievable operational state if you have an oracle who knows all future traffic and can schedule it globally: performance is optimal, with no buffering needed.

Attempting to prove or disprove my conjecture would probably produce some really important insights.

The insight we already have is that adding buffering cannot increase throughput once the bottleneck links are in the "double-buffering" state. That state is "Pareto optimal" in a serious sense.
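A minimal fluid-model sketch of that insight, assuming one bottleneck with a fixed service rate and one fixed-window flow (all numbers illustrative):

    def steady_state(window, rate=1.0, base_rtt=10.0):
        """Single bottleneck (rate in pkts/tick), fixed window, no
        losses: throughput caps at the link rate, while any excess
        window just sits in the buffer as added delay."""
        throughput = min(window / base_rtt, rate)
        queued = max(window - rate * base_rtt, 0.0)
        rtt = base_rtt + queued / rate
        return throughput, rtt

    for w in (5, 10, 12, 40):
        t, r = steady_state(w)
        print(f"window={w:>2}: throughput={t:.2f} pkt/tick, rtt={r:.0f} ticks")

Throughput plateaus exactly where the pipe fills; every packet of window beyond that point shows up only as latency.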
On Sunday, May 25, 2014 8:18pm, "Mikael Abrahamsson" <swmike@swm.pp.se> said:

> On Sun, 25 May 2014, dpreed@reed.com wrote:
>
> > The optimum buffer state for throughput is 1-2 packets worth - in other
> > words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck
>
> No, the optimal state for throughput is to have huge buffers and have them
> filled. The optimal state for interactivity is to have very small buffers.
> FQ_CODEL tries to strike a balance between the two at 10ms of buffer. PIE
> does the same around 20ms. In order for PIE to work properly I'd say you
> need 50ms of buffering as a minimum, otherwise you're going to get 100%
> tail drop and multiple sequential drops occasionally (which might be
> desirable to keep interactivity good).
>
> My comment about 50ms is that you seldom need a lot more.
>
> --
> Mikael Abrahamsson    email: swmike@swm.pp.se