From: dpreed@reed.com
To: Mikael Abrahamsson <swmike@swm.pp.se>
Cc: cerowrt-devel@lists.bufferbloat.net
Date: Mon, 26 May 2014 00:49:00 -0400 (EDT)
Subject: Re: [Cerowrt-devel] Ubiquiti QOS

Len Kleinrock and his student proved that the "optimal" state for throughput in the Internet is the 1-2 buffer case. It's easy to think this through...

A simple intuition is that each node that qualifies as a "bottleneck" (meaning that the input rate exceeds the service rate of the outbound queue) works optimally when it is in a double-buffering state - that is, a complete packet arrives for the outbound link during the time the current packet goes out.

That's topology independent. It's equivalent to saying that the number of packets in flight along a path in the optimal state between two endpoints equals the path's share of the bottleneck link's capacity, times the physical minimum RTT for an MTU packet - the amount of "pipelining" that can be achieved along that path.
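As a back-of-the-envelope sketch of that in-flight budget (the rate, RTT, and MTU below are illustrative assumptions, not measurements of any real path):

    def packets_in_flight(share_bps, rtt_s, mtu_bytes=1500):
        """The path's share of bottleneck capacity times its physical
        minimum RTT, expressed in MTU-sized packets (the
        bandwidth-delay product)."""
        return (share_bps * rtt_s) / (mtu_bytes * 8)

    # A flow with a 100 Mbit/s share and a 10 ms physical RTT:
    print(packets_in_flight(100e6, 10e-3))  # ~83.3 packets in flight

Anything beyond that budget has nowhere to go but a queue.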
Having more buffering can only make the throughput lower, or at best the same. In other words, you might get the same throughput with more buffering, but more likely the extra buffering will make things worse. (The rare special case is the "hot rod" scenario of maximizing end-to-end throughput with no competing flows.)

The real congestion management issue, which I described, is the unnecessary "bunching" of packets in a flow. The bunching can be ameliorated at the source endpoint, or controlled by the receive endpoint transmitting an ack only when it receives a packet in the optimal state, but responding immediately so as to tighten the control loop. This is analogous to "impedance matching" in complex networks of transmission lines: bunching corresponds to the standing waves that reduce power transfer when impedances are not approximately matched. Maximum power transfer does not happen if some intermediate point has a bad impedance match, storing energy that interferes with future energy transfer.

Bunching has many causes, but it's better to remove it at the entry to the network than to let it clog up the latency of competing flows.

I'm deliberately not using queueing-theory descriptions, because the queueing theory and control theory associated with networks that can drop packets and have finite buffering with end-to-end feedback congestion control are quite complex, especially for non-Poisson traffic - far beyond what is taught in elementary queueing theory.

But if you want, I can dig that up for you.

The problem with understanding the network congestion phenomenon as a whole is that one cannot carry over intuitions from a single, multi-hop, linear network of nodes to the global network congestion control problem.

[The reason a CSMA/CD (wired) or CSMA/CA (wireless) Ethernet has collision-driven exponential random backoff is the same rationale - it's important to de-bunch the various flows competing for the Ethernet segment. The right level of randomness creates local de-bunching (or pacing) almost as effectively as a perfect, zero-latency admission control that knows the rates of all incoming queues. That is, when a packet ends, all senders with a packet ready to transmit do so. They all collide and back off for different times, de-bunching the packet arrivals the next time around. This may not achieve maximal throughput, however, because there's unnecessary blocking of newly arrived packets on the "backed-off" NICs - but fixing that is a different story, especially when the Ethernet is an internal path in the Internet as a whole; there you need some kind of buffer limiting on each NIC, and ideally to treat each "flow" as a distinct "back-off" entity.]
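A deliberately crude sketch of that de-bunching effect - not a faithful 802.3 model; the sender count and slot arithmetic are illustrative assumptions:

    import random

    def backoff_slots(attempt, max_exp=10):
        # Truncated binary exponential backoff: draw a uniform slot
        # count in [0, 2**min(attempt, max_exp) - 1].
        return random.randrange(2 ** min(attempt, max_exp))

    # Five NICs all have a packet ready the instant the segment goes
    # idle. They collide, redraw, and retry; once one of them draws a
    # uniquely smallest slot it transmits alone, and the rest arrive
    # de-bunched behind it.
    senders = range(5)
    for attempt in range(1, 16):
        draws = {s: backoff_slots(attempt) for s in senders}
        first = min(draws.values())
        if list(draws.values()).count(first) == 1:
            winner = min(draws, key=draws.get)
            print(f"attempt {attempt}: sender {winner} transmits alone")
            break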
The same basic idea - using collision-based backoff to force competing flows to de-bunch themselves, and keeping the end-to-end feedback loops very, very short by avoiding any more than the optimal buffering - leads to a network that can operate at near-optimal throughput *and* near-optimal latency.

This is what I've been calling, in my own terminology, a "ballistic state" of the network - analogous to, but not the same as, a gaseous rather than a liquid or solid phase of matter: the least congested state, with the most fluidity, and therefore the highest throughput of individual molecules (unlike a liquid, which transmits pressure very well but does not transmit individual tagged molecules very fast at all).

That's what Kleinrock and his student showed, counterintuitive though it may seem. (It doesn't seem counterintuitive to me at all, but many, many network engineers are taught, and continue to think, that you need lots of buffering to maximize throughput.)

I conjecture that this is an achievable operating mode of the Internet based solely on end-to-end congestion-control algorithms, probably not very different from TCP, running over a network where each switch signals congestion to all flows passing through it. It's probably the most desirable operating mode, because it maximizes throughput and minimizes latency simultaneously. There's no inherent tradeoff between the two, except when the control algorithms don't deal with the key issues of bunching and unnecessarily slow feedback about the "collision rate" among paths.

It would be instructive to try to prove, in a rigorous way, that this phase of network operation cannot be achieved with an end-to-end control algorithm. Neither the proof of achievability nor the contrary proof of non-achievability has been produced. But there is no doubt that this is an achievable operational state if you have an oracle who knows all future traffic and can schedule it globally: performance is optimal, with no buffering needed.

Attempting to prove or disprove my conjecture would probably produce some really important insights.

The insight we already have is that adding buffering cannot increase throughput once the bottleneck links are in the "double-buffering" state. That state is "Pareto optimal" in a serious sense.
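A minimal fluid-model sketch of that insight, assuming one bottleneck with a fixed service rate and one fixed-window flow (all numbers illustrative):

    def steady_state(window, rate=1.0, base_rtt=10.0):
        """Single bottleneck (rate in pkts/tick), fixed window, no
        losses: throughput caps at the link rate, while any excess
        window just sits in the buffer as added delay."""
        throughput = min(window / base_rtt, rate)
        queued = max(window - rate * base_rtt, 0.0)
        rtt = base_rtt + queued / rate
        return throughput, rtt

    for w in (5, 10, 12, 40):
        t, r = steady_state(w)
        print(f"window={w:>2}: throughput={t:.2f} pkt/tick, rtt={r:.0f} ticks")

Throughput plateaus exactly where the pipe fills; every packet of window beyond that point shows up only as latency.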
On Sunday, May 25, 2014 8:18pm, "Mikael Abrahamsson" <swmike@swm.pp.se> said:

> On Sun, 25 May 2014, dpreed@reed.com wrote:
>
> > The optimum buffer state for throughput is 1-2 packets worth - in other
> > words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck
>
> No, the optimal state for throughput is to have huge buffers and have them
> filled. The optimal state for interactivity is to have very small buffers.
> FQ_CODEL tries to strike a balance between the two at 10ms of buffer. PIE
> does the same around 20ms. In order for PIE to work properly I'd say you
> need 50ms of buffering as a minimum, otherwise you're going to get 100%
> tail drop and multiple sequential drops occasionally (which might be
> desirable to keep interactivity good).
>
> My comment about 50ms is that you seldom need a lot more.
>
> --
> Mikael Abrahamsson    email: swmike@swm.pp.se