From: dpreed@reed.com
To: "Jim Gettys" <jg@freedesktop.org>
Cc: cerowrt-devel@lists.bufferbloat.net
Date: Wed, 28 May 2014 11:20:05 -0400 (EDT)
Subject: Re: [Cerowrt-devel] Ubiquiti QOS

I did not mean that "pacing". Sorry I used a generic term. I meant what my longer description described - a specific mechanism for reducing bunching that is essentially "cooperative" among all active flows through a bottlenecked link. That's part of a "closed loop" control system driving each TCP endpoint into a cooperative mode.

The thing you call "pacing" is something quite different. It is disconnected from the TCP control loops involved, which basically means it is flying blind. Introducing that kind of "pacing" almost certainly reduces throughput, because it *delays* packets.

The thing I called "pacing" is in no version of Linux that I know of. Give it a different name: "anti-bunching cooperation" or "timing phase management for congestion reduction". Rather than *delaying* packets, it tries to get packets to avoid bunching only when reducing window size, and it does so by tightening the control loop so that the sender transmits as *soon* as it can - not by dallying and then sending late when it could have sent earlier.

On Tuesday, May 27, 2014 11:23am, "Jim Gettys" <jg@freedesktop.org> said:

On Sun, May 25, 2014 at 4:00 PM, <dpreed@reed.com> wrote:

Not that it is directly relevant, but there is no essential reason to require 50 ms. of buffering. That might be true of some particular QOS-related router algorithm. 50 ms. is about all one can tolerate in any router between source and destination for today's networks - an upper bound rather than a minimum.

The optimum buffer state for throughput is 1-2 packets' worth - in other words, with an MTU of 1500, about 1500-3000 bytes. Only the bottleneck buffer (the input queue to the lowest-speed link along the path) should have this much actually buffered. Buffering more than this increases end-to-end latency beyond its optimal state, and increased end-to-end latency reduces the effectiveness of control loops, creating more congestion.

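A back-of-the-envelope sketch of the two numbers above - the 1500-3000 byte optimum versus the 50 ms. upper bound - with the link rate as an illustrative parameter:

#include <stdio.h>

int main(void)
{
    double gbps = 1.0;                       /* illustrative bottleneck rate */
    double bytes_per_sec = gbps * 1e9 / 8.0; /* 1 Gb/s = 125 MB/s            */
    int mtu = 1500;

    printf("optimum standing queue: %d-%d bytes (1-2 MTU packets)\n",
           mtu, 2 * mtu);
    printf("50 ms upper bound: %.2f MB\n",
           bytes_per_sec * 0.050 / 1e6);     /* 6.25 MB at 1 Gb/s */
    return 0;
}
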
The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows where the bursts might persist for 50 ms. and then die. One reason for this is that source nodes run operating systems that tend to release packets in bursts. That's a whole other discussion - in an ideal world, source nodes would avoid bursty packet releases by keeping the receiver-window control loop "tight" timing-wise: that is, by transmitting a packet immediately at the instant an ACK arrives that increases the window. This would pace the flow - current OS's tend (due to scheduling mismatches) to send bursts of packets, "catching up" on sending that could have been spaced out and done earlier if the feedback from the receiver's advancing window were heeded.

That is, endpoint network stacks (TCP implementations) can worsen congestion by "dallying". The ideal end-to-end flows occupying a congested router would have their packets paced so that the packets end up being sent in the least bursty manner that an application can support. The effect of this pacing is to move the "backlog" for each flow quickly into the source node for that flow, which then provides back pressure on the application driving the flow, which ultimately is necessary to stanch congestion. The ideal congestion control mechanism slows the sender part of the application to a pace that can go through the network without contributing to buffering.

Pacing is in Linux 3.12(?). How long it will take to see widespread deployment is another question, and as for other operating systems, who knows.

See: https://lwn.net/Articles/564978/

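A minimal sketch of the per-socket knob that rides on that fq pacing machinery - SO_MAX_PACING_RATE, which landed in Linux alongside the fq qdisc work; the rate here is illustrative:

#include <sys/socket.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47        /* Linux value, for older libc headers */
#endif

/* Ask the kernel (enforced by the fq qdisc) to pace this socket's
 * packets onto the wire at no more than max_rate bytes per second. */
int cap_pacing_rate(int sock, unsigned int max_rate)
{
    return setsockopt(sock, SOL_SOCKET, SO_MAX_PACING_RATE,
                      &max_rate, sizeof(max_rate));
}
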
Current network stacks (including Linux's) don't achieve that goal - their pushback on application sources is minimal - instead they accumulate buffering internal to the network implementation.

This is much, much less true than it once was. There have been substantial changes in the Linux TCP stack in the last year or two to avoid generating packets before necessary. Again, how long it will take for people to deploy this on Linux (and implement it on other OS's) is a question.

This contributes to end-to-end latency as well. But if you think about it, this is almost as bad as switch-level bufferbloat in terms of degrading user experience. The reason I say "almost" is that there are tools, rarely used in practice, that allow an application to specify that buffering should not build up in the network stack (in the kernel or wherever it is). But the default is not to use those APIs, and to buffer way too much.

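A sketch of the sort of rarely-used knobs alluded to above, assuming the Linux ones here and illustrative sizes: a small SO_SNDBUF bounds how much one flow may buffer inside the kernel at all, and TCP_NOTSENT_LOWAT (Linux 3.12+) keeps not-yet-sendable data in the application instead of the stack:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25         /* Linux value, for older libc headers */
#endif

int limit_stack_buffering(int sock)
{
    int sndbuf = 32 * 1024;   /* cap kernel send buffering for this flow   */
    int lowat  = 16 * 1024;   /* block the app while >16 KB remains unsent */

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
        return -1;
    return setsockopt(sock, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &lowat, sizeof(lowat));
}
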
Remember, the network send stack can act similarly to a congested switch (it is a switch among all the user applications running on that node). If there is a heavy file transfer, the file transfer's buffering acts to increase latency for all other networked communications on that machine.

Traditionally this problem has been thought of only as a within-node fairness issue, but in fact it has a big effect on the switches in between source and destination due to the lack of dispersed pacing of the packets at the source - in other words, the current design does nothing to stem the "burst groups" from a single source mentioned above.

So we do need the source nodes to implement less "bursty" sending stacks. This is especially true for multiplexed source nodes, such as web servers implementing thousands of flows.

A combination of codel-style switch-level buffer management and sender stacks that spread the packets of each TCP flow out over time would improve things a lot. To achieve best throughput, the optimal way to spread packets out on an end-to-end basis is to update the receive window (sending an ACK) at the receiving end as quickly as possible, and to respond to the updated receive window as quickly as possible when it increases.

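A userland sketch of that non-dallying sender discipline (fill_buffer() is a hypothetical application callback; a real stack would do the equivalent internally, per flow): block until the stack can accept data at all, then transmit immediately rather than letting several windows' worth pile up for a later burst.

#include <poll.h>
#include <unistd.h>

extern ssize_t fill_buffer(char *buf, size_t len);   /* hypothetical */

void ack_clocked_send(int sock)
{
    char buf[1460];                 /* one MSS-sized chunk at a time */
    struct pollfd pfd = { .fd = sock, .events = POLLOUT };

    for (;;) {
        /* Sleep until arriving ACKs free send-buffer/window space... */
        if (poll(&pfd, 1, -1) <= 0)
            break;
        /* ...and send the moment they do, instead of dallying until a
         * scheduler tick releases a burst. */
        ssize_t n = fill_buffer(buf, sizeof(buf));
        if (n <= 0 || write(sock, buf, (size_t)n) < 0)
            break;
    }
}
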
Just like the "bufferbloat" issue, the problem is caused by applications like streaming video, file transfers and big web pages that the application programmer sees as having no latency requirement within the flow, so the programmer has no incentive to control pacing. Thus the operating system has to push back on the application's flow somehow, so that the flow ends up paced once it enters the Internet itself. So there's no real problem caused by large buffering in the network stack at the endpoint, as long as the stack's delivery to the Internet is paced by some mechanism, e.g. tight management of receive-window control on an end-to-end basis.

I don't think this can be fixed by cerowrt, so this is out of place here. It's partially ameliorated by cerowrt, if it aggressively drops packets from flows that burst without pacing. fq_codel does this, if the buffer size it aims for is small - but the problem is that the OS stacks don't respond by pacing... they tend to respond by bursting, not because TCP doesn't provide the mechanisms for pacing, but because the OS stack doesn't transmit as soon as it is allowed to - thus building up a burst unnecessarily.

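For reference, a heavily simplified sketch of the CoDel control law that fq_codel applies per flow (the real qdisc is subtler - it keeps state across dropping episodes, among other things - but this is the shape of it; the constants are the published defaults):

#include <math.h>
#include <stdbool.h>
#include <stdint.h>

#define TARGET_US   5000     /* 5 ms acceptable standing queue           */
#define INTERVAL_US 100000   /* 100 ms, on the order of a worst-case RTT */

struct codel {
    bool     dropping;       /* currently in the dropping state?         */
    uint32_t count;          /* drops so far in this episode             */
    uint64_t first_above;    /* deadline set when sojourn exceeds target */
    uint64_t drop_next;      /* next scheduled drop time                 */
};

/* Should the packet dequeued at now_us (enqueued at enq_us) be dropped? */
bool codel_should_drop(struct codel *c, uint64_t enq_us, uint64_t now_us)
{
    uint64_t sojourn_us = now_us - enq_us;   /* time spent in the queue  */

    if (sojourn_us < TARGET_US) {            /* queue drained: all good  */
        c->dropping = false;
        c->first_above = 0;
        return false;
    }
    if (c->first_above == 0)                 /* start the grace period   */
        c->first_above = now_us + INTERVAL_US;
    if (!c->dropping) {
        if (now_us >= c->first_above) {      /* bad for a full interval  */
            c->dropping = true;
            c->count = 1;
            c->drop_next = now_us;           /* drop this packet now     */
            return true;
        }
        return false;
    }
    if (now_us >= c->drop_next) {            /* then drop faster and     */
        c->count++;                          /* faster: interval/sqrt(n) */
        c->drop_next = now_us +
            (uint64_t)(INTERVAL_US / sqrt((double)c->count));
        return true;
    }
    return false;
}
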
Bursts on a flow are thus bad in general. They make congestion happen when it need not.

By far the biggest headache is what the Web does to the network. It has turned the web into a burst generator.

A typical web page may have 10 (or even more) images. See the "connections per page" plot in the link below.

A browser downloads the base page, and then, over N connections, essentially simultaneously downloads those embedded objects. Many/most of them are small in size (4-10 packets). You never even get near slow start.

So you get an IW amount of data per TCP connection, with no pacing, and no congestion avoidance. It is easy to observe 50-100 packets (or more) back to back at the bottleneck.

This is (in practice) the amount you have to buffer today: that burst of packets from a web page. Without flow queuing, you are screwed. With it, it's annoying, but can be tolerated.

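A rough sketch of that burst's size, assuming (purely for illustration) IW10 senders and a page fetched over 10 parallel connections:

#include <stdio.h>

int main(void)
{
    int connections = 10;   /* parallel fetches for one page (illustrative) */
    int iw_segments = 10;   /* IW10: initial window per connection          */
    int mss_bytes   = 1460;

    int packets = connections * iw_segments;
    printf("unpaced burst: ~%d back-to-back packets, ~%d KB\n",
           packets, packets * mss_bytes / 1000);   /* ~100 pkts, ~146 KB */
    return 0;
}
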
I go over this in detail in:
http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/

So far, I don't believe anyone has tried pacing the IW burst of packets. I'd certainly like to see that, but pacing needs to be across TCP connections (host pairs) to have any chance of outwitting the gaming the web has done to the network.

- Jim

On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <swmike@swm.pp.se> said:

> On Sun, 25 May 2014, Dane Medic wrote:
>
> > Is it true that devices with less than 64 MB can't handle QOS? ->
> > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html
>
> At gig speeds you need around 50ms worth of buffering. 1 gigabit/s =
> 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer.
>
> I also don't see why performance and memory size would be relevant, I'd
> say forwarding performance has more to do with CPU speed than anything
> else.
>
> --
> Mikael Abrahamsson    email: swmike@swm.pp.se

_______________________________________________
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel
