From: dpreed@reed.com
To: "David P. Reed" <dpreed@reed.com>
Cc: cerowrt-devel@lists.bufferbloat.net
Date: Thu, 29 May 2014 11:29:30 -0400 (EDT)
Subject: Re: [Cerowrt-devel] Ubiquiti QOS

Note: this is all about "how to achieve and sustain the ballistic phase that is optimal for Internet transport" in an end-to-end based control system like TCP.

I think those who have followed this know that, but I want to make it clear that I'm proposing a significant improvement that requires changes at the OS stacks and changes in the switches' approach to congestion signaling. There are ways to phase it in gradually. In "meshes", etc. it could probably be developed and deployed more quickly - but my thoughts on co-existence with the current TCP stacks and current IP routers are far less precisely worked out.

I am way too busy with my day job to do what needs to be done ... but my sense is that the folks who reduce this to practice will make a HUGE difference to Internet performance. Bigger than getting bloat fixed, and to me that is a major, major potential triumph.

On Thursday, May 29, 2014 8:11am, "David P. Reed" <dpreed@reed.com> said:

ECN-style signaling has the right properties ... just like TTL it can provide valid and current sampling of the packet's environment as it travels. The idea is to sample what is happening at a bottleneck for the packet's flow. The bottleneck is the link with the most likelihood of a collision from flows sharing that link.

A control-theoretic estimator of recent collision likelihood is easy to do at each queue. All active flows would receive that signal, with the busiest ones getting it most quickly. Also it is reasonable to count all potentially colliding flows at all outbound queues, and report that.

The estimator can then provide the signal that each flow responds to.
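
To make the estimator concrete, here is a minimal sketch (in Python, with illustrative names and parameters - this is not any real router's code): a per-queue EWMA of recent collision likelihood, used to ECN-mark packets probabilistically so the busiest flows see the signal soonest.

import random

class QueueCongestionEstimator:
    """Per-queue EWMA estimate of recent collision likelihood (hypothetical sketch)."""

    def __init__(self, gain=0.1):
        self.gain = gain          # EWMA smoothing factor
        self.likelihood = 0.0     # current estimate, in [0, 1]

    def on_packet_arrival(self, queue_depth, active_flows):
        # Treat an arrival that finds the queue occupied while multiple
        # flows share the link as evidence of a "collision".
        collided = 1.0 if queue_depth > 0 and active_flows > 1 else 0.0
        self.likelihood += self.gain * (collided - self.likelihood)

    def should_mark_ecn(self):
        # Mark with probability equal to the estimate: busy flows push
        # more packets through this queue, so they are signaled soonest.
        return random.random() < self.likelihood

Each outbound queue would keep one such estimator, and the marks would return to senders via the receivers' ACKs, as with standard ECN.
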
The problem of "defectors" is best dealt with by punishment... An aggressive packet drop policy that makes causing congestion reduce the cause's throughput and increase its latency is the best kind of answer. Since the router can remember recent flow behavior, it can penalize recent flows.

A Bloom-style filter can remember flow statistics for both of these local policies. A great use for the memory no longer misapplied to buffering...

Simple?
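
As a sketch of that flow memory (hypothetical, nothing deployed): a counting-Bloom-style array of hashed counters per flow 5-tuple, decayed periodically so it remembers only recent behavior, which the drop policy above could consult to penalize repeat offenders.

import hashlib

class FlowMemory:
    """Counting-Bloom-style sketch of recent per-flow activity (illustrative)."""

    def __init__(self, size=4096, hashes=3):
        self.counters = [0] * size
        self.hashes = hashes

    def _slots(self, flow_id):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{flow_id}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % len(self.counters)

    def record(self, flow_id):
        for slot in self._slots(flow_id):
            self.counters[slot] += 1

    def recent_activity(self, flow_id):
        # Bloom-style read: take the minimum counter. Hash collisions can
        # overestimate a flow's activity but never underestimate it.
        return min(self.counters[s] for s in self._slots(flow_id))

    def decay(self):
        # Halve everything periodically so old behavior is forgotten.
        self.counters = [c // 2 for c in self.counters]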

On May 28, 2014, David Lang <david@lang.hm> wrote:

On Wed, 28 May 2014, dpreed@reed.com wrote:

I did not mean that "pacing". Sorry I used a generic term. I meant what my longer description described - a specific mechanism for reducing bunching that is essentially "cooperative" among all active flows through a bottlenecked link. That's part of a "closed loop" control system driving each TCP endpoint into a cooperative mode.

how do you think we can get feedback from the bottleneck node to all the different senders?

what happens to the ones who try to play nice if one doesn't, including what happens if one isn't just ignorant of the new cooperative mode, but actively tries to cheat? (as I understand it, this is the fatal flaw in many of the past buffering improvement proposals)

While the in-house router is the first bottleneck that user's traffic hits, the bigger problems happen when the bottleneck is in the peering between ISPs, many hops away from any sender, with many different senders competing for the available bandwidth.

This is where the new buffering approaches win. If the traffic is below the congestion level, they add very close to zero overhead, but when congestion happens, they manage the resulting buffers in a way that works better for people (allowing short, fast connections to be fast with only a small impact on very long connections)

David Lang

The thing you call "pacing" is something quite different. It is disconnected from the TCP control loops involved, which basically means it is flying blind. Introducing that kind of "pacing" almost certainly reduces throughput, because it *delays* packets.

The thing I called "pacing" is in no version of Linux that I know of. Give it a different name: "anti-bunching cooperation" or "timing phase management for congestion reduction". Rather than *delaying* packets, it tries to get packets to avoid bunching only when reducing window size, and doing so by tightening the control loop so that the sender transmits as *soon* as it can, not by delaying sending after the sender dallies around not sending when it can.






On Tuesday, May 27, 2014 11:23am, "Jim Gettys" <jg@freedesktop.org> said:

On Sun, May 25, 2014 at 4:00 PM, <dpreed@reed.com> wrote:

Not that it is directly relevant, but there is no essential reason to require 50 ms. of buffering. That might be true of some particular QOS-related router algorithm. 50 ms. is about all one can tolerate in any router between source and destination for today's networks - an upper bound rather than a minimum.

The optimum buffer state for throughput is 1-2 packets' worth - in other words, with an MTU of 1500, that is 1500-3000 bytes. Only the bottleneck buffer (the input queue to the lowest-speed link along the path) should have this much actually buffered. Buffering more than this increases end-to-end latency beyond its optimal state. Increased end-to-end latency reduces the effectiveness of control loops, creating more congestion.

The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows where the bursts might persist for 50 ms. and then die. One reason for this is that source nodes run operating systems that tend to release packets in bursts. That's a whole other discussion - in an ideal world, source nodes would avoid bursty packet releases by letting the control by the receiver window be "tight" timing-wise. That is, to transmit a packet immediately at the instant an ACK arrives increasing the window. This would pace the flow - current OS's tend (due to scheduling mismatches) to send bursts of packets, "catching up" on sending that could have been spaced out and done earlier if the feedback from the receiver's window advancing were heeded.
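
A toy model of that "tight" receiver-window control (structure and names invented for illustration, not any OS's stack): transmission happens inside the ACK handler, the instant the window opens, so outgoing packets inherit the spacing of arriving ACKs instead of accumulating into a scheduler-driven burst.

class ToySender:
    """Toy ACK-clocked sender: transmit the moment the window opens."""

    MSS = 1500  # segment size in bytes, for illustration

    def __init__(self, cwnd_segments=10):
        self.next_seq = 0                      # next byte to transmit
        self.snd_una = 0                       # oldest unacknowledged byte
        self.cwnd = cwnd_segments * self.MSS   # congestion window, bytes
        self.rcv_wnd = 0                       # receiver's advertised window

    def usable_window(self):
        return self.snd_una + min(self.cwnd, self.rcv_wnd) - self.next_seq

    def on_ack(self, ack_no, advertised_window):
        self.snd_una = max(self.snd_una, ack_no)
        self.rcv_wnd = advertised_window
        # Send as *soon* as the window permits. In steady state each ACK
        # advances snd_una by about one segment, so transmissions inherit
        # the ACK spacing rather than bunching up.
        while self.usable_window() >= self.MSS:
            self.transmit_segment()

    def transmit_segment(self):
        print(f"send bytes {self.next_seq}..{self.next_seq + self.MSS - 1}")
        self.next_seq += self.MSS

if __name__ == "__main__":
    s = ToySender()
    s.on_ack(ack_no=0, advertised_window=3 * ToySender.MSS)     # opens 3 segments
    s.on_ack(ack_no=1500, advertised_window=3 * ToySender.MSS)  # releases ~1 more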


That is, endpoint network stacks (TCP implementations) can worsen congestion by "dallying". The ideal end-to-end flows occupying a congested router would have their packets paced so that the packets end up being sent in the least bursty manner that an application can support. The effect of this pacing is to move the "backlog" for each flow quickly into the source node for that flow, which then provides back pressure on the application driving the flow, which ultimately is necessary to stanch congestion. The ideal congestion control mechanism slows the sender part of the application to a pace that can go through the network without contributing to buffering.

Pacing is in Linux 3.12(?). How long it will take to see widespread deployment is another question, and as for other operating systems, who knows.
See: https://lwn.net/Articles/564978/
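
For anyone who wants to experiment: that pacing rides on the fq qdisc ("tc qdisc replace dev eth0 root fq"), and from Linux 3.13 on a sender can also cap its own per-socket pacing rate. A small sketch, assuming a recent Linux host; the fallback constant is the value from the kernel's socket headers.

import socket

# SO_MAX_PACING_RATE may be absent from older Python socket modules;
# 47 is its value in the Linux UAPI headers.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the kernel to smooth this flow to at most 12.5 MB/s (100 Mbit/s)
# instead of releasing window-sized bursts at line rate.
sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, 12_500_000)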

Current network stacks (including Linux's) don't achieve that goal - their pushback on application sources is minimal - instead they accumulate buffering internal to the network implementation.

This is much, much less true than it once was. There have been substantial changes in the Linux TCP stack in the last year or two, to avoid generating packets before necessary. Again, how long it will take for people to deploy this on Linux (and implement on other OS's) is a question.

This contributes to end-to-end latency as well. But if you think about it, this is almost as bad as switch-level bufferbloat in terms of degrading user experience. The reason I say "almost" is that there are tools, rarely used in practice, that allow an application to specify that buffering should not build up in the network stack (in the kernel or wherever it is). But the default is not to use those APIs, and to buffer way too much.

Remember, the network send stack can act similarly to a congested switch (it is a switch among all the user applications running on that node). If there is a heavy file transfer, the file transfer's buffering acts to increase latency for all other networked communications on that machine.

Traditionally this problem has been thought of only as a within-node fairness issue, but in fact it has a big effect on the switches in between source and destination due to the lack of dispersed pacing of the packets at the source - in other words, the current design does nothing to stem the "burst groups" from a single source mentioned above.

So we do need the source nodes to implement less "bursty" sending stacks. This is especially true for multiplexed source nodes, such as web servers implementing thousands of flows.

A combination of codel-style switch-level buffer management and the stack at the sender being implemented to spread packets in a particular TCP flow out over time would improve things a lot. To achieve best throughput, the optimal way to spread packets out on an end-to-end basis is to update the receive window (sending ACK) at the receive end as quickly as possible, and to respond to the updated receive window as quickly as possible when it increases.

Just like the "bufferbloat" issue, the problem is caused by applications like streaming video, file transfers and big web pages that the application programmer sees as not having a latency requirement within the flow, so the application programmer does not have an incentive to control pacing. Thus the operating system has got to push back on the applications' flow somehow, so that the flow ends up paced once it enters the Internet itself. So there's no real problem caused by large buffering in the network stack at the endpoint, as long as the stack's delivery to the Internet is paced by some mechanism, e.g. tight management of receive window control on an end-to-end basis.

I don't think this can be fixed by cerowrt, so this is out of place here. It's partially ameliorated by cerowrt, if it aggressively drops packets from flows that burst without pacing. fq_codel does this, if the buffer size it aims for is small - but the problem is that the OS stacks don't respond by pacing... they tend to respond by bursting, not because TCP doesn't provide the mechanisms for pacing, but because the OS stack doesn't transmit as soon as it is allowed to - thus building up a burst unnecessarily.

Bursts on a flow are thus bad in general. They make congestion happen when it need not.

By far the biggest headache is what the Web does to the network. It has turned the web into a burst generator.
A typical web page may have 10 (or even more) images. See the "connections per page" plot in the link below.
A browser downloads the base page, and then, over N connections, essentially simultaneously downloads those embedded objects. Many/most of them are small in size (4-10 packets). You never even get near slow start.
So you get an IW amount of data/TCP connection, with no pacing, and no congestion avoidance. It is easy to observe 50-100 packets (or more) back to back at the bottleneck.
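
Back-of-the-envelope, with illustrative numbers (IW10 per RFC 6928, a typical ~1460-byte MSS, a modestly sharded page load):

connections = 8          # parallel connections for one page load (illustrative)
initial_window = 10      # IW10 segments per connection (RFC 6928)
mss = 1460               # payload bytes per segment

burst_packets = connections * initial_window
burst_bytes = burst_packets * mss
print(burst_packets, "packets,", burst_bytes, "bytes")
# -> 80 packets, 116800 bytes arriving back to back: the burst the
#    bottleneck buffer must absorb before any congestion feedback exists.
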
This is (in practice) the amount you have to buffer today: that burst of packets from a web page. Without flow queuing, you are screwed. With it, it's annoying, but can be tolerated.
I go over this in detail in:

http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/

So far, I don't believe anyone has tried pacing the IW burst of packets. I'd certainly like to see that, but pacing needs to be across TCP connections (host pairs) to have any chance of outwitting the gaming the web has done to the network.

- Jim

On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <swmike@swm.pp.se> said:



On Sun, 25 May 2014, Dane Medic wrote:

Is it true that devices with less than 64 MB can't handle QOS? ->
https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html

At gig speeds you need around 50 ms worth of buffering. 1 gigabit/s = 125 megabytes/s, meaning for 50 ms you need 6.25 megabytes of buffer.
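
Spelled out, since this is just buffer = rate x delay:

rate_bits = 1_000_000_000          # 1 gigabit/s
rate_bytes = rate_bits / 8         # = 125,000,000 bytes/s (125 MB/s)
buffer_bytes = rate_bytes * 0.050  # 50 ms worth
print(buffer_bytes / 1e6, "MB")    # -> 6.25 MB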

I also don't see why performance and memory size would be relevant, I'd say forwarding performance has more to do with CPU speed than anything else.

--
Mikael Abrahamsson    email: swmike@swm.pp.se

