Date: Wed, 28 May 2014 11:33:42 -0400 (EDT)
From: dpreed@reed.com
To: "Dave Taht"
References:
<1401048053.664331760@apps.rackspace.com>
Message-ID: <1401291222.288942@apps.rackspace.com>
Cc: "cerowrt-devel@lists.bufferbloat.net"
, bloat
Subject: Re: [Cerowrt-devel] Ubiquiti QOS
Same concern I mentioned with Jim's message. I was not clear what I meant by
"pacing" in the context of optimization of latency while preserving
throughput. It is NOT just a matter of spreading packets out in time that I
was talking about. It is a matter of doing so without reducing throughput.
That means transmitting as *early* as possible while avoiding congestion.
Building a "backlog" and then artificially spreading it out by "add-on
pacing" will definitely reduce throughput below the flow's fair share of the
bottleneck resource.

It is pretty clear to me that you can't get to a minimal latency, optimal
throughput control algorithm by a series of "add ons" in LART. It requires
rethinking of the control discipline, and changes to get more information
about congestion earlier, without ever allowing a buffer queue to build up in
intermediate nodes - since that destroys latency by definition.

As long as you require buffers to grow at bottleneck links in order to get
measurements of congestion, you probably are stuck with long-time-constant
control loops, and as long as you encourage buffering at OS send stacks you
are even worse off at the application layer.

The problem is in the assumption that buffer queueing is the only possible
answer. The "pacing" being included in Linux is just another way to build
bigger buffers (on the sending host), by taking control away from the TCP
control loop.
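For reference, the rate-based pacing under discussion works roughly like the
sketch below (Python; the function names and the 2x gain are assumptions for
illustration, not the kernel's actual code): derive a rate from the
congestion window and the SRTT estimate, then space departures at that rate.

def pacing_rate_bps(cwnd_pkts, mss_bytes, srtt_s, gain=2.0):
    """Pace so roughly one cwnd of data leaves per SRTT, scaled by a gain."""
    return gain * cwnd_pkts * mss_bytes * 8 / srtt_s

def inter_packet_gap_s(mss_bytes, rate_bps):
    """Time between packet departures at the paced rate."""
    return mss_bytes * 8 / rate_bps

if __name__ == "__main__":
    # e.g. an IW10 flow with a 40 ms SRTT estimate
    rate = pacing_rate_bps(cwnd_pkts=10, mss_bytes=1448, srtt_s=0.040)
    gap = inter_packet_gap_s(1448, rate)
    print("pacing rate ~%.1f Mbit/s, gap ~%.2f ms" % (rate / 1e6, gap * 1e3))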
t" said:=0A=0A=0A=0A> This has been a good thread, an=
d I'm sorry it was mostly on=0A> cerowrt-devel rather than the main list...=
=0A> =0A> It is not clear from observing google's deployment that pacing of=
the=0A> IW is not in use. I see=0A> clear 1ms boundaries for individual fl=
ows on much lower than iw10=0A> boundaries. (e.g. I see 1-4=0A> packets at =
a time arrive at 1ms intervals - but this could be an=0A> artifact of the c=
apture, intermediate=0A> devices, etc)=0A> =0A> sch_fq comes with explicit =
support for spreading out the initial=0A> window, (by default it allows a f=
ull iw10 burst however) and tcp small=0A> queues and pacing-aware tcps and =
the tso fixes and stuff we don't know=0A> about all are collaborating to re=
duce the web burst size...=0A> =0A> sch_fq_codel used as the host/router qd=
isc basically does spread out=0A> any flow if there is a bottleneck on the =
link. The pacing stuff=0A> spreads flow delivery out across an estimate of =
srtt by clock tick...=0A> =0A> It makes tremendous sense to pace out a flow=
if you are hitting the=0A> wire at 10gbit and know you are stepping down t=
o 100mbit or less on=0A> the end device - that 100x difference in rate is m=
eaningful... and at=0A> the same time to get full throughput out of 10gbit =
some level of tso=0A> offloads is needed... and the initial guess=0A> at th=
e right pace is hard to get right before a couple RTTs go by.=0A> =0A> I lo=
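For concreteness, the step-down arithmetic looks roughly like this (Python;
assuming full-size 1500-byte packets and an iw10 burst, at the two rates
mentioned above):

PKT_BITS = 1500 * 8          # bits per full-size packet
burst_bits = 10 * PKT_BITS   # an unpaced iw10 burst

for name, rate_bps in [("10 Gbit/s", 10e9), ("100 Mbit/s", 100e6)]:
    ms = burst_bits / rate_bps * 1e3
    print("%s: iw10 burst serializes in %.3f ms" % (name, ms))

# The burst arrives in ~0.012 ms at 10 Gbit/s but needs ~1.2 ms to drain at
# 100 Mbit/s, so without pacing it sits in the step-down device's queue.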
> I look forward to learning what's up.
>
> On Tue, May 27, 2014 at 8:23 AM, Jim Gettys <jg@freedesktop.org> wrote:
> >
> > On Sun, May 25, 2014 at 4:00 PM, <dpreed@reed.com> wrote:
> >>
> >> Not that it is directly relevant, but there is no essential reason to
> >> require 50 ms. of buffering. That might be true of some particular
> >> QOS-related router algorithm. 50 ms. is about all one can tolerate in any
> >> router between source and destination for today's networks - an
> >> upper-bound rather than a minimum.
> >>
> >> The optimum buffer state for throughput is 1-2 packets worth - in other
> >> words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck
> >> buffer (the input queue to the lowest speed link along the path) should
> >> have this much actually buffered. Buffering more than this increases
> >> end-to-end latency beyond its optimal state. Increased end-to-end latency
> >> reduces the effectiveness of control loops, creating more congestion.
>
> This misses an important facet of modern macs (wifi, wireless, cable, and
> gpon), which can aggregate 32k or more in packets.
>
> So the ideal size in those cases is much larger than a MTU, and has
> additional factors governing the ideal - such as the probability of a
> packet loss inducing a retransmit....
>
> Ethernet, sure.
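Rough numbers behind that point (illustrative only; 32 KB is the aggregate
size quoted above, and real aggregation limits vary by MAC):

MTU = 1500                       # bytes per full-size packet
aggregate_bytes = 32 * 1024      # one aggregate as quoted above
print("%d B aggregate ~= %d MTU-sized packets"
      % (aggregate_bytes, aggregate_bytes // MTU))
# ~21 packets, so "1-2 aggregates" rather than "1-2 MTUs" of buffering is
# needed to fill a single transmit opportunity on an aggregating link.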
> >> The rationale for having 50 ms. of buffering is probably to avoid
> >> disruption of bursty mixed flows where the bursts might persist for
> >> 50 ms. and then die. One reason for this is that source nodes run
> >> operating systems that tend to release packets in bursts. That's a whole
> >> other discussion - in an ideal world, source nodes would avoid bursty
> >> packet releases by letting the control by the receiver window be "tight"
> >> timing-wise. That is, to transmit a packet immediately at the instant an
> >> ACK arrives increasing the window. This would pace the flow - current
> >> OS's tend (due to scheduling mismatches) to send bursts of packets,
> >> "catching up" on sending that could have been spaced out and done
> >> earlier if the feedback from the receiver's window advancing were
> >> heeded.
>
> This loop has got ever tighter since linux 3.3, to where it's really as
> tight as a modern cpu scheduler can get it. (or so I keep thinking -
> but successive improvements in linux tcp keep proving me wrong. :)
>
> I am really in awe of linux tcp these days. Recently I was benchmarking
> windows and macos. Windows only got 60% of the throughput linux tcp
> did at gigE speeds, and osx had a lot of issues at 10mbit and below
> (stretch acks and holding the window too high for the path).
>
> I keep hoping better ethernet hardware will arrive that can mix flows
> even more.
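A toy model of the contrast being described - the "tight" ACK clock versus a
sender that wakes on a coarse timer and catches up in a burst (Python; the
timings and names are made up for illustration, this is not any real stack):

def ack_clocked(ack_times, mss=1448):
    # one segment leaves at the instant each window-opening ACK arrives
    return [(t, mss) for t in ack_times]

def timer_batched(ack_times, tick=0.010, mss=1448):
    # window openings accumulate until the next scheduler tick, then the
    # backlog leaves as one burst
    sends = {}
    for t in ack_times:
        slot = round((int(t / tick) + 1) * tick, 6)
        sends[slot] = sends.get(slot, 0) + mss
    return sorted(sends.items())

acks = [i * 0.002 for i in range(10)]         # an ACK every 2 ms
print("ack-clocked  :", ack_clocked(acks))    # ten single-MSS departures
print("timer-batched:", timer_batched(acks))  # two 5-MSS bursts, 10 ms apart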
> >> That is, endpoint network stacks (TCP implementations) can worsen
> >> congestion by "dallying". The ideal end-to-end flows occupying a
> >> congested router would have their packets paced so that the packets end
> >> up being sent in the least bursty manner that an application can
> >> support. The effect of this pacing is to move the "backlog" for each
> >> flow quickly into the source node for that flow, which then provides
> >> back pressure on the application driving the flow, which ultimately is
> >> necessary to stanch congestion. The ideal congestion control mechanism
> >> slows the sender part of the application to a pace that can go through
> >> the network without contributing to buffering.
> >
> > Pacing is in Linux 3.12(?). How long it will take to see widespread
> > deployment is another question, and as for other operating systems, who
> > knows.
> >
> > See: https://lwn.net/Articles/564978/
>
> Steinar drove some of this with persistence and results...
>
> http://www.linux-support.com/cms/steinar-h-gunderson-paced-tcp-and-the-fq-scheduler/
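One per-socket knob that goes with that fq pacing work is SO_MAX_PACING_RATE,
which caps a socket's send rate in bytes per second. A hedged sketch (Linux
only; the numeric fallback 47 is the asm-generic value, and the rate chosen
here is just an example):

import socket

SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

def cap_pacing_rate(sock, bytes_per_sec):
    # ask the kernel (enforced by the fq qdisc) not to exceed this send rate
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, bytes_per_sec)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cap_pacing_rate(s, 12500000)   # ~100 Mbit/s
s.close()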
=0A> >>=0A> >>=0A> >> Current network stacks (including Linux's) don't achi=
eve that goal -=0A> their=0A> >> pushback on application sources is minimal=
- instead they accumulate=0A> >> buffering internal to the network impleme=
ntation.=0A> >=0A> >=0A> > This is much, much less true than it once was. =
There have been substantial=0A> > changes in the Linux TCP stack in the las=
t year or two, to avoid generating=0A> > packets before necessary. Again, =
how long it will take for people to deploy=0A> > this on Linux (and impleme=
nt on other OS's) is a question.=0A> =0A> The data centers I'm in (linode, =
isc, google cloud) seem to be=0A> tracking modern kernels pretty good...=0A=
> =0A> >>=0A> >> This contributes to end-to-end latency as well. But if yo=
u think about=0A> >> it, this is almost as bad as switch-level bufferbloat =
in terms of=0A> degrading=0A> >> user experience. The reason I say "almost=
" is that there are tools,=0A> rarely=0A> >> used in practice, that allow a=
n application to specify that buffering=0A> should=0A> >> not build up in t=
he network stack (in the kernel or wherever it is). =0A> But=0A> >> the def=
ault is not to use those APIs, and to buffer way too much.=0A> >>=0A> >>=0A=
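Two examples of the kind of knobs being alluded to on Linux (the text above
doesn't name them, so treat the choice as an assumption) are SO_SNDBUF, which
bounds how much a socket may queue in the kernel, and TCP_NOTSENT_LOWAT,
which keeps not-yet-sent data in the application rather than the stack. A
hedged sketch - the numeric fallback 25 is the value in linux/tcp.h, and the
thresholds are arbitrary:

import socket

TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)  # Linux-specific

def limit_send_side_buffering(sock, sndbuf_bytes=64 * 1024,
                              notsent_lowat=16 * 1024):
    # cap how far the kernel send buffer may grow for this socket
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    # only report the socket writable once unsent data falls below this
    # threshold, so the backlog stays in the application, not the stack
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, notsent_lowat)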
> >>
> >> Remember, the network send stack can act similarly to a congested switch
> >> (it is a switch among all the user applications running on that node). IF
> >> there is a heavy file transfer, the file transfer's buffering acts to
> >> increase latency for all other networked communications on that machine.
> >>
> >> Traditionally this problem has been thought of only as a within-node
> >> fairness issue, but in fact it has a big effect on the switches in
> >> between source and destination due to the lack of dispersed pacing of
> >> the packets at the source - in other words, the current design does
> >> nothing to stem the "burst groups" from a single source mentioned above.
> >>
> >> So we do need the source nodes to implement less "bursty" sending
> >> stacks. This is especially true for multiplexed source nodes, such as
> >> web servers implementing thousands of flows.
> >>
> >> A combination of codel-style switch-level buffer management and the
> >> stack at the sender being implemented to spread packets in a particular
> >> TCP flow out over time would improve things a lot. To achieve best
> >> throughput, the optimal way to spread packets out on an end-to-end basis
> >> is to update the receive window (sending ACK) at the receive end as
> >> quickly as possible, and to respond to the updated receive window as
> >> quickly as possible when it increases.
> >>
> >> Just like the "bufferbloat" issue, the problem is caused by applications
> >> like streaming video, file transfers and big web pages that the
> >> application programmer sees as not having a latency requirement within
> >> the flow, so the application programmer does not have an incentive to
> >> control pacing. Thus the operating system has got to push back on the
> >> applications' flow somehow, so that the flow ends up paced once it
> >> enters the Internet itself. So there's no real problem caused by large
> >> buffering in the network stack at the endpoint, as long as the stack's
> >> delivery to the Internet is paced by some mechanism, e.g. tight
> >> management of receive window control on an end-to-end basis.
> >>
> >> I don't think this can be fixed by cerowrt, so this is out of place
> >> here. It's partially ameliorated by cerowrt, if it aggressively drops
> >> packets from flows that burst without pacing. fq_codel does this, if the
> >> buffer size it aims for is small - but the problem is that the OS stacks
> >> don't respond by pacing... they tend to respond by bursting, not because
> >> TCP doesn't provide the mechanisms for pacing, but because the OS stack
> >> doesn't transmit as soon as it is allowed to - thus building up a burst
> >> unnecessarily.
> >>
> >> Bursts on a flow are thus bad in general. They make congestion happen
> >> when it need not.
> >
> > By far the biggest headache is what the Web does to the network. It has
> > turned the web into a burst generator.
> >
> > A typical web page may have 10 (or even more) images. See the
> > "connections per page" plot in the link below.
> >
> > A browser downloads the base page, and then, over N connections,
> > essentially simultaneously downloads those embedded objects. Many/most
> > of them are small in size (4-10 packets). You never even get near slow
> > start.
> >
> > So you get an IW amount of data/TCP connection, with no pacing, and no
> > congestion avoidance. It is easy to observe 50-100 packets (or more)
> > back to back at the bottleneck.
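Illustrative arithmetic for that burst (the per-page connection count and the
bottleneck rate here are assumptions; real pages and links vary widely):

IW = 10                 # initial window per connection, packets (IW10)
MTU = 1500              # bytes per packet, roughly
connections = 8         # assumed number of simultaneous connections

burst_pkts = connections * IW
burst_bits = burst_pkts * MTU * 8
drain_ms = burst_bits / 20e6 * 1e3    # drain time at an assumed 20 Mbit/s link

print("%d packets (~%d kB) can arrive back to back;"
      % (burst_pkts, burst_pkts * MTU // 1000))
print("at 20 Mbit/s that is ~%.0f ms of queue unless something paces or drops"
      % drain_ms)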
> >
> > This is (in practice) the amount you have to buffer today: that burst of
> > packets from a web page. Without flow queuing, you are screwed. With it,
> > it's annoying, but can be tolerated.
> >
> > I go over this in detail in:
> >
> > http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/
> >
> > So far, I don't believe anyone has tried pacing the IW burst of packets.
> > I'd certainly like to see that, but pacing needs to be across TCP
> > connections (host pairs) to be possibly effective to outwit the gaming
> > the web has done to the network.
> >                                                    - Jim
> >
> >> On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson"
> >> <swmike@swm.pp.se> said:
> >>
> >> > On Sun, 25 May 2014, Dane Medic wrote:
> >> >
> >> > > Is it true that devices with less than 64 MB can't handle QOS? ->
> >> > > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html
> >> >
> >> > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s =
> >> > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer.
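For the record, that arithmetic checks out (Python):

rate_bps = 1e9                       # 1 gigabit/s
delay_s = 0.050                      # 50 ms of buffering
print("%.2f megabytes" % (rate_bps * delay_s / 8 / 1e6))   # 6.25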
> >> >
> >> > I also don't see why performance and memory size would be relevant,
> >> > I'd say forwarding performance has more to do with CPU speed than
> >> > anything else.
> >> >
> >> > --
> >> > Mikael Abrahamsson    email: swmike@swm.pp.se
> >>
> >
>
> --
> Dave Täht
>
> NSFW:
> https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article