From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp111.iad.emailsrvr.com (smtp111.iad.emailsrvr.com [207.97.245.111]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by huchra.bufferbloat.net (Postfix) with ESMTPS id 0D66321F11D; Sat, 24 Nov 2012 08:36:43 -0800 (PST) Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp31.relay.iad1a.emailsrvr.com (SMTP Server) with ESMTP id DB8A93E01C1; Sat, 24 Nov 2012 11:36:41 -0500 (EST) X-Virus-Scanned: OK Received: from legacy7.wa-web.iad1a (legacy7.wa-web.iad1a.rsapps.net [192.168.2.216]) by smtp31.relay.iad1a.emailsrvr.com (SMTP Server) with ESMTP id 95E3A3E0131; Sat, 24 Nov 2012 11:36:41 -0500 (EST) Received: from reed.com (localhost [127.0.0.1]) by legacy7.wa-web.iad1a (Postfix) with ESMTP id 81B463200B0; Sat, 24 Nov 2012 11:36:41 -0500 (EST) Received: by apps.rackspace.com (Authenticated sender: dpreed@reed.com, from: dpreed@reed.com) with HTTP; Sat, 24 Nov 2012 11:36:41 -0500 (EST) Date: Sat, 24 Nov 2012 11:36:41 -0500 (EST) From: dpreed@reed.com To: "Dave Taht" MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_20121124113641000000_73103" Importance: Normal X-Priority: 3 (Normal) X-Type: html In-Reply-To: References: <20121123221842.GD2829@linux.vnet.ibm.com> <87a9u7amon.fsf@toke.dk> Message-ID: <1353775001.529817528@apps.rackspace.com> X-Mailer: webmail7.0 Cc: Paolo Valente , =?utf-8?Q?Toke_H=C3=B8iland-J=C3=B8rgensen?= , Eric Raymond , codel@lists.bufferbloat.net, cerowrt-devel@lists.bufferbloat.net, bloat , paulmck@linux.vnet.ibm.com, John Crispin Subject: Re: [Codel] =?utf-8?q?=5BCerowrt-devel=5D_FQ=5FCodel_lwn_draft_articl?= =?utf-8?q?e_review?= X-BeenThere: codel@lists.bufferbloat.net X-Mailman-Version: 2.1.13 Precedence: list List-Id: CoDel AQM discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 24 Nov 2012 16:36:43 -0000 
------=_20121124113641000000_73103
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

All the points below make sense. Ideally you want to measure the TCP FQ Codel interaction in the "real world". Throughput benchmarks are irrelevant, the equivalent of Hot Rod amateur dragstrip competitions among cars that cannot even turn corners.

Beyond being hard, there is no "agreed upon" standard for testing "real world" performance - which is why academics who care little about anything other than publishing go for the "Hot Rod" stuff.

In your lwn posting, I think it is worth pointing out that "wrongheaded benchmarks" were exactly what drove the folks who created the bufferbloat problem in the first place. And those people are still alive and kicking (in the wrong direction). But that's how you get tenure.

The other issue is "KISS". I would *seriously* suggest that the idea of "classification" not get too entangled with the problem at this point.

Classification has many downsides, most of which will just confuse the inventors, adding what is probably an unnecessarily complex space of design alternatives. If you must discuss classification (which is another academic wet dream), discuss it as "future research".

Two classes (latency critical, and latency as short as possible) should be enough in a network that for "control loop" reasons wants to have minimal control latencies *all of the time*. I'm not sure that two is the desired state - I tend to think 1 class is better on an end-to-end basis.

If you want to stabilize things with faster control loops, just order all queues by "packet entry" timestamps, and move ECN-style marking towards "head-marking" - that is signaling congestion in packets that are being transmitted if any packets are queued behind them.

That creates the most responsive control loops possible on an end-to-end basis for TCP and other congestion-managing protocols.

-----Original Message-----
From: "Dave Taht" <dave.taht@gmail.com>
Sent: Saturday, November 24, 2012 11:19am
To: "Toke Høiland-Jørgensen" <toke@toke.dk>
Cc: "Paolo Valente" <paolo.valente@unimore.it>, "Eric Raymond" <esr@thyrsus.com>, codel@lists.bufferbloat.net, cerowrt-devel@lists.bufferbloat.net, "bloat" <bloat@lists.bufferbloat.net>, paulmck@linux.vnet.ibm.com, "David Woodhouse" <dwmw2@infradead.org>, "John Crispin" <blogic@openwrt.org>
Subject: Re: [Cerowrt-devel] FQ_Codel lwn draft article review

On Sat, Nov 24, 2012 at 1:07 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
>> I am using these two in a new "Effectiveness of FQ-CoDel" section.
>> Chrome can display .svg, and if it becomes a problem, I am sure that
>> they can be converted. Please let me know if some other data would
>> make the point better.

My primary focus has been on making the kind of internet over a
billion people have, function better, that with <10Mbit uplinks. While
it's nice to show an improvement on 100Mbit, gigE and higher, I'd
rather talk to the 10Mbit and below cases whenever possible.

>
> If you are just trying to show the "ideal" effectiveness of fq_codel,
> two attached graphs are from some old tests we did at the UDS showing a
> simple ethernet link between two laptops with a single stream going in
> each direction. This is of course by no means a real-world test, but on
> the other hand they show a very visible factor ~4 improvement in
> latency.
>
> These are the same graphs Dave used in his slides, but also in a 100mbit
> version.

As noted above, 10Mbit is better to show. Secondly, in looking over
the 10Mbit graph, I realized that we could also keep injecting new
tcps at intervals of every 5 seconds, for shorter periods, to observe
what happens.

And more importantly, I'd like to avoid falling into the trap that so
much network research falls into, which is blithely benchmarking lots
of long duration TCP traffic,
rather than the kinds of network traffic we actually see in the real
world. A real world web page might have a hundred or more dns lookups
and a hundred tcp streams, the vast majority of which are so short as
to not get out of slow start.

Now - seeing/measuring/graphing that - is *hard* - which is why it is
so rarely done. Because it's hard, but accurately measures the real
world, says it should be done.

However, I can see leveraging the clean 10Mbit trace or a (better)
asymmetric 24/5.5 case, and while pounding it with the existing,
simple code for 1 full rate up, 1 full rate down, and a CIR stream for
voice - impacting that plot with chrome web page benchmark or
something similar.

Indirectly observing the web load effects on that graph, while timing
web page completion, would be good, when comparing pfifo_fast and
various aqm variants.

>> Also, I know what ICMP is, but the UDP variants are new to me. Could
>> you please expand the "EF", "BK", "BE", and "CSS" acronyms?
>
> The UDP ping times are simply roundtrips/second (as measured by netperf)
> converted to ping times. The acronyms are diffserv markings, i.e.
> EF=expedited forwarding, BK=bulk (CS1 marking), BE=best effort (no
> marking).

The classification tests are in there for a number of reasons.

0) I needed multiple streams in the test anyway.

1) Many people keep insisting that classification can work. It
doesn't. It never has. Not over the wild and wooly internet. It only
rarely does any good at all even on internal networks. It sometimes
works on some kinds of udp streams, but that's it. The bulk of the
problem is the massive packet streams modern offloads generate, and
breaking those up, everywhere possible, any time possible.

I had put up a graph last week, that showed each classification bucket
for a tcp stream being totally ignored...

2) Theoretically wireless 802.11e SHOULD respect classification. In
fact, it does, on the ath9k, to a large extent. However, on the iwl I
have, BE, BK traffic get completely starved by VO, and VI traffic,
which is something of a bug. I'm certain that due to inadequate
testing, 802.11e classification is largely broken in the field, and
I'd hoped this test would bring that out to more people.

3) I don't mind an effort to make classification work, particularly
for traffic clearly marked background, such as bittorrent often is.
Perhaps this is an opportunity to get IPv6 done up right, as it seems
the diffserv bits are much more rarely fiddled with in transit.

> The UDP ping tests tend to not work so well on a loaded link,
> however, since netperf stops sending packets after detecting
> (excessive(?)) loss. Which is why you only see the UDP ping times on
> the first part of the graph.

Netperf stops UDP_STREAM exchanges after the first lost udp packet.
This is not helpful.

I keep noting that the next phase of the rrul development is to find a
good pair of CIR one way measurements that look a bit like voip.
Either that test can get added to netperf or we use another tool, or
we create one, and I keep hoping for recommendations from various
people on this list. Come on, something like this
exists? Anybody?

Another reason for a UDP based voip-like ping test is that icmp is
frequently handled differently than other sorts of streams.

A TCP based ping test used to be in there (and should go back) as it
shows the impact of packet loss on TCP behavior. (that said, the
TCP_RR test is roughly equivalent)

After staring at the tons of data collected over the past year, on
wifi, I'm willing to strongly suggest we just drop TCP packets after
500ms in the wifi stack, period, as that exceeds the round trip
timeout...

> The markings are also used on the TCP flows, as seen in the legend for
> the up/downloads.
>
>> All sessions were started at T+5, then?
>
> The pings start right away, the transfers start at T+5 seconds. Looks
> like the first ~five seconds of transfer is being cut off on those
> graphs.

Ramping up to 10K packets is silly at gigE, and looks like an outlier.

> I think what happens is that one of the streams (the turquoise
> one) starts up faster than the other ones, consuming all the bandwidth
> for the first couple of seconds until they adjust to the same level.

I'm not willing to draw this conclusion from this graph, and need
to/would like someone else to/ set up a test in a controlled
environment. The wrapper scripts
can dump the raw data and I can manually plot using gnuplot or a
spreadsheet, but it's tedious...

> These initial values are then scaled off the graph as outlier values.

Huge need for cdf plots and to present the outliers. In fact I'd like
graphs that just presented the outliers. Another way to approach it
would be, instead of creating static graphs, to use something like the
ds3.js and incorporate the ability to zoom
in, around, and so on, on multiple data sets. Or leverage mlab's tools.

I am no better at javascript than python.

> If
> you zoom in on the beginning of the graph you can see the turquoise line
> coming down from far off the scale in one direction, while the rest come
> from off the bottom.

Not willing to draw any conclusions. I am.

>> Please see attached for update including .git directory.
>
> I got a little lost in all the lists of SFQ, but other than that I found
> it quite readable. The diagrams of the queuing algorithms are a tad big,
> though, I think. :)

I would like to take some serious time to make them better. I'm
graphically hopeless, however I know what I like, and a picture does
tell a thousand words.

>
> When is the article going to be published?

Well, jon strongly indicated he'd take an article, and I told him that
once I found a theme, co-authors, and time, I'd talk to him again. We
seem to be making rapid progress due to paul stepping up and your
graphing tools.

So as for publication: when it's done, would be my guess! I would like
this to be the best presentation, possible, and also address some FUD
spread by the recent Cisco PIE presentation.

That said, I do feel the need for formal publication in a dead-tree
journal somewhere, which could talk to some of the interesting stuff
like beating tcp global synchronization (finally), and the RTT info,
and maybe also explore the few known flaws of fq_codel...

-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
_______________________________________________
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel

------=_20121124113641000000_73103
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

------=_20121124113641000000_73103--
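
A minimal sketch of the "head-marking" idea from the reply above (order everything by packet-entry timestamp, and mark the packet being transmitted if anything is queued behind it). This is only an illustration under stated assumptions: the `Packet` and `HeadMarkingQueue` names and the boolean `ecn_ce` field are hypothetical, a single egress point is assumed, and none of it is taken from fq_codel or any existing qdisc.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Packet:
    flow: str
    entry_time: float      # timestamp taken when the packet entered the queue
    ecn_ce: bool = False   # hypothetical stand-in for the ECN congestion mark


class HeadMarkingQueue:
    """Serve packets in entry-timestamp order and mark at the head."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal timestamps

    def enqueue(self, pkt):
        # Keyed on entry_time, so the oldest packet is always sent next,
        # whichever flow or sub-queue it came from.
        heapq.heappush(self._heap, (pkt.entry_time, next(self._seq), pkt))

    def dequeue(self):
        if not self._heap:
            return None
        _, _, pkt = heapq.heappop(self._heap)
        # Head-marking: signal congestion on the packet being transmitted
        # whenever at least one packet is still queued behind it.
        if self._heap:
            pkt.ecn_ce = True
        return pkt


if __name__ == "__main__":
    q = HeadMarkingQueue()
    for t, flow in [(0.2, "b"), (0.1, "a"), (0.3, "c")]:
        q.enqueue(Packet(flow, t))
    while (p := q.dequeue()) is not None:
        print(p.flow, p.entry_time, p.ecn_ce)
```

Marking at the head rather than at the tail means the congestion signal rides on the packet that leaves soonest, which is the shorter control loop the reply argues for. A real implementation would also need a policy for senders that do not negotiate ECN, which this sketch leaves out.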