[Cerowrt-devel] FQ_Codel lwn draft article review

dpreed at reed.com dpreed at reed.com
Sat Nov 24 11:36:41 EST 2012

All the points below make sense. Ideally you want to measure the TCP FQ Codel interaction in the "real world". Throughput benchmarks are irrelevant, the equivalent of Hot Rod amateur dragstrip competitions among cars that cannot even turn corners.
Beyond being hard, there is no "agreed upon" standard for testing "real world" performance - which is why academics who care little about anything other than publishing go for the "Hot Rod" stuff.
In your lwn posting, I think it is worth pointing out that "wrongheaded benchmarks"  were exactly what drove the folks who created the bufferbloat problem in the first place.  And those people are still alive and kicking (in the wrong direction).  But that's how you get tenure.
The other issue is "KISS".   I would *seriously* suggest that the idea of "classification" not get too entangled with the problem at this point.
Classification has many downsides, most of which will just confuse the inventors, adding what is probably an unnecessarily complex space of design alternatives.  If you must discuss classification (which is another academic wet dream), discuss it as "future research".
Two classes (latency critical, and latency as short as possible) should be enough in a network that for "control loop" reasons wants to have minimal control latencies *all of the time*.  I'm not sure that two is the desired state - I tend to think 1 class is better on an end-to-end basis.
If you want to stabilize things with faster control loops, just order all queues by "packet entry" timestamps, and move ECN-style marking towards "head-marking" - that is, signaling congestion in the packets that are being transmitted whenever any packets are queued behind them.
That creates the most responsive control loops possible on an end-to-end basis for TCP and other congestion-managing protocols.
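
A minimal sketch of what head-marking could look like, assuming a single FIFO whose arrivals carry non-decreasing entry timestamps. The names Packet and HeadMarkingQueue are made up for illustration and do not come from any existing qdisc:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    flow: str
    entry_time: float          # timestamp taken when the packet was enqueued
    ecn_marked: bool = False


class HeadMarkingQueue:
    """Hypothetical FIFO ordered by entry timestamp, marking at dequeue time."""

    def __init__(self):
        self._q = deque()

    def enqueue(self, pkt):
        # Assumption: arrivals carry non-decreasing timestamps, so appending
        # keeps the queue ordered by entry time.
        self._q.append(pkt)

    def dequeue(self):
        if not self._q:
            return None
        pkt = self._q.popleft()
        # Head marking: signal congestion on the packet leaving right now
        # if anything is still waiting behind it.
        if self._q:
            pkt.ecn_marked = True
        return pkt
```

The mark rides on the packet that reaches the receiver soonest, rather than on a packet at the tail that still has to wait out the whole queue, which is what shortens the control loop.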
-----Original Message-----
From: "Dave Taht" <dave.taht at gmail.com>
Sent: Saturday, November 24, 2012 11:19am
To: "Toke Høiland-Jørgensen" <toke at toke.dk>
Cc: "Paolo Valente" <paolo.valente at unimore.it>, "Eric Raymond" <esr at thyrsus.com>, codel at lists.bufferbloat.net, cerowrt-devel at lists.bufferbloat.net, "bloat" <bloat at lists.bufferbloat.net>, paulmck at linux.vnet.ibm.com, "David Woodhouse" <dwmw2 at infradead.org>, "John Crispin" <blogic at openwrt.org>
Subject: Re: [Cerowrt-devel] FQ_Codel lwn draft article review

On Sat, Nov 24, 2012 at 1:07 AM, Toke Høiland-Jørgensen <toke at toke.dk> wrote:
> "Paul E. McKenney" <paulmck at linux.vnet.ibm.com> writes:
>> I am using these two in a new "Effectiveness of FQ-CoDel" section.
>> Chrome can display .svg, and if it becomes a problem, I am sure that
>> they can be converted.  Please let me know if some other data would
>> make the point better.

My primary focus has been on making the kind of internet that over a
billion people have, the kind with <10Mbit uplinks, function better.
While it's nice to show an improvement on 100Mbit, gigE and higher, I'd
rather talk to the 10Mbit and below cases whenever possible.

> If you are just trying to show the "ideal" effectiveness of fq_codel,
> two attached graphs are from some old tests we did at the UDS showing a
> simple ethernet link between two laptops with a single stream going in
> each direction. This is of course by no means a real-world test, but on
> the other hand they show a very visible factor ~4 improvement in
> latency.
> These are the same graphs Dave used in his slides, but also in a 100mbit
> version.

As noted above, 10Mbit is better to show. Secondly, in looking over
the 10Mbit graph, I realized that we could also keep injecting new
TCP flows every 5 seconds, for shorter periods, to observe
what happens.
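
A rough sketch of such a schedule, building one short netperf TCP_STREAM run every 5 seconds. The 2 second duration, the placeholder host address, and the function name staggered_flows are assumptions for illustration; nothing here launches anything, it only builds the command lines:

```python
def staggered_flows(total_seconds, interval=5, duration=2, host="192.0.2.1"):
    """Return (start_offset, argv) pairs, one short TCP_STREAM run per interval.

    host is a documentation-range placeholder; duration is an assumed value
    for the "shorter periods".
    """
    plan = []
    for start in range(0, total_seconds, interval):
        # -H: remote host, -t: test type, -l: test length in seconds
        argv = ["netperf", "-H", host, "-t", "TCP_STREAM", "-l", str(duration)]
        plan.append((start, argv))
    return plan
```

A wrapper could sleep until each start_offset and hand the argv to subprocess, alongside the long-running streams that are already in the test.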

And more importantly, I'd like to avoid falling into the trap that so
much network research falls into, which is blithely benchmarking lots
of long duration TCP traffic, rather than the kinds of network traffic
we actually see in the real world. A real world web page might have a
hundred or more DNS lookups and a hundred TCP streams, the vast
majority of which are so short as to not get out of slow start.
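
As a back-of-envelope illustration of how short those flows are, here is a sketch that counts the round trips a lossless flow needs while its window doubles every RTT. The 1460 byte MSS and the initial window of 10 segments are assumptions, not measurements from these tests:

```python
def slow_start_rtts(object_bytes, mss=1460, initial_window=10):
    """Round trips needed to deliver object_bytes while cwnd doubles each RTT.

    Assumes no loss, no delayed-ACK effects, and ignores the handshake.
    """
    segments = -(-object_bytes // mss)   # ceiling division
    cwnd, sent, rtts = initial_window, 0, 0
    while sent < segments:
        sent += cwnd      # one window's worth goes out per round trip
        cwnd *= 2         # slow start: window doubles every RTT
        rtts += 1
    return rtts
```

Under those assumptions a 30,000 byte object is done in 2 round trips and a 100,000 byte object in 3, so such flows finish before congestion avoidance ever comes into play.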

Now - seeing/measuring/graphing that - is *hard* - which is why it is
so rarely done. But the fact that it accurately measures the real
world says it should be done anyway.

However, I can see leveraging the clean 10Mbit trace or a (better)
asymmetric 24/5.5 case: while pounding it with the existing,
simple code for 1 full rate up, 1 full rate down, and a CIR stream for
voice, perturb that plot with the chrome web page benchmark or
something similar.

Indirectly observing the web load effects on that graph, while timing
web page completion, would be good, when comparing pfifo_fast and
various aqm variants.

>> Also, I know what ICMP is, but the UDP variants are new to me.  Could
>> you please expand the "EF", "BK", "BE", and "CSS" acronyms?
> The UDP ping times are simply roundtrips/second (as measured by netperf)
> converted to ping times. The acronyms are diffserv markings, i.e.
> EF=expedited forwarding, BK=bulk (CS1 marking), BE=best effort (no
> marking).

The classification tests are in there for a number of reasons.

0) I needed multiple streams in the test anyway.

1) Many people keep insisting that classification can work. It
doesn't. It never has. Not over the wild and wooly internet. It only
rarely does any good at all even on internal networks. It sometimes
works on some kinds of udp streams, but that's it. The bulk of the
problem is the massive packet streams modern offloads generate, and
breaking those up, everywhere possible, any time possible.

I had put up a graph last week, that showed each classification bucket
for a tcp stream being totally ignored...

2) Theoretically wireless 802.11e SHOULD respect classification. In
fact, it does, on the ath9k, to a large extent. However, on the iwl I
have, BE and BK traffic get completely starved by VO and VI traffic,
which is something of a bug. I'm certain that due to inadequate
testing, 802.11e classification is largely broken in the field, and
I'd hoped this test would bring that out to more people.

3) I don't mind an effort to make classification work, particularly
for traffic clearly marked background, such as bittorrent often is.
Perhaps this is an opportunity to get IPv6 done up right, as it seems
the diffserv bits are much more rarely fiddled with in transit.

> The UDP ping tests tend to not work so well on a loaded link,
> however, since netperf stops sending packets after detecting
> (excessive(?)) loss. Which is why you only see the UDP ping times on
> the first part of the graph.

Netperf stops UDP_RR exchanges after the first lost udp packet.
This is not helpful.

I keep noting that the next phase of the rrul development is to find a
good pair of CIR one way measurements that look a bit like voip.
Either that test can get added to netperf or we use another tool, or
we create one, and I keep hoping for recommendations from various
people on this list. Come on, something like this
exists? Anybody?

Another reason for a UDP based voip-like ping test is that icmp is
frequently handled differently than other sorts of streams.

A TCP based ping test used to be in there (and should go back) as it
shows the impact of packet loss on TCP behavior. (that said, the
TCP_RR test is roughly equivalent)

After staring at the tons of data collected over the past year, on
wifi, I'm willing to strongly suggest we just drop TCP packets after
500ms in the wifi stack, period, as that exceeds the round trip

> The markings are also used on the TCP flows, as seen in the legend for
> the up/downloads.
>> All sessions were started at T+5, then?
> The pings start right away, the transfers start at T+5 seconds. Looks
> like the first ~five seconds of transfer is being cut off on those
> graphs.

Ramping up to 10K packets is silly at gigE, and looks like an outlier.

> I think what happens is that one of the streams (the turquoise
> one) starts up faster than the other ones, consuming all the bandwidth
> for the first couple of seconds until they adjust to the same level.

I'm not willing to draw this conclusion from this graph, and need
to (or would like someone else to) set up a test in a controlled
environment. The wrapper scripts can dump the raw data and I can
manually plot using gnuplot or a spreadsheet, but it's tedious...

> These initial values are then scaled off the graph as outlier values.

Huge need for cdf plots and to present the outliers. In fact I'd like
graphs that just presented the outliers. Another way to approach it
would be, instead of creating static graphs, to use something like
d3.js and incorporate the ability to zoom in, around, and so on, on
multiple data sets. Or leverage mlab's tools.
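
A minimal sketch of the data side of that, assuming the raw latency samples are already in a plain list. The function names and the nearest-rank quantile choice are illustrative, not taken from the existing wrapper scripts:

```python
import math


def empirical_cdf(samples):
    """Return (value, cumulative fraction) pairs for a list of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def outliers_above(samples, fraction=0.99):
    """Samples strictly above the given quantile (nearest-rank method)."""
    ordered = sorted(samples)
    if not ordered:
        return []
    rank = max(1, math.ceil(fraction * len(ordered)))
    threshold = ordered[rank - 1]
    return [v for v in ordered if v > threshold]
```

The pairs from empirical_cdf can go straight to gnuplot or a spreadsheet, and outliers_above gives the set that a static graph would otherwise scale off the top.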

I am no better at javascript than python.

> If
> you zoom in on the beginning of the graph you can see the turquoise line
> coming down from far off the scale in one direction, while the rest come
> from off the bottom.

I am not willing to draw any conclusions.

>> Please see attached for update including .git directory.
> I got a little lost in all the lists of SFQ, but other than that I found
> it quite readable. The diagrams of the queuing algorithms are a tad big,
> though, I think. :)

I would like to take some serious time to make them better. I'm
graphically hopeless, however I know what I like, and a picture does
tell a thousand words.

> When is the article going to be published?

Well, Jon strongly indicated he'd take an article, and I told him that
once I found a theme, co-authors, and time, I'd talk to him again. We
seem to be making rapid progress due to Paul stepping up and your
graphing tools.

So as for publication: when it's done, would be my guess! I would like
this to be the best presentation possible, and also to address some FUD
spread by the recent Cisco PIE presentation.

That said, I do feel the need for formal publication in a dead-tree
journal somewhere, which could talk to some of the interesting stuff
like beating tcp global synchronization (finally), and the RTT info,
and maybe also explore the few known flaws of fq_codel...

Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
Cerowrt-devel mailing list
Cerowrt-devel at lists.bufferbloat.net