[Cerowrt-devel] FQ_Codel lwn draft article review

Sat Nov 24 11:19:37 EST 2012

On Sat, Nov 24, 2012 at 1:07 AM, Toke Høiland-Jørgensen <toke at toke.dk> wrote:
> "Paul E. McKenney" <paulmck at linux.vnet.ibm.com> writes:
>
>> I am using these two in a new "Effectiveness of FQ-CoDel" section.
>> Chrome can display .svg, and if it becomes a problem, I am sure that
>> they can be converted.  Please let me know if some other data would
>> make the point better.

My primary focus has been on making the kind of internet that over a
billion people have, the kind with <10Mbit uplinks, function better.
While it's nice to show an improvement on 100Mbit, gigE and higher, I'd
rather talk to the 10Mbit and below cases whenever possible.

>
> If you are just trying to show the "ideal" effectiveness of fq_codel,
> two attached graphs are from some old tests we did at the UDS showing a
> simple ethernet link between two laptops with a single stream going in
> each direction. This is of course by no means a real-world test, but on
> the other hand they show a very visible factor ~4 improvement in
> latency.
>
> These are the same graphs Dave used in his slides, but also in a 100mbit
> version.

As noted above, 10Mbit is better to show. Secondly, in looking over
the 10Mbit graph, I realized that we could also keep injecting new
TCP streams every 5 seconds, for shorter periods, to observe what
happens.

And more importantly, I'd like to avoid falling into the trap that so
much network research falls into, which is blithely benchmarking lots
of long duration TCP traffic, rather than the kinds of network traffic
we actually see in the real world. A real world web page might have a
hundred or more DNS lookups and a hundred TCP streams, the vast
majority of which are so short as to not get out of slow start.
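
To put rough numbers on "so short as to not get out of slow start",
here is a minimal sketch. It assumes an idealized slow start with no
loss, a 1460 byte MSS, an initial window of 10 segments, and a window
that doubles every round trip; the function name and the object sizes
are illustrative, not measurements.

```python
import math

def slow_start_rtts(size_bytes, mss=1460, initial_window=10):
    """Round trips needed to deliver size_bytes under idealized slow start.

    Assumptions (not from any measurement): no loss, cwnd doubles every
    RTT, and the connection handshake is not counted.
    """
    segments = math.ceil(size_bytes / mss)
    cwnd, sent, rtts = initial_window, 0, 0
    while sent < segments:
        sent += cwnd      # one window's worth of segments per round trip
        cwnd *= 2         # exponential growth while still in slow start
        rtts += 1
    return rtts

if __name__ == "__main__":
    for size in (2_000, 14_600, 50_000, 1_000_000):
        print(f"{size:>9} bytes: {slow_start_rtts(size)} round trips")
```

Under these assumptions a typical small web object finishes in one to
three round trips, long before a bulk-transfer benchmark would reach
steady state.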

Now - seeing/measuring/graphing that - is *hard* - which is why it is
so rarely done. But because it accurately measures the real world, it
should be done, hard or not.

However, I can see leveraging the clean 10Mbit trace or a (better)
asymmetric 24/5.5 case, and while pounding it with the existing,
simple code for 1 full rate up, 1 full rate down, and a CIR stream for
voice - impacting that plot with chrome web page benchmark or
something similar.

Indirectly observing the web load effects on that graph, while timing
web page completion, would be good, when comparing pfifo_fast and
various aqm variants.

>> Also, I know what ICMP is, but the UDP variants are new to me.  Could
>> you please expand the "EF", "BK", "BE", and "CSS" acronyms?
>
> The UDP ping times are simply roundtrips/second (as measured by netperf)
> converted to ping times. The acronyms are diffserv markings, i.e.
> EF=expedited forwarding, BK=bulk (CS1 marking), BE=best effort (no
> marking).
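
As an illustration of the conversion and the markings described above,
a minimal sketch follows. The DSCP values are the standard code points
for EF, CS1 and best effort; the function names are made up for this
example, and the conversion assumes one request/response transaction
outstanding at a time.

```python
# Standard DSCP code points for the markings named above.
DSCP = {"EF": 46, "BK": 8, "BE": 0}   # BK is the CS1 marking

def tos_byte(dscp):
    """DSCP occupies the upper six bits of the IP TOS / traffic class byte."""
    return dscp << 2

def rr_to_ping_ms(transactions_per_second):
    """Convert a request/response rate into an equivalent ping time.

    Assumes a single transaction in flight, so each transaction takes
    one round trip.
    """
    if transactions_per_second <= 0:
        raise ValueError("transaction rate must be positive")
    return 1000.0 / transactions_per_second

if __name__ == "__main__":
    for name, dscp in DSCP.items():
        print(f"{name}: DSCP {dscp}, TOS byte 0x{tos_byte(dscp):02x}")
    print(f"500 transactions/s -> {rr_to_ping_ms(500):.1f} ms")
```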

The classification tests are in there for a number of reasons.

0) I needed multiple streams in the test anyway.

1) Many people keep insisting that classification can work. It
doesn't. It never has. Not over the wild and wooly internet. It only
rarely does any good at all even on internal networks. It sometimes
works on some kinds of udp streams, but that's it. The bulk of the
problem is the massive packet streams modern offloads generate, and
breaking those up, everywhere possible, any time possible.

I had put up a graph last week, that showed each classification bucket
for a tcp stream being totally ignored...

2) Theoretically wireless 802.11e SHOULD respect classification. In
fact, it does, on the ath9k, to a large extent. However, on the iwl I
have, BE and BK traffic get completely starved by VO and VI traffic,
which is something of a bug. I'm certain that due to inadequate
testing, 802.11e classification is largely broken in the field, and
I'd hoped this test would bring that out to more people.
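
For reference, a minimal sketch of how a marking can end up in one of
the four 802.11e queues. The user priority to access category table is
the usual WMM mapping; deriving the user priority from the top three
bits of the DSCP is an assumption about a common default, and
individual drivers may do something different.

```python
# Usual WMM mapping from 802.1D user priority (0-7) to access category.
UP_TO_AC = {1: "BK", 2: "BK", 0: "BE", 3: "BE",
            4: "VI", 5: "VI", 6: "VO", 7: "VO"}

def access_category(dscp):
    """Map a 6-bit DSCP to an access category.

    Assumption: user priority is taken from the top three DSCP bits
    (the old IP precedence field).
    """
    user_priority = (dscp >> 3) & 0x7
    return UP_TO_AC[user_priority]

if __name__ == "__main__":
    for name, dscp in (("EF", 46), ("CS1", 8), ("BE", 0)):
        print(f"{name} (DSCP {dscp}) -> {access_category(dscp)}")
```

Note that under this assumed mapping EF lands in VI rather than VO.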

3) I don't mind an effort to make classification work, particularly
for traffic clearly marked background, such as bittorrent often is.
Perhaps this is an opportunity to get IPv6 done up right, as it seems
the diffserv bits are much more rarely fiddled with in transit.

> The UDP ping tests tend to not work so well on a loaded link,
> however, since netperf stops sending packets after detecting
> (excessive(?)) loss. Which is why you only see the UDP ping times on
> the first part of the graph.

Netperf stops UDP_RR exchanges after the first lost UDP packet, as
nothing retransmits the lost request or response. This is not helpful.

I keep noting that the next phase of the rrul development is to find a
good pair of CIR one way measurements that look a bit like voip.
Either that test can get added to netperf or we use another tool, or
we create one, and I keep hoping for recommendations from various
people on this list. Come on, something like this exists? Anybody?

Another reason for a UDP based voip-like ping test is that icmp is
frequently handled differently than other sorts of streams.
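
To make the idea of a voip-like CIR measurement concrete, here is a
minimal sketch of the bookkeeping only, with no sockets. The 20 ms
packet interval is an assumption (a common voice packetization
interval), the names are hypothetical, and one way delay is only
meaningful if the sender and receiver clocks are synchronized.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    seq: int
    sent: float       # seconds, sender clock
    received: float   # seconds, receiver clock (assumed synchronized)

def send_schedule(duration_s, interval_s=0.020):
    """Send times for a constant-rate stream, one packet per interval."""
    count = int(round(duration_s / interval_s))
    return [i * interval_s for i in range(count)]

def summarize(probes, packets_sent):
    """Loss fraction and one way delay range; a lost packet does not
    stop the measurement, it is simply counted."""
    received = len({p.seq for p in probes})
    delays_ms = [(p.received - p.sent) * 1000.0 for p in probes]
    return {
        "loss_fraction": (packets_sent - received) / packets_sent,
        "min_delay_ms": min(delays_ms) if delays_ms else None,
        "max_delay_ms": max(delays_ms) if delays_ms else None,
    }

if __name__ == "__main__":
    times = send_schedule(1.0)
    # Pretend packet 3 was lost and the rest arrived 5 ms later.
    probes = [Probe(i, t, t + 0.005) for i, t in enumerate(times) if i != 3]
    print(summarize(probes, len(times)))
```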

A TCP based ping test used to be in there (and should go back) as it
shows the impact of packet loss on TCP behavior. (that said, the
TCP_RR test is roughly equivalent)

After staring at the tons of data collected over the past year, on
wifi, I'm willing to strongly suggest we just drop TCP packets after
500ms in the wifi stack, period, as that exceeds the round trip
timeout...
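
The policy suggested above amounts to an age limit applied at dequeue
time. A minimal sketch, assuming packets are timestamped on enqueue;
the class is hypothetical, is not code from any wifi stack, and does
not distinguish TCP from other traffic.

```python
from collections import deque

class AgeLimitedQueue:
    """FIFO that discards packets older than max_age_s when dequeuing."""

    def __init__(self, max_age_s=0.5):
        self.max_age_s = max_age_s
        self._queue = deque()
        self.dropped = 0

    def enqueue(self, packet, now):
        self._queue.append((now, packet))   # remember when it was queued

    def dequeue(self, now):
        while self._queue:
            enqueued_at, packet = self._queue.popleft()
            if now - enqueued_at <= self.max_age_s:
                return packet
            self.dropped += 1               # too stale to be worth sending
        return None

if __name__ == "__main__":
    q = AgeLimitedQueue()
    q.enqueue("a", now=0.0)
    q.enqueue("b", now=0.4)
    print(q.dequeue(now=0.6), q.dropped)
```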

> The markings are also used on the TCP flows, as seen in the legend for
> the up/downloads.
>
>> All sessions were started at T+5, then?
>
> The pings start right away, the transfers start at T+5 seconds. Looks
> like the first ~five seconds of transfer is being cut off on those
> graphs.

Ramping up to 10K packets is silly at gigE, and looks like an outlier.

> I think what happens is that one of the streams (the turquoise
> one) starts up faster than the other ones, consuming all the bandwidth
> for the first couple of seconds until they adjust to the same level.

I'm not willing to draw this conclusion from this graph, and need
to (or would like someone else to) set up a test in a controlled
environment. The wrapper scripts can dump the raw data and I can
manually plot using gnuplot or a spreadsheet, but it's tedious...

> These initial values are then scaled off the graph as outlier values.

Huge need for CDF plots and to present the outliers. In fact I'd like
graphs that just presented the outliers. Another way to approach it
would be, instead of creating static graphs, to use something like
d3.js and incorporate the ability to zoom in, around, and so on, on
multiple data sets. Or leverage mlab's tools.
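
A minimal sketch of the data side of that, assuming the raw samples
are available as a plain list of latencies. The function names, the
nearest-rank percentile, and the sample values are illustrative
choices, not part of the existing wrapper scripts.

```python
import math

def empirical_cdf(samples):
    """Return (value, fraction of samples <= value) pairs, sorted by value."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    xs = sorted(samples)
    rank = max(1, math.ceil(p * len(xs) / 100))
    return xs[rank - 1]

def outliers(samples, p=99):
    """Samples strictly above the p-th percentile, in original order."""
    cutoff = percentile(samples, p)
    return [x for x in samples if x > cutoff]

if __name__ == "__main__":
    latencies_ms = [10, 12, 11, 13, 250, 12, 11, 10, 12, 11]  # made-up data
    for value, fraction in empirical_cdf(latencies_ms):
        print(f"{value:>5} ms  {fraction:.1f}")
    print("outliers above p90:", outliers(latencies_ms, p=90))
```

Plotting the pairs from empirical_cdf keeps the tail on the graph
instead of scaling it away, and the outlier list can be graphed on its
own.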

I am no better at javascript than python.

> If you zoom in on the beginning of the graph you can see the
> turquoise line coming down from far off the scale in one direction,
> while the rest come from off the bottom.

I am not willing to draw any conclusions.

>> Please see attached for update including .git directory.
>
> I got a little lost in all the lists of SFQ, but other than that I found
> it quite readable. The diagrams of the queuing algorithms are a tad big,
> though, I think. :)

I would like to take some serious time to make them better. I'm
graphically hopeless, however I know what I like, and a picture does
tell a thousand words.

>
> When is the article going to be published?

Well, Jon strongly indicated he'd take an article, and I told him that
once I found a theme, co-authors, and time, I'd talk to him again. We
seem to be making rapid progress due to Paul stepping up and your
graphing tools.

So as for publication: when it's done, would be my guess! I would like
this to be the best presentation possible, and also address some FUD
spread by the recent Cisco PIE presentation.

That said, I do feel the need for formal publication in a dead-tree
journal somewhere, which could talk to some of the interesting stuff
like beating tcp global synchronization (finally), and the RTT info,
and maybe also explore the few known flaws of fq_codel...

-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html