[Cerowrt-devel] FQ_Codel lwn draft article review

Paul E. McKenney paulmck at linux.vnet.ibm.com
Wed Nov 28 12:20:59 EST 2012

On Tue, Nov 27, 2012 at 05:03:02PM -0500, Jim Gettys wrote:
> Some points worth making:

OK, I have the pen again.  (Just when you all thought it was safe!)

> 1) It is important to point out that (and how) fq_codel avoids starvation:
> unpleasant as elephant flows are, it would be very unfriendly to never
> service them at all until they time out.

Fair point -- I had alluded to this, but not really explained it.

Interestingly enough, sch_fq_codel.c uses a cute trick to make this
happen.  When a low-bandwidth flow empties, it is added to the end
of the elephant list.  This guarantees that the list of low-bandwidth
flows eventually empties, which forces at least a partial scan of the
elephant list.

Of course, this also means that Andrew's bounds on jitter are a bit
optimistic.  It would be possible to make sch_fq_codel.c work the
way that Andrew said it does, for example, by keeping a pointer in
fq_codel_sched_data that references the next flow to access.  From what
I can see, the price is a slight increase in overhead of the common
dequeue case.

Of course, if there are only a small number of elephant flows, the
increased jitter will not be that large.

In the meantime, the draft LWN article now explains the approach used
in the Linux kernel and states its importance in avoiding starvation.

> 2) "fairness" is not necessarily what we ultimately want at all; you'd
> really like to penalize those who induce congestion the most.  But we don't
> currently have a solution (though Bob Briscoe at BT thinks he does, and is
> seeing if he can get it out from under a BT patent), so the current
> fq_codel round robins ultimately until/unless we can do something like
> Bob's idea.  This is a local information only subset of the ideas he's been
> working on in the congestion exposure (conex) group at the IETF.

This of course depends on the definition of fairness.  For example,
completely random dropping of packets can be considered to be fair on a
per-packet basis, but since the heavy flows that are inducing congestion
the most have the most packets, they will tend to be the most heavily
penalized.  Leaky-bucket schemes can also be considered to be fair,
but they also preferentially penalize heavy flows that exhaust their
supply of tokens.

And FQ-CoDel's definition of fairness can be thought of as a tradeoff
between latency and reliability on the one hand and bandwidth on the
other.  Low-bandwidth flows will tend to have low latency and low drop
rates, while elephant flows will suffer worse latency and drop rate,
but in exchange will enjoy higher bandwidths.

> 3) "fairness" is always in the eyes of the beholder (and should be left to
> the beholder to determine). "fairness" depends on where in the network you
> are.  While being "fair" among TCP flows is sensible default policy for a
> host, else where in the network it may not be/usually isn't.
> Two examples:
> o at a home router, you probably want to be "fair" according to transmit
> opportunities.  We really don't want a single system remote from the router
> to be able to starve the network so that devices near the router get much
> less bandwidth than you might hope/expect.
> What is more, you probably want to account for a single host using many
> flows, and regulate that they not be able to "hog" bandwidth in the home
> environment, but only use their "fair" share.
> o at an ISP, you must to be "fair" between customers; it is best to leave
> the judgement of "fairness" at finer granularity (e.g. host and TCP flows)
> to the points closer to the customer's systems, so that they can enforce
> whatever definition of "fair" they need to themselves.

Right now, the draft LWN article doesn't mention "fair" in connection with
FQ-CoDel.  Instead, I am adding a paragraph calling out the tradeoff
between high bandwidth on the one hand (for the elephants) and low latency
and low packet-loss rate on the other (for the light flows).

Fair enough?  ;-)

> Algorithms like fq_codel can be/should be adjusted to the circumstances.

For example, increasing the quantum for connections with less than
4Mbit/s bandwidth?

> And therefore exactly what you choose to hash against to form the buckets
> will vary depending on where you are.  That at least one step (at the
> user's device) of this be TCP flow "fair" does have the great advantage of
> helping the RTT unfairness problem that violates the principle of "least
> surprise", such as that routinely seen in places like New Zealand.
> This is why I have so many problems using the word "fair" near this
> algorithm.  "fair" is impossible to define, overloaded in people's mind
> with TCP fair queuing, not even desirable much of the time, and by
> definition and design, even today's fq_codel isn't fair to lots of things,
> and the same basic algorithm can/should be tweaked in lots of directions
> depending on what we need to do.  Calling this "smart" queuing or some such
> would be better.

I really don't think we need to be quite -that- worried about the word
"fair".  Most other performance-related words have similar problems.
For example:

o	"bandwidth": Is that 10Mbit/second continuous?  Or average over
	each second?  Minute?  Hour?  Day?  Or is it the maximum
	bandwidth you can hope for when travelling downhill with the
	wind at your back?

o	"efficiency":  Efficiency of exactly what desired output vs.
	exactly what inputs?  And is that average efficiency or (again)
	the best efficiency you can hope for when travelling downhill
	with the wind at your back?  Or measured efficiency under
	conditions of typical use?

o	"cost effectiveness":  Who is paying?  What are the alternatives
	and what are their costs?  How is cost measured, in money, time
	consumed, lives lost, or something else?  How are the costs and
	benefits spread over time?  Am I being asked to make a large and
	definite payment now for some hoped-for long-term benefit, or am
	I accepting an immediate benefit in return for some future cost
	that might accumulate to an arbitrarily large value over time?
	Or are the costs and benefits realized at roughly the same time?

Besides, if we carefully avoid all mention of the word "fair", the
question "but is it fair" will with very high probability come up in
the comments to the article.  Furthermore, if we avoid any word that
some marketing department somewhere has successfully abused, we won't
have any words left to use.  ;-)

> When you've done another round on the document, I'll do a more detailed
> read.

Sounds good!  I expect to have another version by the end of the weekend.

							Thanx, Paul

>                              - Jim
> On Fri, Nov 23, 2012 at 5:18 PM, Paul E. McKenney <
> paulmck at linux.vnet.ibm.com> wrote:
> > On Fri, Nov 23, 2012 at 09:57:34AM +0100, Dave Taht wrote:
> > > David Woodhouse and I fiddled a lot with adsl and openwrt and a
> > > variety of drivers and network layers in a typical bonded adsl stack
> > > yesterday. The complexity of it all makes my head hurt. I'm happy that
> > > a newly BQL'd ethernet driver (for the geos and qemu) emerged from it,
> > > which he submitted to netdev...
> >
> > Cool!!!  ;-)
> >
> > > I made a recording of us last night discussing the layers, which I
> > > will produce and distribute later...
> > >
> > > Anyway, along the way, we fiddled a lot with trying to analyze where
> > > the 350ms or so of added latency was coming from in the traverse geo's
> > > adsl implementation and overlying stack....
> > >
> > > Plots: http://david.woodhou.se/dwmw2-netperf-plots.tar.gz
> > >
> > > Note: 1:
> > >
> > > The  netperf sample rate on the rrul test needs to be higher than
> > > 100ms in order to get a decent result at sub 10Mbit speeds.
> > >
> > > Note 2:
> > >
> > > The two nicest graphs here are nofq.svg vs fq.svg, which were taken on
> > > a gigE link from a Mac running Linux to another gigE link. (in other
> > > words, NOT on the friggin adsl link) (firefox can display svg, I don't
> > > know what else) I find the T+10 delay before stream start in the
> > > fq.svg graph suspicious and think the "throw out the outlier" code in
> > > the netperf-wrapper code is at fault. Prior to that, codel is merely
> > > buffering up things madly, which can also be seen in the pfifo_fast
> > > behavior, with 1000pkts it's default.
> >
> > I am using these two in a new "Effectiveness of FQ-CoDel" section.
> > Chrome can display .svg, and if it becomes a problem, I am sure that
> > they can be converted.  Please let me know if some other data would
> > make the point better.
> >
> > I am assuming that the colored throughput spikes are due to occasional
> > packet losses.  Please let me know if this interpretation is overly naive.
> >
> > Also, I know what ICMP is, but the UDP variants are new to me.  Could
> > you please expand the "EF", "BK", "BE", and "CSS" acronyms?
> >
> > > (Arguably, the default queue length in codel can be reduced from 10k
> > > packets to something more reasonable at GigE speeds)
> > >
> > > (the indicator that it's the graph, not the reality, is that the
> > > fq.svg pings and udp start at T+5 and grow minimally, as is usual with
> > > fq_codel.)
> >
> > All sessions were started at T+5, then?
> >
> > > As for the *.ps graphs, well, they would take david's network topology
> > > to explain, and were conducted over a variety of circumstances,
> > > including wifi, with more variables in play than I care to think
> > > about.
> > >
> > > We didn't really get anywhere on digging deeper. As we got to purer
> > > tests - with a minimal number of boxes, running pure ethernet,
> > > switched over a couple of switches, even in the simplest two box case,
> > > my HTB based "ceroshaper" implementation had multiple problems in
> > > cutting median latencies below 100ms, on this very slow ADSL link.
> > > David suspects problems on the path along the carrier backbone as a
> > > potential issue, and the only way to measure that is with two one way
> > > trip time measurements (rather than rtt), time synced via ntp... I
> > > keep hoping to find a rtp test, but I'm open to just about any option
> > > at this point. anyone?
> > >
> > > We also found a probable bug in mtr in that multiple mtrs on the same
> > > box don't co-exist.
> >
> > I must confess that I am not seeing all that clear a difference between
> > the behaviors of ceroshaper and FQ-CoDel.  Maybe somewhat better latencies
> > for FQ-CoDel, but not unambiguously so.
> >
> > > Moving back to more scientific clarity and simpler tests...
> > >
> > > The two graphs, taken a few weeks back, on pages 5 and 6 of this:
> > >
> > >
> > http://www.teklibre.com/~d/bloat/Not_every_packet_is_sacred-Battling_Bufferbloat_on_wifi.pdf
> > >
> > > appear to show the advantage of fq_codel fq + codel + head drop over
> > > tail drop during the slow start period on a 10Mbit link - (see how
> > > squiggly slow start is on pfifo fast?) as well as the marvelous
> > > interstream latency that can be achieved with BQL=3000 (on a 10 mbit
> > > link.)  Even that latency can be halved by reducing BQL to 1500, which
> > > is just fine on a 10mbit. Below those rates I'd like to be rid of BQL
> > > entirely, and just have a single packet outstanding... in everything
> > > from adsl to cable...
> > >
> > > That said, I'd welcome other explanations of the squiggly slowstart
> > > pfifo_fast behavior before I put that explanation on the slide.... ECN
> > > was in play here, too. I can redo this test easily, it's basically
> > > running a netperf TCP_RR for 70 seconds, and starting up a TCP_MAERTS
> > > and TCP_STREAM for 60 seconds a T+5, after hammering down on BQL's
> > > limit and the link speeds on two sides of a directly connected laptop
> > > connection.
> >
> > I must defer to others on this one.  I do note the much lower latencies
> > on slide 6 compared to slide 5, though.
> >
> > Please see attached for update including .git directory.
> >
> >                                                         Thanx, Paul
> >
> > > ethtool -s eth0 advertise 0x002 # 10 Mbit
> > >
> >
> > _______________________________________________
> > Cerowrt-devel mailing list
> > Cerowrt-devel at lists.bufferbloat.net
> > https://lists.bufferbloat.net/listinfo/cerowrt-devel
> >
> >

More information about the Cerowrt-devel mailing list