[Codel] [Bloat] Describing fq_codel
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Sun Feb 9 16:36:24 EST 2014
On Sun, Feb 09, 2014 at 09:09:38PM +0100, Jesper Dangaard Brouer wrote:
> (top post)
> Thank you Toke for this excellent description, it is really good!
> AFAIKR Paul also did a description of fq_codel, but with a focus on the
> SFQ part. I actually think, these two descriptions could be combined.
> Perhaps Paul can give us a link to his desc?
Please see attached for a tarball. I dropped this effort about a year
ago when it became apparent that I was working to conflicting sets
of requirements. This is not all that unusual, but I had not budgeted
sufficient time to resolve them. I believe that others (e.g., Dave Taht
and Jim Gettys, as well as Toke) have done more since.
> And where do you plan to put this excellent description (so its more
> accessible to mankind)?
> On Thu, 06 Feb 2014 15:23:03 +0100
> Toke Høiland-Jørgensen <toke at toke.dk> wrote:
> > I recently had occasion to describe the scheduling mechanism in fq_codel
> > in plain text. I thought this might be useful to have for general
> > reference somewhere, so I'm posting it here to elicit comments and have
> > a plan to post it to the bufferbloat wiki somewhere afterwards.
> > See below.
> > -Toke
> > FQ_CODEL
> > The fq_codel queueing discipline in Linux is an implementation of a
> > somewhat modified stochastic fairness queueing algorithm, with CoDel
> > added as an AQM for the individual queues. As such, it consists of two
> > parts: the scheduler which selects which queue to dequeue a packet from,
> > and the CoDel AQM which works on each of the queues built in to the
> > qdisc. The subtleties of fq_codel are mostly in the scheduling part,
> > which is the subject of this description. For a description of CoDel,
> > refer to Kathy and Van's paper.
> > The interaction between the scheduler and the CoDel algorithm are fairly
> > straight forward: At initiation (i.e. at boot, or when fq_codel is first
> > installed on the interface, or its parameters are changed), the list of
> > queues is initialised so that each queue has a separate set of CoDel
> > state variables. By default, 1024 queues are created, and packets are
> > hashed into them at enqueue time. Each queue maintains the CoDel state
> > variables throughout its lifetime, and so acts the same as the non-fq
> > CoDel variant would (including retaining the control law state when the
> > queue drains, etc).
> > As for the scheduler, it is different from a straight-forward conceptual
> > SFQ scheduler in a number of respects:
> > - fq_codel is byte-based, employing a deficit round-robin mechanism
> > between queues. The quantum is configurable, but defaults to the
> > interface MTU. This means that if one flow sends packets of size
> > MTU/3, and another sends MTU-sized packets, the first flow will
> > dequeue three packets each time it gets a turn, whereas the second
> > flow only dequeues one. This is kept track of by maintaining a byte
> > dequeue deficit for each queue, which is first initialised to the
> > quantum value, and decreased by the packet size on each dequeue from
> > the queue.
> > - Queue space is shared: there's a global limit on the number of packets
> > the queues can hold, but not one per queue. If the space runs out at
> > enqueue time, the queue with the largest number of *bytes* in it will
> > get a packet dropped. This means that the packet being enqueued will
> > pretty much never be dropped; rather a different packet is dropped,
> > and not necessarily from the same queue. Packets are always dropped
> > from the head of a queue.
> > - fq_codel maintains two lists of active queues, called "new" and "old"
> > queues ("new" and "old" being the terms used in the code). When a
> > packet is added to a queue that is not currently active, that queue is
> > added to the list of new queues. This is the source of some subtlety
> > in the packet scheduling at dequeue time, explained below.
> > - Most of fq_codel's scheduling is done at packet dequeue time. It
> > consists of three parts: selecting a queue from which to dequeue a
> > packet, actually dequeuing it (employing the CoDel algorithm in the
> > process), and some final bookkeeping.
> > For the first part, the scheduler first looks at the list of new
> > queues; for each queue in that list, if that queue has a negative
> > deficit (i.e. it has already dequeued a quantum of bytes (or more)),
> > its deficit is increased by one quantum, and the queue is put onto the
> > end of the list of old queues, and the routine starts again.
> > Otherwise, that queue is selected for dequeue. If the list of new
> > queues is empty, the scheduler proceeds down the list of old queues in
> > the same fashion (checking the deficit, and either selecting the queue
> > for dequeuing, or increasing the deficit and putting the queue back at
> > the end of the list).
> > After having selected a queue from which to dequeue a packet, the
> > CoDel algorithm is invoked on that queue. This applies the CoDel
> > control law in the usual fashion, and may discard one or more packets
> > from the head of the queue, before returning the packet that should be
> > dequeued (or nothing if the queue is or becomes empty while being
> > handled by the CoDel algorithm).
> > Finally, if the CoDel algorithm did not return a packet, the queue is
> > empty, and the scheduler does one of two things: if the queue selected
> > for dequeue came from the list of new queues, it is moved to the end
> > of the list of old queues. If instead it came from the list of old
> > queues, that queue is removed from the list, to be added back (as a
> > new queue) the next time a packet arrives that hashes to that queue.
> > Then, (since no packet was available for dequeue), the whole dequeue
> > process is restarted from the beginning.
> > If, instead, the scheduler *did* get a packet back from the CoDel
> > algorithm, it updates the byte deficit for the selected queue before
> > returning the packet to the lower layers of the kernel networking
> > stack for sending.
> > The new/old queue distinction has a particular consequence for queues
> > that don't build up more than a quantum bytes before being visited by
> > the scheduler (as a new queue), *and* then stay empty until the
> > scheduler comes back around to it in the list of old queues: Such queues
> > will get removed from the list, and then re-added as a new queue each
> > time a packet arrives for it, and so will always get priority over
> > queues that do not empty out each round. Exactly how much data a flow
> > has to send to keep its queue in this state is somewhat difficult to
> > reason about, because it depends on both the egress link speed and the
> > number of concurrent flows. This makes it harder to reason about the
> > behaviour of fq_codel. However, in practice many things that are
> > beneficial to have prioritised for typical internet use (ACKs, DNS
> > lookups, interactive SSH, HTTP requests; and also ICMP pings) *tend* to
> > fall in this category, which is why fq_codel performs so well for many
> > practical applications.
> > For those interested in examining the behaviour of fq_codel in more
> > detail, the code can be found in your local Linux source tree, in
> > net/sched/sch_fq_codel.c. While some of it can be somewhat of a
> > challenge to comprehend, it overall is a very instructive example of a
> > practical implementation of a queueing algorithm in a modern operating
> > system.
> >  http://queue.acm.org/detail.cfm?id=2209336
> >  For a description of SFQ, refer to your system man page `man tc-sfq`
> > or to the paper by Paul McKenney:
> > http://www2.rdrop.com/~paulmck/scalability/paper/sfq.2002.06.04.pdf
> >  See http://en.wikipedia.org/wiki/Deficit_round_robin
> >  Or online at
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq_codel.c
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Sr. Network Kernel Developer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 428019 bytes
More information about the Codel