[Codel] Describing fq_codel

Thu Feb 6 06:23:03 PST 2014

I recently had occasion to describe the scheduling mechanism in fq_codel
in plain text. I thought this might be useful to have for general
reference somewhere, so I'm posting it here to elicit comments and have
a plan to post it to the bufferbloat wiki somewhere afterwards.

See below.

-Toke

FQ_CODEL

The fq_codel queueing discipline in Linux is an implementation of a
somewhat modified stochastic fairness queueing algorithm, with CoDel
added as an AQM for the individual queues. As such, it consists of two
parts: the scheduler which selects which queue to dequeue a packet from,
and the CoDel AQM which works on each of the queues built in to the
qdisc. The subtleties of fq_codel are mostly in the scheduling part,
which is the subject of this description. For a description of CoDel,
refer to Kathy and Van's paper[1].

The interaction between the scheduler and the CoDel algorithm are fairly
straight forward: At initiation (i.e. at boot, or when fq_codel is first
installed on the interface, or its parameters are changed), the list of
queues is initialised so that each queue has a separate set of CoDel
state variables. By default, 1024 queues are created, and packets are
hashed into them at enqueue time. Each queue maintains the CoDel state
variables throughout its lifetime, and so acts the same as the non-fq
CoDel variant would (including retaining the control law state when the
queue drains, etc).

As for the scheduler, it is different from a straight-forward conceptual
SFQ scheduler[2] in a number of respects:

- fq_codel is byte-based, employing a deficit round-robin mechanism[3]
  between queues. The quantum is configurable, but defaults to the
  interface MTU. This means that if one flow sends packets of size
  MTU/3, and another sends MTU-sized packets, the first flow will
  dequeue three packets each time it gets a turn, whereas the second
  flow only dequeues one. This is kept track of by maintaining a byte
  dequeue deficit for each queue, which is first initialised to the
  quantum value, and decreased by the packet size on each dequeue from
  the queue.

- Queue space is shared: there's a global limit on the number of packets
  the queues can hold, but not one per queue. If the space runs out at
  enqueue time, the queue with the largest number of *bytes* in it will
  get a packet dropped. This means that the packet being enqueued will
  pretty much never be dropped; rather a different packet is dropped,
  and not necessarily from the same queue. Packets are always dropped
  from the head of a queue.

- fq_codel maintains two lists of active queues, called "new" and "old"
  queues ("new" and "old" being the terms used in the code). When a
  packet is added to a queue that is not currently active, that queue is
  added to the list of new queues. This is the source of some subtlety
  in the packet scheduling at dequeue time, explained below.

- Most of fq_codel's scheduling is done at packet dequeue time. It
  consists of three parts: selecting a queue from which to dequeue a
  packet, actually dequeuing it (employing the CoDel algorithm in the
  process), and some final bookkeeping.

  For the first part, the scheduler first looks at the list of new
  queues; for each queue in that list, if that queue has a negative
  deficit (i.e. it has already dequeued a quantum of bytes (or more)),
  its deficit is increased by one quantum, and the queue is put onto the
  end of the list of old queues, and the routine starts again.
  Otherwise, that queue is selected for dequeue. If the list of new
  queues is empty, the scheduler proceeds down the list of old queues in
  the same fashion (checking the deficit, and either selecting the queue
  for dequeuing, or increasing the deficit and putting the queue back at
  the end of the list).

  After having selected a queue from which to dequeue a packet, the
  CoDel algorithm is invoked on that queue. This applies the CoDel
  control law in the usual fashion, and may discard one or more packets
  from the head of the queue, before returning the packet that should be
  dequeued (or nothing if the queue is or becomes empty while being
  handled by the CoDel algorithm).

  Finally, if the CoDel algorithm did not return a packet, the queue is
  empty, and the scheduler does one of two things: if the queue selected
  for dequeue came from the list of new queues, it is moved to the end
  of the list of old queues. If instead it came from the list of old
  queues, that queue is removed from the list, to be added back (as a
  new queue) the next time a packet arrives that hashes to that queue.
  Then, (since no packet was available for dequeue), the whole dequeue
  process is restarted from the beginning.

  If, instead, the scheduler *did* get a packet back from the CoDel
  algorithm, it updates the byte deficit for the selected queue before
  returning the packet to the lower layers of the kernel networking
  stack for sending.

The new/old queue distinction has a particular consequence for queues
that don't build up more than a quantum bytes before being visited by
the scheduler (as a new queue), *and* then stay empty until the
scheduler comes back around to it in the list of old queues: Such queues
will get removed from the list, and then re-added as a new queue each
time a packet arrives for it, and so will always get priority over
queues that do not empty out each round. Exactly how much data a flow
has to send to keep its queue in this state is somewhat difficult to
reason about, because it depends on both the egress link speed and the
number of concurrent flows. This makes it harder to reason about the
behaviour of fq_codel. However, in practice many things that are
beneficial to have prioritised for typical internet use (ACKs, DNS
lookups, interactive SSH, HTTP requests; and also ICMP pings) *tend* to
fall in this category, which is why fq_codel performs so well for many
practical applications.

For those interested in examining the behaviour of fq_codel in more
detail, the code can be found in your local Linux source tree, in
net/sched/sch_fq_codel.c[4]. While some of it can be somewhat of a
challenge to comprehend, it overall is a very instructive example of a
practical implementation of a queueing algorithm in a modern operating
system.

[1] http://queue.acm.org/detail.cfm?id=2209336

[2] For a description of SFQ, refer to your system man page `man tc-sfq`
or to the paper by Paul McKenney:
http://www2.rdrop.com/~paulmck/scalability/paper/sfq.2002.06.04.pdf

[3] See http://en.wikipedia.org/wiki/Deficit_round_robin

[4] Or online at
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq_codel.c
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 489 bytes
Desc: not available
URL: <https://lists.bufferbloat.net/pipermail/codel/attachments/20140206/b63d2372/attachment.pgp>