On Sun, Feb 09, 2014 at 09:09:38PM +0100, Jesper Dangaard Brouer wrote: > > (top post) > > Thank you Toke for this excellent description, it is really good! > > AFAIKR Paul also did a description of fq_codel, but with a focus on the > SFQ part. I actually think, these two descriptions could be combined. > Perhaps Paul can give us a link to his desc? Please see attached for a tarball. I dropped this effort about a year ago when it became apparent that I was working to conflicting sets of requirements. This is not all that unusual, but I had not budgeted sufficient time to resolve them. I believe that others (e.g., Dave Taht and Jim Gettys, as well as Toke) have done more since. Thanx, Paul > And where do you plan to put this excellent description (so its more > accessible to mankind)? > > --Jesper > > > On Thu, 06 Feb 2014 15:23:03 +0100 > Toke Høiland-Jørgensen wrote: > > > I recently had occasion to describe the scheduling mechanism in fq_codel > > in plain text. I thought this might be useful to have for general > > reference somewhere, so I'm posting it here to elicit comments and have > > a plan to post it to the bufferbloat wiki somewhere afterwards. > > > > See below. > > > > -Toke > > > > > > FQ_CODEL > > > > The fq_codel queueing discipline in Linux is an implementation of a > > somewhat modified stochastic fairness queueing algorithm, with CoDel > > added as an AQM for the individual queues. As such, it consists of two > > parts: the scheduler which selects which queue to dequeue a packet from, > > and the CoDel AQM which works on each of the queues built in to the > > qdisc. The subtleties of fq_codel are mostly in the scheduling part, > > which is the subject of this description. For a description of CoDel, > > refer to Kathy and Van's paper[1]. > > > > The interaction between the scheduler and the CoDel algorithm are fairly > > straight forward: At initiation (i.e. at boot, or when fq_codel is first > > installed on the interface, or its parameters are changed), the list of > > queues is initialised so that each queue has a separate set of CoDel > > state variables. By default, 1024 queues are created, and packets are > > hashed into them at enqueue time. Each queue maintains the CoDel state > > variables throughout its lifetime, and so acts the same as the non-fq > > CoDel variant would (including retaining the control law state when the > > queue drains, etc). > > > > As for the scheduler, it is different from a straight-forward conceptual > > SFQ scheduler[2] in a number of respects: > > > > - fq_codel is byte-based, employing a deficit round-robin mechanism[3] > > between queues. The quantum is configurable, but defaults to the > > interface MTU. This means that if one flow sends packets of size > > MTU/3, and another sends MTU-sized packets, the first flow will > > dequeue three packets each time it gets a turn, whereas the second > > flow only dequeues one. This is kept track of by maintaining a byte > > dequeue deficit for each queue, which is first initialised to the > > quantum value, and decreased by the packet size on each dequeue from > > the queue. > > > > - Queue space is shared: there's a global limit on the number of packets > > the queues can hold, but not one per queue. If the space runs out at > > enqueue time, the queue with the largest number of *bytes* in it will > > get a packet dropped. This means that the packet being enqueued will > > pretty much never be dropped; rather a different packet is dropped, > > and not necessarily from the same queue. Packets are always dropped > > from the head of a queue. > > > > - fq_codel maintains two lists of active queues, called "new" and "old" > > queues ("new" and "old" being the terms used in the code). When a > > packet is added to a queue that is not currently active, that queue is > > added to the list of new queues. This is the source of some subtlety > > in the packet scheduling at dequeue time, explained below. > > > > - Most of fq_codel's scheduling is done at packet dequeue time. It > > consists of three parts: selecting a queue from which to dequeue a > > packet, actually dequeuing it (employing the CoDel algorithm in the > > process), and some final bookkeeping. > > > > For the first part, the scheduler first looks at the list of new > > queues; for each queue in that list, if that queue has a negative > > deficit (i.e. it has already dequeued a quantum of bytes (or more)), > > its deficit is increased by one quantum, and the queue is put onto the > > end of the list of old queues, and the routine starts again. > > Otherwise, that queue is selected for dequeue. If the list of new > > queues is empty, the scheduler proceeds down the list of old queues in > > the same fashion (checking the deficit, and either selecting the queue > > for dequeuing, or increasing the deficit and putting the queue back at > > the end of the list). > > > > After having selected a queue from which to dequeue a packet, the > > CoDel algorithm is invoked on that queue. This applies the CoDel > > control law in the usual fashion, and may discard one or more packets > > from the head of the queue, before returning the packet that should be > > dequeued (or nothing if the queue is or becomes empty while being > > handled by the CoDel algorithm). > > > > Finally, if the CoDel algorithm did not return a packet, the queue is > > empty, and the scheduler does one of two things: if the queue selected > > for dequeue came from the list of new queues, it is moved to the end > > of the list of old queues. If instead it came from the list of old > > queues, that queue is removed from the list, to be added back (as a > > new queue) the next time a packet arrives that hashes to that queue. > > Then, (since no packet was available for dequeue), the whole dequeue > > process is restarted from the beginning. > > > > If, instead, the scheduler *did* get a packet back from the CoDel > > algorithm, it updates the byte deficit for the selected queue before > > returning the packet to the lower layers of the kernel networking > > stack for sending. > > > > The new/old queue distinction has a particular consequence for queues > > that don't build up more than a quantum bytes before being visited by > > the scheduler (as a new queue), *and* then stay empty until the > > scheduler comes back around to it in the list of old queues: Such queues > > will get removed from the list, and then re-added as a new queue each > > time a packet arrives for it, and so will always get priority over > > queues that do not empty out each round. Exactly how much data a flow > > has to send to keep its queue in this state is somewhat difficult to > > reason about, because it depends on both the egress link speed and the > > number of concurrent flows. This makes it harder to reason about the > > behaviour of fq_codel. However, in practice many things that are > > beneficial to have prioritised for typical internet use (ACKs, DNS > > lookups, interactive SSH, HTTP requests; and also ICMP pings) *tend* to > > fall in this category, which is why fq_codel performs so well for many > > practical applications. > > > > For those interested in examining the behaviour of fq_codel in more > > detail, the code can be found in your local Linux source tree, in > > net/sched/sch_fq_codel.c[4]. While some of it can be somewhat of a > > challenge to comprehend, it overall is a very instructive example of a > > practical implementation of a queueing algorithm in a modern operating > > system. > > > > [1] http://queue.acm.org/detail.cfm?id=2209336 > > > > [2] For a description of SFQ, refer to your system man page `man tc-sfq` > > or to the paper by Paul McKenney: > > http://www2.rdrop.com/~paulmck/scalability/paper/sfq.2002.06.04.pdf > > > > [3] See http://en.wikipedia.org/wiki/Deficit_round_robin > > > > [4] Or online at > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq_codel.c > > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Sr. Network Kernel Developer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer