From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail2.tohojo.dk (mail2.tohojo.dk [IPv6:2a01:4f8:200:3141::101]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by huchra.bufferbloat.net (Postfix) with ESMTPS id C7B8221F1BE; Thu, 6 Feb 2014 06:23:22 -0800 (PST) X-Virus-Scanned: amavisd-new at example.com Received: by alrua-kau.localdomain (Postfix, from userid 1000) id 0253747FE6; Thu, 6 Feb 2014 15:23:07 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toke.dk; s=201310; t=1391696587; bh=0akpjQ4Z3znJa0J2il93uLi8nUS21BYQuSCkskx+bAI=; h=From:To:Subject:Date; b=UquRjN1mG8fpR+fxbex17S9ztP19y7ctevIg2gWVCblbLDBOwp6FXxVIUyqyEA0sy fq8ljTDWwryIjadQaWymiydi60Cj9DeY9ydr6rkUYgHhm6dcNDn/hskKkMV+LS7m9z SufnGm4O46oK5OFBP6synCFmo8E7WuYSVhAxmYdY= From: =?utf-8?Q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: bloat , codel Date: Thu, 06 Feb 2014 15:23:03 +0100 Message-ID: <87ha8cl3qw.fsf@toke.dk> Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Subject: [Codel] Describing fq_codel X-BeenThere: codel@lists.bufferbloat.net X-Mailman-Version: 2.1.13 Precedence: list List-Id: CoDel AQM discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 06 Feb 2014 14:23:23 -0000 --=-=-= Content-Type: text/plain I recently had occasion to describe the scheduling mechanism in fq_codel in plain text. I thought this might be useful to have for general reference somewhere, so I'm posting it here to elicit comments and have a plan to post it to the bufferbloat wiki somewhere afterwards. See below. -Toke FQ_CODEL The fq_codel queueing discipline in Linux is an implementation of a somewhat modified stochastic fairness queueing algorithm, with CoDel added as an AQM for the individual queues. As such, it consists of two parts: the scheduler which selects which queue to dequeue a packet from, and the CoDel AQM which works on each of the queues built in to the qdisc. The subtleties of fq_codel are mostly in the scheduling part, which is the subject of this description. For a description of CoDel, refer to Kathy and Van's paper[1]. The interaction between the scheduler and the CoDel algorithm are fairly straight forward: At initiation (i.e. at boot, or when fq_codel is first installed on the interface, or its parameters are changed), the list of queues is initialised so that each queue has a separate set of CoDel state variables. By default, 1024 queues are created, and packets are hashed into them at enqueue time. Each queue maintains the CoDel state variables throughout its lifetime, and so acts the same as the non-fq CoDel variant would (including retaining the control law state when the queue drains, etc). As for the scheduler, it is different from a straight-forward conceptual SFQ scheduler[2] in a number of respects: - fq_codel is byte-based, employing a deficit round-robin mechanism[3] between queues. The quantum is configurable, but defaults to the interface MTU. This means that if one flow sends packets of size MTU/3, and another sends MTU-sized packets, the first flow will dequeue three packets each time it gets a turn, whereas the second flow only dequeues one. This is kept track of by maintaining a byte dequeue deficit for each queue, which is first initialised to the quantum value, and decreased by the packet size on each dequeue from the queue. - Queue space is shared: there's a global limit on the number of packets the queues can hold, but not one per queue. If the space runs out at enqueue time, the queue with the largest number of *bytes* in it will get a packet dropped. This means that the packet being enqueued will pretty much never be dropped; rather a different packet is dropped, and not necessarily from the same queue. Packets are always dropped from the head of a queue. - fq_codel maintains two lists of active queues, called "new" and "old" queues ("new" and "old" being the terms used in the code). When a packet is added to a queue that is not currently active, that queue is added to the list of new queues. This is the source of some subtlety in the packet scheduling at dequeue time, explained below. - Most of fq_codel's scheduling is done at packet dequeue time. It consists of three parts: selecting a queue from which to dequeue a packet, actually dequeuing it (employing the CoDel algorithm in the process), and some final bookkeeping. For the first part, the scheduler first looks at the list of new queues; for each queue in that list, if that queue has a negative deficit (i.e. it has already dequeued a quantum of bytes (or more)), its deficit is increased by one quantum, and the queue is put onto the end of the list of old queues, and the routine starts again. Otherwise, that queue is selected for dequeue. If the list of new queues is empty, the scheduler proceeds down the list of old queues in the same fashion (checking the deficit, and either selecting the queue for dequeuing, or increasing the deficit and putting the queue back at the end of the list). After having selected a queue from which to dequeue a packet, the CoDel algorithm is invoked on that queue. This applies the CoDel control law in the usual fashion, and may discard one or more packets from the head of the queue, before returning the packet that should be dequeued (or nothing if the queue is or becomes empty while being handled by the CoDel algorithm). Finally, if the CoDel algorithm did not return a packet, the queue is empty, and the scheduler does one of two things: if the queue selected for dequeue came from the list of new queues, it is moved to the end of the list of old queues. If instead it came from the list of old queues, that queue is removed from the list, to be added back (as a new queue) the next time a packet arrives that hashes to that queue. Then, (since no packet was available for dequeue), the whole dequeue process is restarted from the beginning. If, instead, the scheduler *did* get a packet back from the CoDel algorithm, it updates the byte deficit for the selected queue before returning the packet to the lower layers of the kernel networking stack for sending. The new/old queue distinction has a particular consequence for queues that don't build up more than a quantum bytes before being visited by the scheduler (as a new queue), *and* then stay empty until the scheduler comes back around to it in the list of old queues: Such queues will get removed from the list, and then re-added as a new queue each time a packet arrives for it, and so will always get priority over queues that do not empty out each round. Exactly how much data a flow has to send to keep its queue in this state is somewhat difficult to reason about, because it depends on both the egress link speed and the number of concurrent flows. This makes it harder to reason about the behaviour of fq_codel. However, in practice many things that are beneficial to have prioritised for typical internet use (ACKs, DNS lookups, interactive SSH, HTTP requests; and also ICMP pings) *tend* to fall in this category, which is why fq_codel performs so well for many practical applications. For those interested in examining the behaviour of fq_codel in more detail, the code can be found in your local Linux source tree, in net/sched/sch_fq_codel.c[4]. While some of it can be somewhat of a challenge to comprehend, it overall is a very instructive example of a practical implementation of a queueing algorithm in a modern operating system. [1] http://queue.acm.org/detail.cfm?id=2209336 [2] For a description of SFQ, refer to your system man page `man tc-sfq` or to the paper by Paul McKenney: http://www2.rdrop.com/~paulmck/scalability/paper/sfq.2002.06.04.pdf [3] See http://en.wikipedia.org/wiki/Deficit_round_robin [4] Or online at https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq_codel.c --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQEcBAEBCAAGBQJS85rHAAoJEENeEGz1+utPxV0IAK6yLMzU7UG/AcMfF/DQ1l1Q TV4ndqG2a+86uxyF8gDdvYNI+VucJ2neMCKJFgAZ8oF9W/LT2cZX+pAcZlZOKCgk rOhoWKWkd7Sdppr9qR2IcaEb2YCIGsSxYd6aEeFI3A0RVNUkwYyUlGH/G1+ru99E NaNf2CnFtdWh75l6KojgmT38DjXdrka8UEr2TVpEj35yX6cnv7wd1E+us+suAKBy B70GRtPc+jN908HLc9nBMPoUah+zRdinifZ9axtdpKPVehE5J+CF97aM2zCHr/fd mIG8iS/46bltZPJLqbCIg6DY+xDT1GAYehQlboa7kxb4nXbG0P5/9FScH/rVa0s= =4ysg -----END PGP SIGNATURE----- --=-=-=--