November 21, 2012
Author list TBD
The CoDel queueing algorithm has been put forward as a way of addressing the bufferbloat problem, and Eric Dumazet and Dave Taht have implemented this in the Linux kernel as net/sched/sch_codel.c, which appeared in v3.5. However, in his IETF presentation reporting on Kathleen Nichols's and his work, Van Jacobson recommends use of FQ-CoDel, which combines stochastic fairness queueing (SFQ) with CoDel. Eric and Dave implemented this as well, with net/sched/sch_fq_codel.c also appearing in v3.5. Of course, Alexey Kuznetsov implemented SFQ itself as net/sched/sch_sfq.c back in the 1990s.
This raises the question of how FQ-CoDel differs from SFQ on the one hand and from pure CoDel on the other, which this article attempts to answer. The remainder of this article addresses this question by first describing SFQ, then giving a brief overview of CoDel, and finally covering FQ-CoDel itself along with some test results demonstrating its effectiveness. This is of course followed by the Answers to Quick Quizzes.
The purpose of SFQ is straightforward: With high probability, isolate “hog” sessions so that they bear the brunt of any packet dropping that might be required. To this end, an example SFQ data structure might look as follows:
The horizontal row of boxes labeled A, B, C, and D represents a hash table of queues. Each incoming packet is enqueued based on a hash of its “quintuple”, namely its source address, source port, destination address, destination port, and IP protocol (e.g., 6 for TCP or 17 for UDP). The default number of hash buckets in the Linux kernel implementation is 128, but the figure above shows only four buckets for clarity. As shown in the diagram, each hash bucket is a queue that can hold a number of packets (empty boxes) in doubly linked lists. In the Linux kernel implementation, a given queue can hold at most 127 packets.
Each non-empty bucket is linked into a doubly linked list, which in this example contains buckets A, C, and D. This list is traversed when dequeueing packets. In this example, the next bucket to dequeue from is D, indicated by the dot-dashed arrow, and the next bucket after that is A.
Each non-empty bucket is also linked into a doubly linked list containing all other buckets with the same number of packets. These lists are indicated by the dashed arrows. These lists are anchored off of an array, shown on the left-hand side of the diagram. In this example, the buckets with one packet are A and D. The other list contains only C, which is the sole bucket with three packets.
There is also an index into this array that tracks the buckets with the most packets. In the diagram, this index is represented by the arrow referencing array element 3. This index is used to find queues to steal packets from when the SFQ overflows. This approach means that (with high probability) packets will be dropped from “hog” sessions. These dropped packets can then be expected to cause the “hog” sessions to respond by decreasing their offered load. This is the major purpose of the SFQ: To preferentially cause “hog” sessions to decrease their offered load, while allowing low-bandwidth sessions to continue undisturbed. This will in theory result in fair allocation of bandwidth at network bottlenecks, at least for some probabilistic definition of “fair”.
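For readers who prefer code to diagrams, a minimal C sketch of the bookkeeping just described might look like the following. This is an illustrative simplification rather than the actual net/sched/sch_sfq.c data structures, and all field and constant names are invented for clarity:

	/* Illustrative sketch of SFQ-style bookkeeping; these are not the
	 * actual net/sched/sch_sfq.c data structures, and all names here
	 * are invented. */
	#define SFQ_BUCKETS   128	/* default number of hash buckets */
	#define SFQ_MAX_DEPTH 127	/* maximum packets per bucket */

	struct packet {
		struct packet *next, *prev;	/* doubly linked packet list */
		/* ... length, payload, and so on ... */
	};

	struct bucket {
		struct packet *head, *tail;	/* FIFO of queued packets */
		int depth;			/* number of queued packets */
		struct bucket *dq_next, *dq_prev;	/* round-robin dequeue list */
		struct bucket *dep_next, *dep_prev;	/* same-depth list */
	};

	struct sfq {
		struct bucket buckets[SFQ_BUCKETS];
		struct bucket *next_dequeue;	/* next bucket to dequeue from */
		struct bucket *depth_list[SFQ_MAX_DEPTH + 1];	/* buckets by depth */
		int max_depth;	/* deepest non-empty depth_list[] entry, used
				 * to find a victim queue when SFQ overflows */
	};

The depth_list[] array and max_depth index correspond to the left-hand array and its index in the diagram above.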
There clearly will be some list maintenance required as packets are enqueued and dequeued, and readers interested in that sort of detail are referred to the paper.
Of course, it is possible that a low-bandwidth session will, through sheer bad luck, happen to hash to the same bucket as a “hog” session. In order to prevent such bad luck from becoming permanent bad luck, SFQ allows the hash function to be periodically perturbed, in essence periodically reshuffling the sessions. This can be quite effective, but unfortunately interacts poorly with many end-to-end congestion-control schemes because the rehashing often results in packet drops or packet reordering, either of which can cause the corresponding session to unnecessarily decrease its offered load. Nevertheless, SFQ works well enough that it is often configured as a “leaf” packet scheduler in the Linux kernel.
Quick Quiz 1:
But mightn't tricky protocol designers split their “hog” sessions over
multiple TCP sessions?
Wouldn't that defeat SFQ's attempt to fairly allocate bottleneck link
bandwidth?
Answer
The underlying CoDel algorithm is described in the LWN article, the ACM Queue paper, the CACM article, and Van Jacobson's IETF presentation. The basic idea is to control queue length, maintaining sufficient queueing to keep the outgoing link busy, but avoiding building up the queue beyond that point. Roughly speaking, this is done by preferentially dropping packets that have been sitting in the queue for too long.
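To make “sitting in the queue for too long” a bit more concrete, here is a hypothetical C sketch of a sojourn-time check in the spirit of CoDel. The constants and structure are invented for illustration, and the control law that the real net/sched/sch_codel.c code uses to pace successive drops is omitted:

	/* Hypothetical sketch of a CoDel-style sojourn-time check; the
	 * constants and structure are invented, and the control law that
	 * paces successive drops in the real code is omitted. */
	#include <stdbool.h>
	#include <stdint.h>

	#define TARGET_US   5000	/* 5 ms of standing queue is acceptable */
	#define INTERVAL_US 100000	/* 100 ms window for judging persistence */

	struct codel_state {
		uint64_t first_above_time;	/* deadline set when sojourn time
						 * first exceeded TARGET_US */
	};

	/* Should the packet dequeued at time "now" be dropped, given that it
	 * was enqueued at time "enqueue_time"?  Both times in microseconds. */
	static bool codel_should_drop(struct codel_state *st,
				      uint64_t enqueue_time, uint64_t now)
	{
		uint64_t sojourn = now - enqueue_time;

		if (sojourn < TARGET_US) {
			st->first_above_time = 0;	/* queue drained below target */
			return false;
		}
		if (st->first_above_time == 0) {
			/* Start the clock: drop only if the queue stays above
			 * target for at least one full interval. */
			st->first_above_time = now + INTERVAL_US;
			return false;
		}
		return now >= st->first_above_time;	/* persistently too long */
	}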
Nevertheless, CoDel still maintains a single queue, so that low-bandwidth voice-over-IP (VOIP) packets could easily get stuck behind higher-bandwidth flows such as video downloads. This same risk is incurred by other types of packets, including DNS lookups, DHCP packets, ARP packets, and routing packets, timely delivery of which is critical to the operation of the Internet for all types of workloads. Therefore, what we would like is to allow the low-bandwidth time-sensitive VOIP packets to jump ahead of the video download, but not to the extent that the video download is in any danger of starvation—or even in danger of significant throughput degradation. One way to do this is to combine CoDel with SFQ. Of course, this combination does require significant rework of SFQ, but fortunately Eric Dumazet was up to the job. A rough schematic of the result is shown below:
The most significant attributes of SFQ remain, namely that packets are hashed into multiple buckets. However, each bucket contains not a first-come-first-served queue as is the case with SFQ, but rather a CoDel-managed queue.
Perhaps the next most significant change is that there are now two lists linking the buckets together instead of just one. The first list contains buckets A and D, namely the buckets that with high probability contain packets from low-bandwidth time-sensitive sessions. The next bucket to be dequeued from is indicated by the dash-dotted green arrow referencing bucket D. The second list contains all other non-empty buckets, in this case only bucket C, which with high probability contains “hog” flows.
Quick Quiz 2:
But mightn't bucket C instead just contain a bunch of packets from
a number of unlucky VOIP sessions?
Wouldn't that be needlessly inflicting dropouts on the hapless VOIP users?
Answer
Quick Quiz 3:
OK, but if this session segregation is so valuable, why didn't the original
SFQ implement it?
Answer
FQ-CoDel operates by dequeueing from each low-bandwidth bucket, dequeueing from one of the “hog” buckets, and repeating. If a given low-bandwidth bucket accumulates too many packets, it is relegated to the end of the “hog” list. If a bucket from either list becomes empty, it is removed from whichever list it is on. A packet arriving at an empty bucket initially causes that bucket to be classified as holding a low-bandwidth session, so that the bucket is placed on the low-bandwidth list.
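A deliberately simplified C sketch of this dequeueing discipline appears below. All of the names are invented, only the dequeue side is shown, and the real net/sched/sch_fq_codel.c code tracks its deficits in bytes and runs the full CoDel machinery on each bucket, but the overall shape is similar:

	/* Simplified sketch of FQ-CoDel-style dequeueing; names are invented
	 * and details differ from the real net/sched/sch_fq_codel.c code. */
	#include <stddef.h>

	struct packet { struct packet *next; };

	struct flow {				/* one hash bucket */
		struct flow *next;		/* link within the new/old list */
		struct packet *pkts;		/* head of this bucket's packet queue */
		int deficit;			/* credit remaining in this round */
	};

	struct flow_list { struct flow *head, *tail; };

	struct fq_codel {
		struct flow_list new_flows;	/* low-bandwidth ("new") buckets */
		struct flow_list old_flows;	/* everything else ("hog") buckets */
		int quantum;			/* credit granted per visit */
	};

	static void flow_list_add_tail(struct flow_list *l, struct flow *f)
	{
		f->next = NULL;
		if (l->tail)
			l->tail->next = f;
		else
			l->head = f;
		l->tail = f;
	}

	static void flow_list_del_head(struct flow_list *l)
	{
		l->head = l->head->next;
		if (!l->head)
			l->tail = NULL;
	}

	static struct packet *fq_codel_dequeue(struct fq_codel *q)
	{
		for (;;) {
			/* Serve the low-bandwidth list first, then the "hog" list. */
			struct flow_list *list =
				q->new_flows.head ? &q->new_flows : &q->old_flows;
			struct flow *f = list->head;

			if (!f)
				return NULL;		/* nothing queued at all */

			if (f->deficit <= 0) {
				/* Bucket has used up its credit: relegate it to the
				 * end of the "hog" list and grant a fresh quantum. */
				f->deficit += q->quantum;
				flow_list_del_head(list);
				flow_list_add_tail(&q->old_flows, f);
				continue;
			}

			if (!f->pkts) {
				/* Bucket emptied: remove it from whichever list. */
				flow_list_del_head(list);
				continue;
			}

			/* The real code runs CoDel here, possibly dropping one
			 * or more head packets before handing one back. */
			struct packet *p = f->pkts;

			f->pkts = p->next;
			f->deficit--;	/* the real code subtracts the packet length */
			return p;
		}
	}

On the enqueue side (not shown), a packet hashing to an empty bucket would place that bucket on new_flows, giving it the low-bandwidth treatment until it exhausts its credit.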
Quick Quiz 4:
Doesn't this initial misclassification unfairly penalize competing
low-bandwidth time-sensitive flows?
Answer
Another key change is that FQ-CoDel drops packets from the head of the queue, rather than from the tail, as has been the tradition, a tradition that SFQ adhered to. To see the benefit of dropping from the head rather than the tail, keep in mind that for many transport protocols (including TCP), a dropped packet signals the sender to reduce its offered load. Clearly, the faster this signal reaches the sender the better.
But if we drop from the tail of a long queue, this signal must propagate through the queue as well as traversing the network to the receiver and then (via some sort of acknowledgement) back to the sender. In contrast, if we drop from the head of a long queue, the signal need not propagate through the queue itself, but needs only traverse the network. This faster propagation enables the transport protocols to more quickly adjust their offered load, resulting in faster reduction in queue length, which in turn results in faster reduction in network round-trip time, which finally improves overall network responsiveness.
In addition, use of head drop instead of tail drop results in dropping of older packets, which is helpful in cases where faster propagation of newer information is more valuable than slower propagation of older information.
Another difference between SFQ and FQ-CoDel is that the array on the left-hand side of the diagram is just an array of ints in FQ-CoDel, as opposed to SFQ's array of list headers. This change was necessary because FQ-CoDel does its accounting in bytes rather than packets, which allows the benefits of byte queue limits (BQL) to be brought to bear (using CONFIG_BQL).
That said, because there is an extremely large number of possible packet sizes, blindly using the SFQ approach would have resulted in a truly huge array. For example, assume an MTU of 512 bytes with a limit of 127 packets per bucket. If the SFQ approach were used, with a separate array entry per possible bucket size in bytes, the array would need more than 65,000 entries (127 packets of 512 bytes each gives 65,024 possible bytes per bucket), which is clearly overkill.
Instead, for FQ-CoDel, the left-hand array has one entry per bucket, where each entry contains the current count of bytes for the corresponding bucket. When it is necessary to drop a packet, FQ-CoDel scans this array looking for the largest entry. Because the array has only 1024 entries comprising 4096 contiguous bytes, the caches of modern microprocessors make short work of scanning this array. Yes, there is some overhead, but then again one of the strengths of CoDel is that packet drops are normally reasonably infrequent.
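As a concrete illustration, such a scan might look something like the hypothetical sketch below; the names are invented and the actual code in net/sched/sch_fq_codel.c differs in detail:

	/* Hypothetical sketch: find the bucket holding the most bytes, which
	 * is the bucket that dropping then targets.  Names are invented. */
	#include <stdint.h>

	#define NUM_BUCKETS 1024

	/* backlogs[i] holds the number of queued bytes in bucket i;
	 * 1024 entries of 4 bytes each fill 4096 contiguous bytes. */
	static uint32_t backlogs[NUM_BUCKETS];

	static unsigned int fattest_bucket(void)
	{
		unsigned int i, max_idx = 0;
		uint32_t max_bytes = 0;

		for (i = 0; i < NUM_BUCKETS; i++) {
			if (backlogs[i] > max_bytes) {
				max_bytes = backlogs[i];
				max_idx = i;
			}
		}
		return max_idx;	/* drop from the head of this bucket's queue */
	}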
Finally, FQ-CoDel does not perturb the hash function at runtime. Instead, a hash function is selected randomly from a set of about 4 billion possible hash functions at boot time.
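In other words, “selecting a hash function” amounts to choosing a random 32-bit seed once (hence the roughly four billion possibilities) and mixing it into every per-packet hash thereafter. The following userspace-flavored sketch illustrates the idea; the mixing function and seeding shown here are invented for illustration, the kernel relying instead on its jhash-based flow hashing:

	/* Hypothetical sketch: pick one of ~2^32 hash functions by choosing
	 * a random 32-bit seed once, then mixing it into every hash.  The
	 * mixing and seeding here are for illustration only. */
	#include <stdint.h>
	#include <stdlib.h>
	#include <time.h>

	#define NUM_BUCKETS 1024

	static uint32_t hash_seed;	/* chosen once, never perturbed at runtime */

	static void flow_hash_init(void)
	{
		srand((unsigned int)time(NULL));	/* crude seeding, for illustration */
		hash_seed = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
	}

	/* Mix the connection quintuple (reduced to three words for brevity)
	 * with the seed to pick a bucket. */
	static unsigned int flow_hash(uint32_t saddr, uint32_t daddr, uint32_t ports)
	{
		uint32_t h = hash_seed;

		h ^= saddr;  h *= 0x9e3779b1u;	/* simple multiplicative mixing; */
		h ^= daddr;  h *= 0x9e3779b1u;	/* the kernel uses jhash instead  */
		h ^= ports;  h *= 0x9e3779b1u;
		return h % NUM_BUCKETS;
	}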
Because FQ-CoDel is built on top of a number of Linux-kernel networking features, it is usually not sufficient to simply enable it. Instead, a group of related kernel configuration parameters must be enabled in order to get the full benefits of FQ-CoDel:
CONFIG_NET_SCH_FQ_CODEL
CONFIG_BQL
CONFIG_NET_SCH_HTB
To demonstrate the effectiveness of FQ-CoDel, Dave Taht and David Woodhouse ran a test that concurrently ran four TCP uploads and four TCP downloads, along with four low-bandwidth workloads, three of which used UDP, with the fourth being ICMP ping packets. The graphs below show the throughputs of the TCP streams and the latencies of the low-bandwidth workloads. The graph to the right uses FQ-CoDel, while the one to the left does not.
As you can see, FQ-CoDel is extremely effective, improving the low-bandwidth latency by roughly a factor of four, with no noticeable degradation in throughput for the uploads and downloads. Note also that without FQ-CoDel, the latency is closely related to the throughput, as can be seen by the step-up behavior when first the downloads and then the uploads start. In contrast, the FQ-CoDel latency is not affected much by the throughput, as is desired.
Quick Quiz 5:
Why the jumps in throughput near the beginnings and ends of the tests?
Answer
Quick Quiz 6:
Why the smaller per-session spikes in throughput during the tests?
Answer
FQ-CoDel combines the best of CoDel and SFQ, making a few needed changes along the way. Testing thus far has shown that it works extremely well for current Internet traffic.
So, what happens if someone comes up with a type of traffic that it does not handle very well? Trust me, this will happen sooner or later. When it happens, it will be dealt with—and even now, those working on FQ-CoDel are looking at other active queue management (AQM) schemes to see if FQ-CoDel can be further improved. However, FQ-CoDel works well as is, so we can expect to see it deployed widely, which means that we should soon reap the benefits of improved VOIP sessions with minimal impact on bulk-data downloads.
TBD.
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: But mightn't tricky protocol designers split their “hog” sessions over multiple TCP sessions? Wouldn't that defeat SFQ's attempt to fairly allocate bottleneck link bandwidth?
Answer: Indeed it might, because the separate TCP sessions would probably occupy different buckets, each getting a separate share of the bandwidth. If this sort of thing becomes too common, there are ways to deal with it. And there will no doubt be ways of abusing the resulting modified SFQ. Hey, I never promised you that life would be easy! ;-)
Quick Quiz 2: But mightn't bucket C instead just contain a bunch of packets from a number of unlucky VOIP sessions? Wouldn't that be needlessly inflicting dropouts on the hapless VOIP users?
Answer: Indeed it might. Which is why there are all those “with high probability” qualifiers in the description. However, given that FQ-CoDel uses no fewer than 1024 hash buckets, the probability that (say) 100 VOIP sessions will all hash to the same bucket is something like ten to the power of minus 300.
Quick Quiz 3: OK, but if this session segregation is so valuable, why didn't the original SFQ implement it?
Answer: Two reasons: (1) I didn't think of it at the time, and (2) It might not have been a winning strategy for the low-clock-rate 68000 CPUs that I was using at the time.
Quick Quiz 4: Doesn't this initial misclassification unfairly penalize competing low-bandwidth time-sensitive flows?
Answer: Again, indeed it might. However, a “hog” flow is likely to persist for some time, so the fraction of time that it spends misclassified is usually insignificant.
Quick Quiz 5: Why the jumps in throughput near the beginnings and ends of the tests?
Answer: Good question. This is being investigated.
Quick Quiz 6: Why the smaller per-session spikes in throughput during the tests?
Answer: Packet drops can force individual sessions to sharply reduce their offered load momentarily. The sessions recover quickly and sometimes also overshoot when slow-starting, resulting in the spikes. Note that the overall average throughput, indicated by the black trace, does not vary much, so the aggregate bandwidth is quite steady.