[Codel] [aqm] Codel's count variable and re-entering dropping state at small time intervals

Mon Jul 20 15:34:59 EDT 2015

I can’t speak for Kathy, but as Cake’s author I have some insight into the changes I’ve made to the associated version of Codel.

> Ronald Bless:

> 1. after exiting dropping state, count is usually reset, unless "the
> next drop state is entered too close to the previous one”.

In fact, in most cases (ie. where the true RTT is near to or less than the one assumed by the interval parameter, and the flow is persistent) the drop state will be re-entered sufficiently soon to prevent the reset.  The backoff behaviour for that case is thus more significant in practice.

The standard backoff behaviour is to reduce count by just 2.  This is obviously wrong in the case where count has already reached a significant value; with a high drop frequency, count will climb a lot during a single RTT.  But if the elevated drop frequency is necessary and sufficient to control the flow, we don’t want it to grow indefinitely.

Cake’s version halves the count, reducing the drop frequency by a factor of sqrt(2) after a pause.  This multiplicative backoff prevents count from growing indefinitely, and allows it to reduce gradually to a lower value if that becomes appropriate due to reduced network load.

Anil's suggestion of continuously reducing count during the non-dropping state is an interesting alternative which might also make sense.  It has the pleasing property of relating the backoff rate to the length of time spent outside the dropping state.

> 2. is 'count' supposed to be reset or saturated on overflow and what
> should be its maximum value (it makes a difference whether you are using
> 16-, 32-, or 64-bit counter)?

It’s clear to me that a saturating increment is necessary to deal with unresponsive flows adequately.  Otherwise, the drop frequency will wrap around to a low value periodically when it needs to be consistently high.

My test case for this involves 50 upstream TCP flows on a narrow link (say 10Mbit) and a single downstream TCP flow on a wide link (say 100Mbit).  In this configuration, the acks for the downstream flow constitute an unresponsive (in fact, anti-responsive) upstream flow of greater throughput demand than the individual upstream TCP flows.  Anti-responsive means that dropping packets actually increases the network load, because reducing the ack density permits higher throughput on the downlink, which in turn generates additional acks on the uplink.

Since Cake uses very robust flow isolation, simply delivering all the acks in order, using a fair-share of the bandwidth, would severely limit the throughput of the downstream flow.  However, the standard Codel behaviour resulted in wildly oscillating throughput on the downstream flow.  Straightforward calculations suggested that the count variable was probably wrapping.

At first, I only put in the modified backoff behaviour mentioned above.  Since the downstream flow was often reaching full throughput, I thought that occasionally the drop rate was managing to empty the ack queue, making the backoff behaviour significant. This assumption turned out to be correct, but the backoff modification was insufficient to completely correct the behaviour.  Only when I changed count to use a saturating increment instead of a wrapping one did the downstream flow achieve full throughput consistently.

It’s possible that if count was a wider variable (32 instead of 16 bits) that the backoff modification would have been sufficient to prevent wrapping in this case.  However, using saturating arithmetic is inexpensive on modern CPUs (many have dedicated instructions which could be used by highly optimised implementations) and is the most robust solution.

Note that in the absence of robust flow isolation, eg. with plain Codel or PIE, the upstream TCP flows will back off to make room for all the acks.  This does not require a particularly high drop rate, but it reduces the upstream goodput significantly.  Cake instead achieves high goodput in both directions in this especially challenging case.

> Anil Agarwal:

> I have attached the updated document.

There are indeed a lot of suggestions here, which I haven’t yet fully assimilated.  I’ll cherry-pick a few that stood out:

> When RTT is small compared to the default Interval, CoDel is slow in reacting and controlling queue size/delay during TCP slow start.

> For example, if Interval = 100 ms and RTT = 20 ms, CoDel will not drop or mark a packet for the first 100 ms of a new TCP connection; if link rate = 10 Mbps, initial window size = 14600 bytes, then during this period, the TCP cwnd will grow to 104 kbytes; queuing delay will be 63 ms.

Cake’s modified Codel does vary its sensitivity according to how quickly the queue is filling.  With the default parameters, it may trigger as quickly as 35ms after a large burst arrives, or as slowly as 200ms if the queue is growing very slowly.

It could be improved further in theory, by giving Codel visibility of the length and drain rate of the queue (and thus the ability to predict future delay) rather than only observing the exit delay.  However, in the general case the drain rate is variable, so this may be difficult to make robust.

Ultimately, triggering instantly when the queue begins to fill is optimal to deal with slow-start, which will double its cwnd further while the congestion signal is in transit, before responding with a halving.  However, an instant trigger is suboptimal for the congestion-avoidance phase of TCP, which is why Codel always delays its response.  Flow isolation is a good (if imperfect) countermeasure for the suboptimal slow-start behaviour this causes.

My “Explicit Load Regulation” proposal, which hasn’t yet been widely circulated, aims to get closer to optimal behaviour in both situations.  It is an extension to ECN behaviour, giving the option of more moderate signals than ECN permits, while being easier to deploy incrementally than DCTCP is.  Hopefully I’ll be able to write it up properly later.

> The draft rfc states - “For example, there is about an 86% probability that no more than two of the 100 VoIP sessions will be involved in any given collision”
> Based on analytical equations for hash collision probabilities (I derived them independently some time ago), the probability of no collision = 90.78%…

Cake’s set-associative hash function reduces the probability of flow collisions further, to negligible values for the above number of flows.

> Alternatively, codel_dequeue() can be re-structured as two routines…

Certainly a refactoring would be beneficial for readability.  I haven’t got around to doing it yet.

 - Jonathan Morton