[Cake] Control theory and congestion control

Tue May 12 22:51:36 EDT 2015

> On 13 May, 2015, at 02:23, David Lang <david at lang.hm> wrote:
> 
>> 1) The most restrictive signal seen during an RTT is the one to react to.  So a “fast down” signal overrides anything else.
> 
> sorry for joining in late, but I think you are modeling something that doesn't match reality.
> 
> are you really going to see two bottlenecks in a given round trip (or even one connection)? Since you are ramping up fairly slowly, aren't you far more likely to only see one bottleneck (and once you get through that one, you are pretty much set through the rest of the link)

It’s important to remember that link speeds can change drastically over time (usually if it’s *anything* wireless), that new competing traffic might reduce the available bandwidth suddenly, and that as a result the bottleneck can *move* from an ELR-enabled queue to a different queue which might not be.  I consider that far more likely than an ELR queue abruptly losing control as Sebastian originally suggested, but it looks similar to the endpoints.

So what you might have is an ELR queue happily controlling the cwnd based on the assumption that *it* is the bottleneck, which until now it has been.  But *after* that queue is another one which has just *become* the bottleneck, and it’s not ELR - it’s plain ECN.  The only way it can tell the flow to slow down is by giving “fast down” signals.  But that’s okay, the endpoints will react to that just as they should do, as long as they correctly interpret the most restrictive signal as being the operative one.
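
As a rough illustration of "the most restrictive signal is the operative one", here is a minimal Python sketch.  The `Signal` enum and `operative_signal` function are names invented for this example, not part of any ELR implementation; the only thing taken from the discussion is the five signal names and their ordering from most to least restrictive.  It also assumes that an RTT with no marked packets counts as "fast up".

```python
from enum import IntEnum


class Signal(IntEnum):
    # Ordered from most to least restrictive (hypothetical encoding).
    FAST_DOWN = 0
    SLOW_DOWN = 1
    HOLD = 2
    SLOW_UP = 3
    FAST_UP = 4


def operative_signal(signals):
    """Return the most restrictive signal seen during one RTT.

    An empty sequence is treated as FAST_UP (assumption: no packet
    was modified by any queue on the path).
    """
    return min(signals, default=Signal.FAST_UP)


# An ELR queue says "hold", but a plain-ECN queue behind it marks CE,
# which the endpoint reads as "fast down": the latter wins.
seen = [Signal.HOLD, Signal.HOLD, Signal.FAST_DOWN]
print(operative_signal(seen).name)  # FAST_DOWN
```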

Or maybe the new bottleneck is a dumb FIFO.  In this case, ELR will initially hold the cwnd constant, but the FIFO will fill up, increasing latency and reducing throughput at the same cwnd.  This will cause ELR to start giving “slow up” and then maybe “fast up” signals, and might thereby relinquish control of the flow automatically.  Note that “fast up” is signalled by ELR *not modifying* any packets.
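
To put numbers on the FIFO case: a window-limited flow delivers roughly one cwnd per RTT, so holding the cwnd constant while the FIFO adds queueing delay lowers throughput in proportion.  The sketch below uses made-up figures (a 62,500-byte cwnd and RTTs of 50 ms and 100 ms) purely for illustration; `throughput_bps` is a hypothetical helper, and integer arithmetic is used to keep the results exact.

```python
def throughput_bps(cwnd_bytes, rtt_ms):
    """Approximate throughput of a window-limited flow, in bits per second.

    Assumes exactly one cwnd of data is delivered per RTT.
    """
    return cwnd_bytes * 8 * 1000 // rtt_ms


cwnd = 62_500  # bytes, held constant by the "hold" signal (example value)

print(throughput_bps(cwnd, 50))   # 10000000 (baseline RTT)
print(throughput_bps(cwnd, 100))  # 5000000 (FIFO has doubled the RTT)
```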

Or maybe the new bottleneck is a drop-only AQM.  In that case, the first sign of it will be a dropped packet after, if anything, only a small increase in latency (i.e. not large enough, for long enough, for ELR to do very much about it).  At this point, the observable network state is indistinguishable from that of a randomly lost packet, i.e. one that is not congestion related.

The safe option here is to react like an ECN-enabled flow, treating any lost packet as a “fast down” signal.  An alternative is to treat a lost packet as “slow down” *if* it is accompanied by “slow up” or “hold” signals in the same RTT (i.e. there’s a reasonable belief that we’re being properly controlled by ELR).  While “slow down” doesn’t react as quickly as a new bottleneck queue might prefer, it does at least respond; if enough drops appear, the ELR queue’s control loop will be shifted to “fast up”, relinquishing control.  Or, if the AQM isn’t tight enough to do that, the corresponding increase in RTT will do it instead.
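
The two loss-handling options can be sketched as a small decision function.  This is only an illustration of the rule as described above: `interpret_loss` and its `trust_elr` switch are hypothetical names, and signals are represented as plain strings for brevity.

```python
def interpret_loss(signals_this_rtt, trust_elr=False):
    """Decide how an endpoint treats a lost packet.

    signals_this_rtt: names of the ELR signals seen in the same RTT
        as the loss, e.g. ["hold", "slow up"].
    trust_elr: hypothetical switch; False selects the safe option,
        True selects the alternative described above.
    """
    if trust_elr and any(s in ("slow up", "hold") for s in signals_this_rtt):
        # ELR appears to be in control, so respond gently.
        return "slow down"
    # Safe default: behave like an ECN-enabled flow seeing congestion.
    return "fast down"


print(interpret_loss(["hold"]))                     # fast down
print(interpret_loss(["hold"], trust_elr=True))     # slow down
print(interpret_loss(["fast up"], trust_elr=True))  # fast down
```

With `trust_elr=True`, a loss seen alongside only “fast up” (i.e. unmodified packets) still maps to “fast down”, since nothing then suggests that an ELR queue is controlling the flow.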

> (if it's a new flow, it should start slow and ramp up, so you, and the other affected flows, should all be good with a 'slow down' signal)

Given that slow-start grows the cwnd exponentially, that might not be the case after the first few RTTs.  But that’s all part of the control loop, and ELR would normally signal it with the CE codepoint rather than dropping packets.  Sebastian’s scenario of “slow down” suddenly changing to “omgwtfbbq drop everything now” within the same queue is indeed unlikely.

>> I fully appreciate that *some* network paths may be unstable, and any congestion control system will need to chase the sweet spot up and down under such conditions.
>> 
>> Most of the time, however, baseline RTT is stable over timescales of the order of minutes, and available bandwidth is dictated by the last-mile link as the bottleneck.  BDP and therefore the ideal cwnd is a simple function of baseline RTT and bandwidth.  Hence there are common scenarios in which a steady-state condition can exist.  That’s enough to justify the “hold” signal.
> 
> Unless you prevent other traffic from showing up on the network (phones checking e-mail, etc). I don't believe that you are ever going to have stable bandwidth available for any noticeable timeframe.

On many links, light traffic such as e-mail will disturb the balance too little to even notice, especially with flow isolation.  Assuming ELR is implemented as per my later post, running without flow isolation will allow light traffic to perturb the ELR signal slightly, converting a “hold” into a random sequence of “slow up”, “hold” and “slow down”, but this will self-correct conservatively, with ELR transitioning to a true “slow up” briefly if required.

Of course, as with any speculation of this nature, simulations and other experiments will tell a more convincing story.

 - Jonathan Morton