[Codel] [Cake] Proposing COBALT

Fri May 20 09:22:07 EDT 2016

Hi Jonathan,

> On May 20, 2016, at 14:18 , Jonathan Morton <chromatix99 at gmail.com> wrote:
> 
>>> One of the major reasons why Codel fails on UDP floods is that its drop schedule is time-based.  This is the correct behaviour for TCP flows, which respond adequately to one congestion signal per RTT, regardless of the packet rate.  However, it means it is easily overwhelmed by high-packet-rate unresponsive (or anti-responsive, as with TCP acks) floods, which an attacker or lab test can easily produce on a high-bandwidth ingress, especially using small packets.
>> 
>> In essence I agree, but want to point out that the protocol itself does not really matter but rather the observed behavior of a flow. Civilized UDP applications (that expect their data to be carried over the best-effort internet) will also react to drops similar to decent TCP flows, and crappy TCP implementations might not. I would guess with the maturity of TCP stacks misbehaving TCP flows will be rarer than misbehaving UDP flows (which might be for example well-behaved fixed-rate isochronous flows that simply should never have been sent over the internet).
> 
> Codel properly handles both actual TCP flows and other flows supporting TCP-friendly congestion control.  The intent of COBALT is for BLUE to activate whenever Codel clearly cannot cope, rather than on a protocol-specific basis.  This happens to dovetail neatly with the way BLUE works anyway.

	Well, as I said I agree, only wanted to smart alec around the tcp versus udp flood destinction. And I fully agree the behaviur should depend on observed flow behavior and not header values…

> 
>>> BLUE’s up-trigger should be on a packet drop due to overflow (only) targeting the individual subqueue managed by that particular BLUE instance.  It is not correct to trigger BLUE globally when an overall overflow occurs.  Note also that BLUE has a timeout between triggers, which should I think be scaled according to the estimated RTT.
>> 
>> That sounds nice in that no additional state is required. But with the current fq_codel I believe, the packet causing the memory limit overrun, is not necessarily from the flow that actually caused the problem to beginn with, and I doesn’t fq_codel actuall search the fattest flow and drops from there. But I guess that selection procedure could be run with blue as as well.
> 
> Yes, both fq_codel and Cake search for the longest extant queue and drop packets from that on overflow.  It is this longest queue which would receive the BLUE up-trigger at that point, which is not necessarily the queue for the arriving packet.
> 
>>> BLUE’s down-trigger is on the subqueue being empty when a packet is requested from it, again on a timeout.  To ensure this occurs, it may be necessary to retain subqueues in the DRR list while BLUE’s drop probability is nonzero.
>> 
>> Question, doesn’t this mean the affected flow will be throttled quite harshly? Will blue slowly decrease the drop probability p if the flow behaves? If so, blue could just disengage if p drops below a threshold?
> 
> Given that within COBALT, BLUE will normally only trigger on unresponsive flows, an aggressive up-trigger response from BLUE is in fact desirable.  

	Sure, by that point the flow had ample/some time to react, but didn’t so a sliding tackle is warranted.

> Codel is far too meek to handle this situation; we should not seek to emulate it when designing a scheme to work around its limitations.

	And again since we triggerd blue by crossiing a threshold we know that codel’s way of asking nicely whether the flow might reduce its bandwidth lead o where…

> 
> BLUE’s down-trigger decreases the drop probability by a smaller amount (say 1/4000) than the up-trigger increases it (say 1/400).  These figures are the best-performing configuration from the original paper, which is very readable, and behaviour doesn’t seem to be especially sensitive to the precise values (though only highly-aggregated traffic was considered, and probably on a long timescale).  For an actual implementation, I would choose convenient binary fractions, such as 1/256 up and 1/4096 down, and a relatively short trigger timeout.
> 
> If the relative load from the flow decreases, BLUE’s action will begin to leave the subqueue empty when serviced, causing BLUE’s drop probability to fall off gradually, potentially until it reaches zero.  At this point the subqueue is naturally reset and will react normally to subsequent traffic using it.

	But if we reach a queue length of codel’s target (for some small amount of time), would that not be the best point in time to hand back to codel? Otherwise we push the queue to zero only to have codel come in and let it grow back to target (well approximately).

> 
> The BLUE paper: http://www.eecs.umich.edu/techreports/cse/99/CSE-TR-387-99.pdf

	If I had time I would read that now ;)

> 
>>> Note that this does nothing to improve the situation regarding fragmented packets.  I think the correct solution in that case is to divert all fragments (including the first) into a particular queue dependent only on the host pair, by assuming zero for src and dst ports and a “special” protocol number.  
>> 
>> I believe the RFC recommends using the SRC IP, DST IP, Protocol, Identity tuple, as otherwise all fragmented flows between a host pair will hash into the same bucket…
> 
> I disagree with that recommendation, because the Identity field will be different for each fragmented packet,

	Ah, I see from rfc 791 (https://tools.ietf.org/html/rfc791):
    The identification field is used to distinguish the fragments of one
    datagram from those of another.  The originating protocol module of
    an internet datagram sets the identification field to a value that
    must be unique for that source-destination pair and protocol for the
    time the datagram will be active in the internet system.  The
    originating protocol module of a complete datagram sets the
    more-fragments flag to zero and the fragment offset to zero.

I agree the identity field decidely does the wrong thing, by spreading even a single flow over all hash buckets. That leaves my proposal from earlier, extract the ports from packets marked MF=1 Fragment offset packets, store the identity and use the stored values to calculate the hash values for all other packets in the same fragmented datagram… That sounds expensive enough to initially punt and use your idea, but certainly it is not ideal.

> even if many such packets belong to the same flow.  This would spread these packets across many subqueues and give them an unfair advantage over normal flows, which is the opposite of what we want.
> 
> Normal traffic does not include large numbers of fragmented packets (I would expect a mere handful from certain one-shot request-response protocols which can produce large responses), so it is better to shunt them to a single queue per host-pair.

	This kind of special-casing can easily be abused as an attack vector… really if possible even fragmented flows should be hashed properly. If you are unlucky and set the wrong MTU for a ppoe link for example all full MTU packets will be fragmented and it would be nice to even show grace under load ;)

Best Regards
	Sebastian

> 
> - Jonathan Morton
>