[Ecn-sane] Comments on L4S drafts

Fri Jun 7 14:07:53 EDT 2019

Thanks Jake,

I'll address each of your questions inline. But I notice that I need to 
lay down some context first.

The problem boils down to deployment incentives. The introduction of 
fine-grained congestion control requires changes to sender, receiver and 
at least the bottleneck link before it is effective. ECN deployment 
faced the same 3-part deployment problem. So we tried hard to learn from 
it.

Faced with a 3-part deployment, no single party makes a move unless they 
judge that the potential gain is worth the effort and that /all/ the 
other parts (server, client, network) are strongly likely to make the 
same judgement {Note 1}.

The effort isn't just the coding; it's all the hassle of dealing with 
unexpected consequences of making the change, e.g. the risk of people's 
Internet service being taken out by a middlebox black-holing the new 
protocol. A high risk of high cost/effort needs a very high gain.

So the improvement has to be remarkable. Not just incremental, but 
stunning enough to enable applications that are not even possible 
otherwise.

The aim here is to use the last unicorn in the world (ECT(1)) to the 
full. If we don't make delay extremely low and extremely consistent 
we'll have wasted it. So we must focus on 99th percentile delay (and 
more 9s if you want to take longer to measure it). Now, inline...

On 05/06/2019 01:01, Holland, Jake wrote:
> Hi Bob,
>
> I have a few comments and questions on draft-ietf-tsvwg-ecn-l4s-id-06
> and draft-ietf-tsvwg-l4s-arch-03.
>
> I've been re-reading these with an eye toward whether it would be
> feasible to make L4S compatible with SCE[1] by using ECN capability alone
> as the dualq classifier (roughly as described in Appendix B.3 of l4s-id),
> and using ECT(1) to indicate a sub-loss congestion signal, assuming
> some reasonable mechanism for reflecting the ECT(1) signals to sender
> (such as AccECN in TCP, or even just reflecting each SCE signal in the
> NS bit from receiver, if AccECN is un-negotiated).
>
> I'm trying to understand the impact this approach would have on the
> overall L4S architecture, and I thought I'd write out some of the
> comments and questions that taking this angle on a review has left me
> with.
>
> This approach of course would require some minor updates to DCTCP or other
> CCs that hope to make use of the sub-loss signal, but the changes seem
> relatively straightforward (I believe there's a preliminary
> implementation that was able to achieve similarly reduced RTT in lab) and
> the idea of course comes with some tradeoffs--I've tried to articulate the
> key ones I noticed below, which I think are mostly already stated in the
> l4s drafts, but I thought I'd ask your opinion of whether you agree with
> this interpretation of what these tradeoffs would look like, or there
> are other important points you'd like to mention for consideration.
May I give this proposal a name for brevity: ECN-DualQ-SCE (which 
sort-of represents ECN as the input classifier into 1 of 2 queues and 
SCE as the output from that queue).

>
>
> 1.
> Of course, I understand using SCE-style signaling with ECT capability as
> the dualq classifier would come with a cost that where there's classic ECT
> behavior at endpoints, the low latency queue would routinely get some
> queue-building, until there's pretty wide deployment of scalable controllers
> and feedback for the congestion signals at the endpoints.
>
> This is a downside for the proposal, but of course even under this downside,
> there's the gains described in Section 5.2 of l4s-arch:
>     "State-of-the-art AQMs:  AQMs such as PIE and fq_CoDel give a
>        significant reduction in queuing delay relative to no AQM at all."
Indeed, herein lies the problem. Imagine you are trying to convince a 
network operator to start a major project to tender for a new low 
latency technology then deploy it across their access network. You tell 
them it will also depend on:
* servers/CDNs deploying new OS code.
* and clients deploying new OS code.
Then you tell them that, until /most/ servers deploy, and /most/ clients 
deploy (maybe a decade?), the low latency queue will routinely add as 
much queue delay as we can already get (without clients and servers 
changing)....

One day, you continue, if all the other servers and clients passing 
traffic through that box get upgraded, it will be cool. Until that day, 
a gamer in augmented reality gets stunningly low delay,... except every 
time her daughter in the bedroom looks at a mate's facebook page or 
watches a YouTube clip.

Is the network operator really going to take all those risks for jam 
tomorrow (=  maybe a decade)? I really don't think so.

Then we'll have burned the last unicorn to routinely get what we've 
already got.

  * Incremental deployment means, as you deploy the new capability, old
    traffic continues to work, while new traffic gets the new service.
  * As you say, with ECN-DualQ-SCE, new traffic only gets the new
    service if there's no old traffic there. That's not only incremental
    deployment; that's also ineffective deployment.

>
> On top of that, the same pressures that l4s-arch describes that should
> cause rapid rollout of L4S should for the same reasons cause rapid rollout
> of the endpoint capabilities, especially if the network capability is
> there.

I'm afraid there are not the same pressures to cause rapid roll-out at 
all, cos it's flakey now, jam tomorrow. (Actually ECN-DualQ-SCE has a 
much greater problem - complete starvation of SCE flows - but we'll come 
on to that in Q4.)

I want to say at this point, that I really appreciate all the effort 
you've been putting in, trying to find common ground.

In trying to find a compromise, you've taken the fire that is really 
aimed at the inadequacy of the underlying SCE protocol for anything 
other than FQ. If the primary SCE proponents had attempted to articulate 
a way to use SCE in a single queue or a dual queue, as you have, that 
would have taken my fire.

>
> But regardless, the queue-building from classic ECN-capable endpoints that
> only get 1 congestion signal per RTT is what I understand as the main
> downside of the tradeoff if we try to use ECN-capability as the dualq
> classifier.  Does that match your understanding?
This is indeed a major concern of mine (not as major as the starvation 
of SCE explained under Q4, but we'll come to that).

Fine-grained (DCTCP-like) and coarse-grained (Cubic-like) congestion 
controls need to be isolated, but I don't see how, unless their packets 
are tagged for separate queues. Without a specific fine/coarse 
identifier, we're left with having to re-use other identifiers:

  * You've tried to use ECN vs Not-ECN. But that still lumps two large
    incompatible groups (fine ECN and coarse ECN) together.
  * The only alternative that would serve this purpose is the flow
    identifier at layer-4, because it isolates everything from
    everything else. FQ is where SCE started, and that seems to be as
    far as it can go.

Should we burn the last unicorn for a capability needed on 
"carrier-scale" boxes, but which requires FQ to work? Perhaps yes if 
there was no alternative. But there is: L4S.

That brings us neatly to the outstanding issues with L4S...

>
>
> 2.
> I ended up confused about how falling back works, and I didn't see it
> spelled out anywhere.  I had assumed it was a persistent state-change
> for the sender for the rest of the flow lifetime after detecting a
> condition that required it, but then I saw some text that seemed to
> indicate it might be temporary? From section 4.3 in l4s-id:
>     "Note that a scalable congestion control is not expected to change
>        to setting ECT(0) while it temporarily falls back to coexist with
>        Reno ."
>
> Can you clarify whether the fall-back is meant to be temporary or not,
> and whether I missed a more complete explanation of how it's supposed to
> work?
Firstly, as has been made clear in our latest talk/paper at Linux netdev 
and in my latest iccrg talk, currently TCP Prague only includes 
fall-back to Reno on loss. It does not do fall-back on classic ECN 
marking (yet). We're still working on RTT-independence and scaling to 
very low RTT (sub-MSS window) first.

Fall-back on loss is definitely very temporary: it does one large 
Reno-style window halving on a loss (ignoring any other losses in that 
RTT as Reno does), then immediately continues with DCTCP-style 
congestion avoidance driven by all the ECN marks (not just one per-RTT).
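
A minimal sketch of the loss fall-back behaviour described above, not the TCP Prague code itself. The class name, the per-round-trip granularity, the gain g = 1/16 (a commonly used DCTCP value) and the 2-packet window floor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalableSender:
    """Toy sender: DCTCP-style response to CE marks, Reno-style response to loss."""

    cwnd: float = 100.0    # congestion window in packets (illustrative start value)
    alpha: float = 0.0     # moving average of the CE-marked fraction
    g: float = 1.0 / 16.0  # averaging gain (assumed, a commonly used DCTCP value)

    def on_round_trip(self, acked: int, ce_marked: int, losses: int) -> None:
        """Process the feedback gathered over one round trip."""
        frac = ce_marked / acked if acked else 0.0
        self.alpha = (1.0 - self.g) * self.alpha + self.g * frac
        if losses > 0:
            # Reno-style: one halving, however many losses this round trip saw.
            self.cwnd = max(self.cwnd / 2.0, 2.0)
        elif ce_marked > 0:
            # DCTCP-style: reduce in proportion to the extent of marking.
            self.cwnd = max(self.cwnd * (1.0 - self.alpha / 2.0), 2.0)
        else:
            # Additive increase of one packet per round trip.
            self.cwnd += 1.0
```

The fall-back is temporary in the sense that nothing is remembered about the loss: the next round trip is handled by the ECN branch again.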

For classic ECN AQM detection, we only have initial design ideas. 
Olivier posted his design ideas here:
     https://github.com/L4STeam/tcp-prague/issues/2

I want to keep it simple (see response to Q4 about false negatives). 
Fall-back would be temporary, but last longer than for loss - until the 
flow next goes idle. Here's the simplest that I think might work:
     // Start checking X RTTs after the first CE mark,
     // which allows the end of Slow Start to stabilize
     if (srtt > (min_rtt + Y) || rttvar > Z) { fallback(); }
Where X,Y&Z are TBD, dependent on experiments, but say X=5-6 RTT, 
Y=4-5ms & Z=dunno_without_measuring. The min_rtt could be taken only 
since the previous start-up or idle period (or perhaps the previous 
two). An idle would have to be defined as >3-4 RTT, to allow any 
self-induced queue to drain.
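
The heuristic above could be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, X and Y take values from the ranges suggested above, and Z = 2 ms is an arbitrary placeholder because no measured value exists yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassicAqmDetector:
    """Hypothetical detector for a classic ECN AQM, evaluated once per round trip."""

    x_rtts: int = 5        # X: round trips to wait after the first CE mark
    y_secs: float = 0.005  # Y: tolerated queuing delay above min_rtt
    z_secs: float = 0.002  # Z: tolerated RTT variation (placeholder, unmeasured)
    rounds_since_first_ce: Optional[int] = None
    fallen_back: bool = False

    def on_round_trip(self, srtt: float, min_rtt: float,
                      rttvar: float, saw_ce: bool) -> bool:
        """Return True while the flow should behave as a classic sender."""
        if self.rounds_since_first_ce is None:
            if saw_ce:
                self.rounds_since_first_ce = 0  # start the X-RTT wait
            return self.fallen_back
        self.rounds_since_first_ce += 1
        if self.rounds_since_first_ce >= self.x_rtts:
            if srtt > min_rtt + self.y_secs or rttvar > self.z_secs:
                self.fallen_back = True
        return self.fallen_back

    def on_idle(self) -> None:
        """Fall-back lasts only until the flow next goes idle."""
        self.rounds_since_first_ce = None
        self.fallen_back = False
```

The caller would be responsible for deciding what counts as idle (more than 3-4 RTT, as suggested above) and for restarting the min_rtt measurement.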

The whole of L4S is experimental track. So others might take different 
approaches (e.g. BBRv2) and I'm sure our approach will evolve, which is 
why the requirement is worded liberally (it has to cover real-time, etc. 
not just TCP).

>
>
> 3.
> I also was a little confused on the implementation status of the fallback
> logic.  I was looking through some of the various links I could find, and
> I think these 2 are the main ones to consider? (from
> https://riteproject.eu/dctth/#code ):
> - https://github.com/L4STeam/sch_dualpi2_upstream
> - https://github.com/L4STeam/tcp-prague
>
> It looks like the prague_fallback_to_ca case so far only happens when
> AccECN is not negotiated, right?
That's not the same sort of fall-back. That's fall-back because without 
AccECN there's only one ECN feedback signal per RTT, so it falls back to 
the configured classic congestion controller for the whole connection. 
Which controller depends on the parameter prague_ca_fallback which 
defaults to cubic.

As said above, fall-back on classic ECN has not yet been implemented in 
TCP Prague. Of the 3 things left on our list, it's the last 'cos we're 
waiting to see the results of measurements from a CDN, to see if there 
are any single queue classic ECN AQMs out there. If there aren't, we 
would not plan to implement this requirement until there were. Whether 
others do is up to them of course.

>
> To me, the logic for when to do this (especially for rtt changes) seems
> fairly complicated and easy to get wrong, especially if it's meant to be
> temporary for the flow, or if it needs to deal with things like network path
> changes unrelated to the bottleneck, or variations in rtt due to an endpoint
> being a mobile device, or on wi-fi.
>
> Which brings me to:
>
>
> *4.
> (* I think this is the biggest point driving me to ask about this.)
>
> I'm pretty worried about mis-categorizing CE marking from classic AQM
> algorithms as L4S-style markings, when using ECT(1) as the dualq
> classifier.
>
> I did see this issue addressed in the l4s drafts, but reviewing it
> left me a little confused, so I thought I'd ask about a point I
> noticed for clarification:
>
>  From section 6.3.3 of l4s-arch:
>     "an L4S sender will have to
>     fall back to a classic ('TCP-Friendly') behaviour if it detects that
>     ECN marking is accompanied by greater queuing delay or greater delay
>     variation than would be expected with L4S"
>
>  From the abstract in l4s-arch:
>     "In
>     extensive testing the new L4S service keeps average queuing delay
>     under a millisecond for _all_ applications even under very heavy
>     load"
>
> My reading of these seems to suggest that if the sender can observe
> a variance or increase of more than 1 millisecond of rtt, it should fall
> back to classic ECN?
>
> I'm not sure yet how to square that with Section A.1.4 of l4s-id:
>     "An increase in queuing delay or in delay variation would be
>     a tell-tale sign, but it is not yet clear where a line would be drawn
>     between the two behaviours."
>
> Is the discrepancy here because the extensive testing (also mentioned in
> the abstract of l4s-arch) was mainly in controlled environments, but the
> internet is expected to introduce extra non-bottleneck delays even where
> a dualq is present at the bottleneck, such as those from wi-fi, mobile
> networks, and path changes?
No, it's simply 'cos there is no implementation of this requirement yet.

>
> Regardless, this seems to me like a worrisome gap in the spec, because if
> the claim that dualq will get deployed and enabled quickly and widely is
> correct, it means this will be a common scenario in deployment--basically
> wherever there's existing classic AQMs deployed, especially since in CPE
> devices the existing AQMs are generally configured to have a lower
> bandwidth limit than the subscriber limit, so they'll (deliberately) be
> the bottleneck whenever the upstream access network isn't overly
> congested.
FQ-CoDel is the only AQM in CPE that I know of that supports classic 
ECN. In this case, an L4S-ECN congestion controller cannot starve a 
Cubic-ECN or Reno-ECN flow, cos the FQ scheduler controls their 
capacity shares.

The only other CPE AQM I am aware of is DOCSIS-PIE, which doesn't 
support ECN.

If the IETF assigns the ECT(1) codepoint to L4S, then it would be 
extremely easy to modify FQ-CoDel to set a very shallow ECN threshold in 
any queue where at least one ECT(1) codepoint had been detected. This 
would work fine with highly transient flow queues.
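
A rough sketch of what such a modification might look like, written as a standalone model and not as a patch to any real FQ-CoDel implementation. The class, the 1 ms shallow threshold, and the reduction of the classic CoDel decision to a boolean input are all assumptions.

```python
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11  # ECN field values (RFC 3168)


class FlowQueue:
    """Per-flow queue state for a hypothetical ECT(1)-aware FQ-CoDel variant."""

    def __init__(self, shallow_threshold_s: float = 0.001):
        self.seen_ect1 = False  # set once any ECT(1) packet is seen in this queue
        self.shallow_threshold_s = shallow_threshold_s  # placeholder value

    def on_dequeue(self, ecn: int, sojourn_s: float, classic_signal: bool):
        """Return the ECN field to forward, or None to drop the packet.

        classic_signal stands in for the unmodified CoDel decision to
        signal congestion on this packet.
        """
        if ecn == ECT1:
            self.seen_ect1 = True
        ecn_capable = ecn in (ECT0, ECT1, CE)
        if self.seen_ect1 and ecn_capable and sojourn_s > self.shallow_threshold_s:
            return CE  # shallow-threshold marking for this flow queue
        if classic_signal:
            return CE if ecn_capable else None  # classic behaviour: mark or drop
        return ecn

    def reset(self) -> None:
        """Forget ECT(1) when the flow queue empties and is reused."""
        self.seen_ect1 = False
```

Because the state lives in the flow queue and is cleared on reuse, a queue that holds only classic ECT(0) traffic keeps the unmodified CoDel behaviour.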

>
> I guess if it's really a 1-2 ms variance threshold to fall back, that
> would probably address the safety concern, but it seems like it would
> have a lot of false positives, and unnecessarily fall back on a lot of
> flows.
>
> But worse, if there's some (not yet specified?) logic that tries to reduce
> those false positives by relaxing a simple very-few-ms threshold, it seems
> like there's a high likelihood of logic that produces false negatives going
> undetected.
>
> If that's the case, to me it seems like it will remain a significant risk
> even while TCP Prague has been deployed for quite a long time at a sender,
> as long as different endpoint and AQM implementations roll out randomly
> behind different network conditions, for the various endpoints that end
> up connected with the sender.
I am less worried about this. I would be comfortable erring on the side 
of reducing false positives at the expense of false negatives.

Nonetheless, this position depends on what we find in measurement studies.
* If we find no single-queue AQMs that do ECN-marking, it's a 
non-problem {Note 2}.
* If such AQMs exist but are rare, they are likely to be in specific 
operators' networks, so there would be operator-specific ways to address 
such problems. E.g. if a CDN wanted to deploy the L4S experiment on its 
caches for that network, in collaboration with the network operator it 
could set a local-use DSCP instead of using ECT(1). That would still not 
deal with L4S traffic to/from the Internet, but the probability that 
different types of long-running flows coincide is low anyway, so the 
probability that different types of flows that are both long-running and 
non-CDN will coincide must surely be tiny.

>
> It also seems to me there's a high likelihood of causing unsafe non-
> responsive sender conditions in some of the cases where this kind of false
> negative happens in any kind of systematic way.
This overstates the problem. There is no unresponsiveness. Even when two 
long-running flows coincide, an L4S flow does not actually starve a 
classic (e.g. Reno-ECN) TCP flow. They come to a balance that can be 
highly unequal in high BDP links, but never starvation or 
unresponsiveness. Indeed, as the link's BDP gets smaller, or the more 
flows there are, the more DCTCP & Reno-ECN tend to equality.
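
One way to see why the imbalance shrinks is to compare the usual steady-state window approximations under a shared marking probability p. These formulas are assumptions brought in for illustration, not taken from this thread: roughly sqrt(3/2)/sqrt(p) for Reno, and 2/p for a DCTCP-like control with probabilistic marking. Both break down at high p, so treat the numbers as indicative only.

```python
import math


def reno_window(p: float) -> float:
    """Approximate steady-state Reno window (packets) at marking probability p."""
    return math.sqrt(1.5) / math.sqrt(p)


def scalable_window(p: float) -> float:
    """Approximate steady-state DCTCP-like window (packets) at probability p."""
    return 2.0 / p


def window_ratio(p: float) -> float:
    """How many times larger the scalable window is than the Reno window."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return scalable_window(p) / reno_window(p)
```

A small BDP or many flows means small windows and therefore a high p, which is where the ratio is smallest; neither window reaches zero for any p in range, which matches the point that the flows are unequal but not starved.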

>
> By contrast, as I understand it an SCE-based approach wouldn't need the
> same kind of fallback state-change logic for the flow, since any CE would
> indicate a RFC 3168-style multiplicative decrease, and only ECT(1) would
> indicate sub-loss congestion.
I'm afraid you understand it wrong.

With the ECN-DualQ-SCE approach, any flows where the receiver does not 
feed back SCE (ECT(1)) markings starve any SCE (DCTCP-like) flows in the 
same bottleneck.

Similarly, any Reno-ECN or Cubic-ECN senders (i.e. without the logic to 
understand SCE) starve the SCE (DCTCP-like) flows in the same 
ECN-DualQ-SCE bottleneck.

And here, starve actually means starve. Not just come to a highly 
unbalanced equilibrium, but completely starve.

This is because a Cubic-ECN flow will keep pushing the queue up to the 
point where it emits CE markings, because it doesn't understand and 
therefore ignores the SCE markings. One queue can only have one length. 
So, because the Cubic flow(s) have pushed the queue past the shallower 
point where it starts to emit SCE markings, all packets not marked CE 
will be marked SCE.

For example, say Cubic flow(s) induce a fairly normal 0.5% CE marking 
(or 0.5% drop for non-ECN flows). Then there will be 99.5% SCE marking.

Then, the DCTCP-like flows designed to understand SCE will keep reducing 
in response to this saturated SCE marking and the Cubic flows will fill 
the space they leave and starve them.
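
The arithmetic above can be put into a toy model. It assumes, as in the explanation, that every packet not marked CE is marked SCE once the queue sits above the SCE threshold, and that the SCE flow reacts with a DCTCP-style reduction. The function names, the gain g = 1/16 and the 2-packet window floor are illustrative assumptions.

```python
def sce_fraction(ce_fraction: float) -> float:
    """Fraction of packets marked SCE when the queue is held above the SCE threshold."""
    return 1.0 - ce_fraction


def scalable_window_after(rtts: int, cwnd: float, sce_frac: float,
                          g: float = 1.0 / 16.0, min_cwnd: float = 2.0) -> float:
    """Window of a DCTCP-like flow after `rtts` round trips of constant SCE marking."""
    alpha = 0.0
    for _ in range(rtts):
        # Moving average of the marked fraction, then a proportional reduction
        # plus the usual additive increase of one packet per round trip.
        alpha = (1.0 - g) * alpha + g * sce_frac
        cwnd = max(cwnd * (1.0 - alpha / 2.0) + 1.0, min_cwnd)
    return cwnd


if __name__ == "__main__":
    marking = sce_fraction(0.005)  # 0.5% CE induced by the Cubic flow(s)
    print(f"SCE marking: {marking:.1%}")
    print(f"window after 200 RTTs: {scalable_window_after(200, 100.0, marking):.2f}")
```

In this model the SCE flow's window collapses to about the floor, while a flow that ignores SCE would keep growing until the far rarer CE marks arrive.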

Back in 2012, we did experiments to try to minimize this starvation, 
with two AQMs in one queue where one type of CC ignores the signals 
from the lower threshold. See:
     http://bobbriscoe.net/pubs.html#DCTCP-Internet
This led us to realize we would have to use at least two queues.

>
> This is one of the big advantages of the SCE-based approach in my mind,
> since there's no chance of mis-classifying the meaning of a CE mark and
> no need for a state change for how the sender handles the ECT backoff logic
> or sets the ECT markings.  (It just goes back to treating any CE as RFC3168-
> style loss equivalent, and SCE as a sub-loss signal.)
>
> Since an SCE-based approach would avoid this problem nicely, I consider
> the reduced risk of false negatives (and unresponsive flows) here one of the
> important gains, to be weighed against the key downside mentioned in comment
> #1.
I hope you can see now that the ECN-DualQ-SCE approach suffers from the 
same problem that concerns you with L4S. The difference is that it 
arises not in 'legacy' non-SCE queues, but in the queue implementing 
SCE marking itself.

Unless one separates non-SCE traffic into a different queue, it starves 
SCE traffic.

>
>
> 5.
> Something similar comes up again in some other places, for instance:
>
> from A.1.4 in l4s-id:
(it's A.1.1.)
>     "Description: A scalable congestion control needs to distinguish the
>     packets it sends from those sent by classic congestion controls.
>
>     Motivation: It needs to be possible for a network node to classify
>     L4S packets without flow state into a queue that applies an L4S ECN
>     marking behaviour and isolates L4S packets from the queuing delay of
>     classic packets."
>
> Listing this as a requirement seems to prioritize enabling the gains of
> L4S ahead of avoiding the dangers of L4S flows failing to back off in the
> presence of possibly-miscategorized CE markings, if I'm reading it right?
> I guess Appendix A says these "requirements" are non-normative, but I'm a
> little concerned that framing it as a requirement instead of a design
> choice with a tradeoff in its consequences is misleading here, and
> pushes toward a less safe choice.
I hope you can now see from the last part of answer #4 that, if you 
try to classify ECN flows with fine-grained (DCTCP-like) and coarse 
(Cubic-like) congestion controls into the same queue (whether L4S or SCE 
marking), the Cubic-like congestion controls ruin it.

So I think this requirement stands. I've made a note-to-self to add the 
text: "To avoid having to use per-flow classification..." though.

>
>
> 6.
> If queuing from classic ECN-capable flows is the main issue with using
> ECT as the dualq classifier, do you think it would still be possible to
> get the queuing delay down to a max of ~20-40ms right away for ECN-capable
> endpoints in networks that deploy this kind of dualq, and then hopefully
> see it drop further to ~1-5ms as more endpoints get updated with AccECN or
> some kind of ECT(1) feedback and a scalable congestion controller that
> can respond to SCE-style marking?
Technically yes, but realistically no.

What I mean is, as I said from the start, if you remove the feature that 
deploying the L4S DualQ Coupled AQM gives very low, and consistently 
very low, latency straight away, then operators will lose interest in 
deploying it.

> Or is it your position that the additional gains from the ~1ms queueing delay
> that should be achievable from the beginning by using ECT(1) (in connections
> where enough of the key entities upgrade) are worth the risks?
Well, I'd say "probably worth the risks", cos we're waiting for 
measurements to get a feel for whether any of the CE markings seen by 
the tests Apple reported in 2016-2017 are from single queue ECN AQMs.

See 
https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-implementing-the-prague-requirements-in-tcp-for-l4s-01#page=11

>
> (And if so, do you happen to have a pointer to any presentations or papers
> that made a quantitative comparison of the benefits from those 2 options?
> I don't recall any offhand, but there's a lot of papers...)
Latest results here (actually no different from results we reported in 
2015 - all the changes to the code since have been non-performance related):
"DUALPI2 - Low Latency, Low Loss and Scalable (L4S) AQM" Olga Albisser 
(Simula), Koen De Schepper (Nokia Bell-Labs), Bob Briscoe (Independent), 
Olivier Tilmans (Nokia Bell-Labs) and Henrik Steen (Simula), in Proc. 
Netdev 0x13 
<https://www.netdevconf.org/0x13/session.html?talk-DUALPI2-AQM> (Mar 2019).

The paper via the netdev link shows qdelay, utilization, completion time 
efficiency, etc with the most extreme traffic load we use (2 
long-running flows plus 5X Web flows per sec, where X is each link rate 
in Mb/s, e.g. 600 flows/sec over the 120Mb/s link), for a full range of 
link rates, round trip times, etc.

The plots are pretty crammed, so if you'd prefer one example qdelay 
cumulative distribution function for the same extreme traffic load, see 
here:
https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-implementing-the-prague-requirements-in-tcp-for-l4s-01#page=22

If you want results from a range of less-extreme traffic models, just ask.

HTH

Bob

>
>
> Best regards,
> Jake
>
>

{Note 1}: Or different server, client and network operators all agree to 
deploy, but let's assume that would be a bonus and not rely on it.

{Note 2}: Even where there are no single-queue AQMs now, there might be 
a concern that some could be enabled in future. Given that study after 
study since ECN was first standardized (2001) has detected hardly any CE 
marks on the Internet until FQ-CoDel was deployed about 15 years later, 
the chance of those AQMs being turned on now is surely vanishingly small.

-- 
________________________________________________________________
Bob Briscoe                               http://bobbriscoe.net/
