Discussion of explicit congestion notification's impact on the Internet
* [Ecn-sane] IETF 110 quick summary
From: Pete Heist @ 2021-03-08 23:47 UTC (permalink / raw)
  To: ECN-Sane

Just responding to Dave's ask for a quick IETF 110 summary on ecn-sane,
after one day. We presented the data on ECN at MAPRG
(https://datatracker.ietf.org/doc/draft-heist-tsvwg-ecn-deployment-observations/
). It basically just showed that ECN is in use by endpoints (more as a
proportion across paths than a proportion of flows), that RFC3168 AQMs
do exist out there and are signaling, and that the ECN field can be
misused. There weren't any questions, maybe because we were the last to
present and were already short on time.

We also applied that to L4S by first explaining that risk is the
product of severity and prevalence, and tried to increase the awareness
about the flow domination problem when L4S flows meet non-L4S flows
(ECN or not) in a 3168 queue. Spreading this information seems to go
slowly, as we're still hearing "oh really?", which leads me to believe
1) that people are tuning this debate out, and 2) it just takes a long
time to comprehend, and to believe. It's still our stance that L4S
can't be deployed due to its signalling design, or if it is, the end
result is likely to be more bleaching and confusion with the DS field.
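
To make that framing concrete, here's a toy sketch in Python (the 1-5
scores are invented purely for illustration, not numbers from the talk):

# Toy illustration of risk = severity * prevalence; scores are made up.
scenarios = {
    "L4S vs. classic flow in a single-queue RFC 3168 AQM":   (5, 2),
    "L4S flow alone through fq_codel (self-inflicted delay)": (2, 4),
    "L4S flow through a tail-drop FIFO (drop response works)": (1, 5),
}
for name, (severity, prevalence) in scenarios.items():
    print(f"{name}: risk = {severity * prevalence}")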

There was a question I'd already heard before about why fq_codel is
being deployed at an ISP, so I tried to cover that over in tsvwg.
Basically, fq_codel is not ideal for this purpose, lacking host and
subscriber fairness, but it's available and effective, so it's a good
start.
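
For anyone wondering what host/subscriber fairness buys you, here's a
back-of-envelope sketch (hypothetical link rate and flow counts, not
measurements): with per-flow fairness only, a subscriber's share grows
with its flow count.

# Per-flow-fair scheduler on a hypothetical 100 Mbit link: a subscriber
# opening many flows crowds out a subscriber with one flow, since there
# is no per-host/per-subscriber layer of fairness above the flows.
link_mbit = 100.0
flows_per_subscriber = {"sub_a": 16, "sub_b": 1}  # e.g. torrent vs. one TCP flow
total_flows = sum(flows_per_subscriber.values())
for sub, n in flows_per_subscriber.items():
    print(f"{sub}: {n} flows -> {link_mbit * n / total_flows:.1f} Mbit")
# sub_a gets ~94 Mbit, sub_b ~6 Mbit; a host-fair scheduler would give 50/50.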

Wednesday's TSVWG session will be entirely devoted to L4S drafts.




* Re: [Ecn-sane] IETF 110 quick summary
From: Dave Taht @ 2021-03-08 23:57 UTC (permalink / raw)
  To: Pete Heist; +Cc: ECN-Sane

Thx very much for the update. I wanted to note that
Preseem does a lot of work with WISPs and I wish they'd share more
data on it, as well as our ever-present mention of free.fr.

Another data point is that Apple's early rollout of ECN was kind of
a failure, and there are now so many workarounds in the OS for it as
to make coherent testing impossible.

I do wish there were more work on ECN-enabling BBR, as presently
it often negotiates ECN and then completely ignores it. You can
see this in traces from Dropbox in particular.
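
To illustrate the failure mode, a toy model (emphatically not BBR's
actual code): the sender completes the ECN handshake but its congestion
control never consumes the marks.

# Toy model only: ECN is negotiated at setup, but CE feedback is ignored,
# so marks never reduce the sending rate.
class EcnBlindSender:
    def __init__(self):
        self.ecn_negotiated = True  # ECE/CWR handshake completed
        self.cwnd = 10.0
    def on_ack(self, ce_marked: bool):
        if ce_marked:
            pass            # an RFC 3168 sender would halve cwnd here
        self.cwnd += 1.0    # keeps growing regardless of marks

s = EcnBlindSender()
for _ in range(5):
    s.on_ack(ce_marked=True)
print(s.cwnd)  # 15.0 after five CE marks: zero response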



On Mon, Mar 8, 2021 at 3:47 PM Pete Heist <pete@heistp.net> wrote:
> [...]



-- 
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729


* Re: [Ecn-sane] IETF 110 quick summary
From: Holland, Jake @ 2021-03-09  2:13 UTC (permalink / raw)
  To: Dave Taht, Pete Heist; +Cc: ECN-Sane

The presentations were pretty great, but they were really short
on time.  In the chat a person or two was surprised about the way
L4S will impact competing non-ECT traffic sharing a queue.
I agree some of the people who have tuned out the discussion are
learning things from these presentations, and I thought Jonathan's
slot was a good framing of the real question, and Pete's study was
also very helpful.

I seem to recall a thread in the wake of Apple's ECN enabling about
one of the Linux distros considering turning ECN on by default for
outbound connections. One participant found that it completely
wrecked his throughput, so the change got tabled, unfortunately with
no pcap posted.

Any recollection of where that was?  I was guessing it might be
one of the misbehaviors from the network that Apple encountered.

I also thought Apple had a sysctl to disable the hold-downs and
always use ECN in spite of the heuristics, did that not work?

-Jake

On 3/8/21, 3:57 PM, "Dave Taht" <dave.taht@gmail.com> wrote:

[...]



* Re: [Ecn-sane] IETF 110 quick summary
From: Steven Blake @ 2021-03-09  4:06 UTC (permalink / raw)
  To: Holland, Jake; +Cc: ECN-Sane

If I'm a random network operator, not participating in any L4S
experiments, and L4S traffic traversing my network hits a bottleneck,
what happens? Consider all of the cases: no AQM (tail-drop), AQM with
drop, and AQM with classic ECN.

My understanding was that TCP-Prague's classic bottleneck detection
code wasn't fully baked.


On Tue, 2021-03-09 at 02:13 +0000, Holland, Jake wrote:
> [...]


Regards,

// Steve






* Re: [Ecn-sane] IETF 110 quick summary
From: Pete Heist @ 2021-03-09  8:21 UTC (permalink / raw)
  To: Dave Taht; +Cc: ECN-Sane

On Mon, 2021-03-08 at 15:57 -0800, Dave Taht wrote:
> Thx very much for the update. I wanted to note that
> Preseem does a lot of work with WISPs and I wish they'd share more
> data on it, as well as our ever-present mention of free.fr.
> 
> Another data point is that Apple's early rollout of ECN was kind of
> a failure, and there are now so many workarounds in the OS for it as
> to make coherent testing impossible.

Is there any info on how it failed, or on what the workarounds
are?

> I do wish there were more work on ECN-enabling BBR, as presently
> it often negotiates ECN and then completely ignores it. You can
> see this in traces from Dropbox in particular.

I haven't tested BBR more than briefly. I'd expect:
- BBR harming itself through fq_codel as TCP RTT goes up, though if it
also uses that as a signal to back off, I don't know the end result
- harm to competing traffic in a tunnel through fq_codel

> [...]




* Re: [Ecn-sane] IETF 110 quick summary
From: Pete Heist @ 2021-03-09  8:43 UTC (permalink / raw)
  To: Holland, Jake; +Cc: ECN-Sane

On Tue, 2021-03-09 at 02:13 +0000, Holland, Jake wrote:
> The presentations were pretty great, but they were really short
> on time.  In the chat a person or two was surprised about the way
> L4S will impact competing non-ECT traffic sharing a queue.
> I agree some of the people who have tuned out the discussion are
> learning things from these presentations, and I thought Jonathan's
> slot was a good framing of the real question, and Pete's study was
> also very helpful.

I'm glad to hear that. At least it adds something to your earlier
work, from a different vantage point, albeit at a much smaller scale.
I know that studies from entirely disinterested parties would be good
too, but that might be the hard part, they're disinterested! :)

> I seem to recall a thread in the wake of Apple's ECN enabling about
> one of the Linux distros considering turning ECN on by default for
> outbound connections. One participant found that it completely
> wrecked his throughput, so the change got tabled, unfortunately with
> no pcap posted.
> 
> Any recollection of where that was?  I was guessing it might be
> one of the misbehaviors from the network that Apple encountered.

That is odd and would be good to know about. I enabled ECN on my Linux
laptop a long time ago and haven't noticed a problem that I'm aware of.
I wish the distros would reconsider enabling it, unless there are
active reasons it shouldn't be deployed, but they may now just be in a
holding pattern on it.
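
For reference, the knob in question on Linux is net.ipv4.tcp_ecn
(0 = off, 1 = also request ECN on outbound connections, 2 = only accept
it when the peer asks; 2 has long been the default). A minimal check,
assuming a stock kernel:

# Read the current ECN mode from procfs (Linux only).
from pathlib import Path
mode = Path("/proc/sys/net/ipv4/tcp_ecn").read_text().strip()
print({"0": "off", "1": "on for outbound too",
       "2": "passive (default)"}.get(mode, f"unknown ({mode})"))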

> I also thought Apple had a sysctl to disable the hold-downs and
> always use ECN in spite of the heuristics, did that not work?
> 
> -Jake
> 
> [...]




* Re: [Ecn-sane] IETF 110 quick summary
From: Pete Heist @ 2021-03-09  9:57 UTC (permalink / raw)
  To: Steven Blake; +Cc: ECN-Sane

On Mon, 2021-03-08 at 23:06 -0500, Steven Blake wrote:
> If I'm a random network operator, not participating in any L4S
> experiments, and L4S traffic traversing my network hits a bottleneck,
> what happens? Consider all of the cases: no AQM (tail-drop), AQM with
> drop, and AQM with classic ECN.
> 
> My understanding was that TCP-Prague's classic bottleneck detection
> code wasn't fully baked.

Hi Steven, I'll take a crack at this as I see it anyway:

*No AQM (tail-drop) & AQM with drop*

Both _should_ be OK, as L4S transports, at least Prague, treat drop
with a 50% multiplicative decrease (barring one bug which has been
fixed). We have tested with straight tail-drop FIFOs and drop-based
AQMs, and AFAIK so far it was safe, even if performance wasn't ideal
in all cases.

*AQM-classic ECN, single queue*

Severity:

L4S flows drive competing flows, ECN-capable or not, down to somewhere
around minimum cwnd. FCT for shorter flows is also harmed, though some
flows can do better, if they complete before leaving slow start.
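
A back-of-envelope sketch of why, using the textbook steady-state
response curves (constants approximate and purely illustrative):

from math import sqrt

# Steady-state cwnd (packets) vs. AQM mark probability p. Classic
# Reno-style ECN backs off as ~1.22/sqrt(p); a DCTCP/Prague-style
# response scales as ~2/p, so it needs a much higher p to be governed.
for p in (0.01, 0.05, 0.20):
    print(f"p={p:.2f}: classic cwnd ~ {1.22 / sqrt(p):5.1f}, "
          f"L4S-style cwnd ~ {2.0 / p:6.1f}")
# At the p needed to hold an L4S-style flow near 10 packets (p=0.20),
# a classic flow is pinned close to minimum cwnd (~2.7 packets).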

Prevalence:

We're not sure how many single-queue AQMs are enabled, so it's unclear
how often this would be a problem. Maybe rarely, but it's hard to
believe that there are zero single-queue RFC 3168 AQMs enabled out there.

*AQM-classic ECN, FQ*

Severity:

Same as AQM-classic ECN single queue, _when there is a problem_.

Prevalence:

FQ protects competing flows, unless L4S and non-L4S traffic ends up in
the same queue. This can happen with a hash collision, or maybe more
commonly, with tunneled traffic in tunnels that support copying the ECN
bits from the inner to the outer. If anyone thinks of any other reasons
we haven't considered why competing flows would share the same 5-tuple
and thus the same queue, do mention it. :) We've tried to get a handle
on the percentage of random paths with fq_codel deployed. In one
environment we measured around 10%, but for the general Internet that
estimate is still uncertain by an order of magnitude, given that the
study was relatively small
(https://tools.ietf.org/html/draft-heist-tsvwg-ecn-deployment-observations-02#section-3.2
).
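
A sketch of the collision mechanism (toy hash, not fq_codel's actual
dissector): the scheduler classifies on the outer 5-tuple, so every
flow inside a tunnel lands in the same queue.

import zlib

# Toy flow classifier: bucket by whatever 5-tuple the scheduler can see.
def bucket(five_tuple, nbuckets=1024):
    return zlib.crc32(repr(five_tuple).encode()) % nbuckets

inner = [("10.0.0.2", "198.51.100.9", "tcp", 40000 + i, 443) for i in range(4)]
outer = ("192.0.2.1", "203.0.113.5", "esp", 0, 0)  # e.g. an IPsec tunnel
print([bucket(f) for f in inner])      # distinct buckets when untunneled
print([bucket(outer) for _ in inner])  # all collapse into one bucket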

Lastly, not a safety problem but a performance problem: when L4S flows
traverse ANY fq_codel bottleneck they impose delays on themselves,
since they don't respond to CE in the way the AQM expects. That leads
to intra-flow latency spikes, explained here:
https://github.com/heistp/l4s-tests/#intra-flow-latency-spikes
So this will happen on whatever percentage of paths fq_codel, or any
other RFC 3168 AQM, is deployed on. Delay spikes after rate reductions
can be higher with CoDel due to how its control law works.
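
For reference, CoDel spaces its signals at interval/sqrt(count), so
against a flow that doesn't back off as expected, count inflates and
then takes time to unwind after the flow finally slows. A simplified
sketch of the schedule (real CoDel also tracks sojourn time):

from math import sqrt

interval_ms = 100.0
t = 0.0
for count in range(1, 9):
    gap = interval_ms / sqrt(count)  # signals arrive faster as count grows
    t += gap
    print(f"signal {count} at ~{t:6.1f} ms (gap {gap:5.1f} ms)")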

> [...]




* Re: [Ecn-sane] IETF 110 quick summary
From: Jonathan Morton @ 2021-03-09 11:06 UTC (permalink / raw)
  To: Holland, Jake; +Cc: Dave Taht, Pete Heist, ECN-Sane

> On 9 Mar, 2021, at 4:13 am, Holland, Jake <jholland@akamai.com> wrote:
> 
> In the chat a person or 2 was surprised about the way
> L4S will impact NECT competing traffic when competing in a queue.

I think that was mostly Martin Duke.  I caught up with him in the IETF Gather space immediately afterwards and discussed this with him, one to one, and he now seems to understand more clearly what we were presenting.  I was pleased to hear that he's also familiar with the "risk matrix" formulation I presented.

> We also applied that to L4S by first explaining that risk is the
> product of severity and prevalence…

And also, crucially, the concept of "externalised risk", i.e. the distinction between involved participants, interested observers, and innocent bystanders.  L4S has innocent bystanders (existing networks and their users, who have no idea that L4S even exists, nor how to troubleshoot ECN-related problems) incur most of the risk of Bad Things happening.  This is an "externalised risk" which is very difficult to manage after the fact, and must be minimised to a much greater extent than other risks.

SCE ensures that innocent bystanders incur virtually no risk, in that bad interactions only occur for people actually using SCE over an SCE-enabled path, which is where mitigations can actually be practical to employ - in the limit, by switching off SCE.  This is much easier to accept in a risk analysis.  We didn't get to that slide, however, due to shortage of time.

 - Jonathan Morton


* Re: [Ecn-sane] IETF 110 quick summary
From: Jonathan Morton @ 2021-03-09 13:53 UTC (permalink / raw)
  To: Pete Heist; +Cc: Steven Blake, ECN-Sane

> On 9 Mar, 2021, at 11:57 am, Pete Heist <pete@heistp.net> wrote:
> 
> FQ protects competing flows, unless L4S and non-L4S traffic ends up in
> the same queue. This can happen with a hash collision, or maybe more
> commonly, with tunneled traffic in tunnels that support copying the ECN
> bits from the inner to the outer. If anyone thinks of any other reasons
> we haven't considered why competing flows would share the same 5-tuple
> and thus the same queue, do mention it.

Bob Briscoe's favourite defence to this, at the moment, seems to be that multiple flows sharing one tunnel are *also* disadvantaged when they share an FQ AQM bottleneck with multiple other flows that are not tunnelled, and which the FQ mechanism *can* distinguish.  Obviously this is specious, but it's worth pinning down exactly *why* so we can explain it back to him (and more importantly, anyone else paying attention).

Bob's scenario involves entirely conventional traffic, and a saturated bottleneck managed by an FQ-AQM (fq_codel), which is itself shared with at least one other flow.  We assume that all AQMs in existing networks are ECN enabled (as distinct from the also-common policers which only drop).  The FQ mechanism treats the tunnel as a single flow, and shares out bandwidth equally on that basis.  So the throughput available to the tunnel as a whole is one share of the total, no matter how many flows occupy the tunnel.  Additionally, the same AQM mark/drop rate is applied to everything in the tunnel, causing the flows using it to adopt an RTT-fair relationship to each other.

The disadvantage experienced by the tunnel (relative to a plain AQM) is proportional to the number of flows using the tunnel, and only indirectly related to the number of other flows using the bottleneck.  This I would classify as Minor severity, since it is a moderate, sustained effect.  It increases in effect only linearly with the load on the tunnel, which is the same as at any ordinary bottleneck - and this is routinely tolerated.
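
To put rough numbers on that (hypothetical link rate and flow counts):

# Hypothetical: a 100 Mbit FQ-AQM bottleneck carrying 3 plain flows plus
# one tunnel of 4 flows. FQ sees 4 "flows", so the tunnel gets one share.
link = 100.0
plain_flows, tunnel_flows = 3, 4
tunnel_share = link / (plain_flows + 1)          # 25.0 Mbit for the whole tunnel
per_inner = tunnel_share / tunnel_flows          # ~6.2 Mbit per tunnelled flow
true_fair = link / (plain_flows + tunnel_flows)  # ~14.3 Mbit per-flow fair share
print(tunnel_share, per_inner, true_fair)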

Note that if the tunnel is the only traffic using the bottleneck, the situation is equivalent to a plain, single-queue AQM.  This is an important degenerate case, which we can come back to later.  Also, in principle the effect can be avoided by either not using the tunnel, or by dividing the flows between multiple tunnels that the FQ mechanism *can* distinguish.  This puts the risk into either an "involved participant" or "interested observer" category, unless the tunnel has been imposed on the user without knowledge or consent.  What this means is that the tunnel user might reasonably consider the security or privacy benefit of the tunnel to outweigh the performance defect it incurs, and thereby choose to continue using it.

Now, let us add one L4S flow to the tunnel, replacing one of the conventional flows in it, but keeping everything else the same.  The conventional flows *outside* the tunnel are unaffected, because they are protected by the FQ-AQM.  But the conventional flows *inside* the tunnel, which the FQ-AQM cannot protect because it cannot distinguish them, are immediately squashed to minimum cwnd or thereabouts, which may be considerably less than the fair-share BDP within that allocated by the tunnel.  The L4S flow thereby grows to dominate the tunnel traffic as described elsewhere.  This is clearly a Major severity effect, as the conventional traffic in the tunnel is seriously impaired.

Note that if the tunnel shared a plain AQM bottleneck, without FQ, with other conventional flows outside the tunnel, these other flows would *also* be squashed by the L4S flow in the tunnel.  This is because the AQM must increase its signalling rate considerably to control the L4S flow, and it applies the same signalling rate to all traffic.  The FQ-AQM only increases signalling to the flow requiring it.

Returning to the degenerate case where the tunnel is the only traffic using the bottleneck, the situation remains the same within the tunnel, and the behaviour is again equivalent to a plain AQM, with the L4S flow dominating and the conventional traffic severely impaired.  The tunnel as a whole now occupies the full bottleneck rather than merely a fraction of it, but almost all of this extra capacity is used by the L4S flow, and can't be effectively used by the conventional flows within the tunnel.

It is therefore clear that the effect is caused by the L4S flow meeting a conventional AQM, and not by the FQ mechanism.  Furthermore, the effect of an L4S flow within a tunnel is *over and above* any effects imposed on the tunnel as a whole by an FQ-AQM.

The main proposed solution to this is to upgrade the AQM at the bottleneck, so that it understands the ECT(1) signal distinguishing the L4S traffic from conventional traffic.  But this imposes the burden of mitigating the problem on the existing network, an "innocent bystander".  This is therefore clearly not an appropriate strategy; L4S should instead ensure that it reacts appropriately to congestion signals produced by existing networks, which by RFC-3168 compliance treat ECT(1) as equivalent to ECT(0).

If L4S cannot do this reliably - and we doubt that it can - then it must either be redesigned to use an unambiguous signal, or explicitly confined to networks which have been prepared for it by removing/upgrading all conventional AQMs.  We have proposed two possible methods of redesigning L4S, both of which have been rejected by the L4S team.

 - Jonathan Morton


* Re: [Ecn-sane] IETF 110 quick summary
From: Sebastian Moeller @ 2021-03-09 14:27 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: Pete Heist, ECN-Sane

Hi Jonathan,


> On Mar 9, 2021, at 14:53, Jonathan Morton <chromatix99@gmail.com> wrote:
> 
>> On 9 Mar, 2021, at 11:57 am, Pete Heist <pete@heistp.net> wrote:
>> 
>> FQ protects competing flows, unless L4S and non-L4S traffic ends up in
>> the same queue. This can happen with a hash collision, or maybe more
>> commonly, with tunneled traffic in tunnels that support copying the ECN
>> bits from the inner to the outer. If anyone thinks of any other reasons
>> we haven't considered why competing flows would share the same 5-tuple
>> and thus the same queue, do mention it.
> 
> Bob Briscoe's favourite defence to this, at the moment, seems to be that multiple flows sharing one tunnel are *also* disadvantaged when they share an FQ AQM bottleneck with multiple other flows that are not tunnelled, and which the FQ mechanism *can* distinguish.  Obviously this is specious, but it's worth pinning down exactly *why* so we can explain it back to him (and more importantly, anyone else paying attention).

	[SM] I think the way forward on this would be to embrace the IPv6 flow label and include it in the hash (sure, it will not help with IPv4 tunnels). That way even tunneled flows can reveal themselves to upper layers and get per-flow treatment (or they can decide to keep to their secret ways, their choice). I think that trying to abuse the flow label will result in massive reordering for the tunneled flow, so it might still be a risk (but it seems hard for an abuser to gain more usable capacity).
How do such tunnels behave in the prevalent FIFOs: do they actually get a share depending on their number of hidden constituent flows, or are they treated as a single flow? And in either case, isn't that a policy question the operator of the bottleneck should be able to control?
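
	[SM] A toy sketch of what I mean (simplified hash, and it assumes the tunnel ingress actually sets a distinct label per inner flow):

import zlib

# Toy classifier: mixing the IPv6 flow label into the dissection key lets
# an fq scheduler separate tunneled flows that share one outer tuple.
def bucket(outer_tuple, flow_label=0, nbuckets=1024):
    return zlib.crc32(repr((outer_tuple, flow_label)).encode()) % nbuckets

outer = ("2001:db8::1", "2001:db8::2", "esp")
print([bucket(outer) for _ in range(3)])                   # label ignored: one bucket
print([bucket(outer, lbl) for lbl in (0x11, 0x22, 0x33)])  # labels split the queues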

I snipped the rest of your excellent analysis, as I only want to bring up the flow label to side-step that issue partially. This does not solve L4S's misdesign, but it will take some of the wind out of Bob's argument's sails...

Best Regards
	Sebastian


* Re: [Ecn-sane] IETF 110 quick summary
From: Dave Taht @ 2021-03-09 14:35 UTC (permalink / raw)
  To: Sebastian Moeller; +Cc: Jonathan Morton, ECN-Sane

I would certainly like to see more exploration of when and where the
IPv6 flow label gets peed on. But as it is yet another untried idea...

On Tue, Mar 9, 2021 at 6:27 AM Sebastian Moeller <moeller0@gmx.de> wrote:
> [...]



-- 
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729


* Re: [Ecn-sane] IETF 110 quick summary
From: Holland, Jake @ 2021-03-09 15:57 UTC (permalink / raw)
  To: Pete Heist; +Cc: ECN-Sane

On 3/9/21, 12:43 AM, "Pete Heist" <pete@heistp.net> wrote:
> I'm glad to hear that. At least it adds something to your earlier
> work, from a different vantage point, albeit at a much smaller scale.
> I know that studies from entirely disinterested parties would be good
> too, but that might be the hard part, they're disinterested! :)

Yeah, sorry I couldn't manage to re-run my scripts successfully yet.
I'm still curious to figure out if there's been any deployment motion,
but the attempts that are easy to try haven't succeeded, and I haven't
had time to refactor it to insist on an answer.  (I re-tried a few times,
hoping it was a cluster capacity issue that would sort itself out or that
it would complete if I batched the jobs smaller, but no joy.)

>> I seem to recall a thread in the wake of Apple's ECN enabling about
>> one of the Linux distros considering turning ECN on by default for
>> outbound connections. One participant found that it completely
>> wrecked his throughput, so the change got tabled, unfortunately with
>> no pcap posted.
>>
>> Any recollection of where that was?  I was guessing it might be
>> one of the misbehaviors from the network that Apple encountered.
>
> That is odd and would be good to know about. I enabled ECN on my Linux
> laptop a long time ago and haven't noticed a problem that I'm aware of.
> I wish the distros would reconsider enabling it, unless there are
> active reasons it shouldn't be deployed, but they may now just be in a
> holding pattern on it.

There are apparently a few misbehaving boxes out there, so using ECN from
the wrong network location can leave you messed up, which I thought was
why Apple had the heuristics running to detect pathologies and respond
by turning off ECT for a while.  But it's pretty hard to pin down what's
happening (and from where) without a zillion clients running ECN from all
over the world.

Anyway, I thought I remembered Dave posting a link to that thread to one
of the lists (maybe this one?) and commenting in the thread (my vague
recollection was he was asking for a pcap but was told they had already
moved on and couldn't get one easily, or some such).

I also thought I remembered someone in that thread was maybe considering
(or maybe just suggesting) adding something like the apple heuristics,
so I was curious if anything ever happened.

-Jake




* Re: [Ecn-sane] IETF 110 quick summary
From: Steven Blake @ 2021-03-09 17:31 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: ECN-Sane

TL;DR: L4S traffic sharing a queue with AQM-Classic ECN will crush non-
L4S traffic.

Thanks, this lines up with my prior understanding (wanted to make sure
I wasn't missing any arguments from the zillions of back-and-forth
emails on the tsvwg list). And I'm glad that at least they appear to
behave correctly in the face of packet discards.

The disaster scenario is that their experiment introduces performance
issues on some unsuspecting operators' networks, causing them to start
bleaching ECN bits.

Their whole safety plan depends on the claim that classic RFC 3168 ECN
is not deployed (except in fq_codel on the edge; who cares? they can
patch their code). If that were the case, it would make more sense for
them to try to move classic ECN to Historic and redefine ECT(0) to
signal L4S traffic (à la DCTCP).

It's also been clear that this is not an effort to conduct an
experiment.


On Tue, 2021-03-09 at 15:53 +0200, Jonathan Morton wrote:
> [...]


Regards,

// Steve






* Re: [Ecn-sane] IETF 110 quick summary
From: Steven Blake @ 2021-03-09 17:50 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: ECN-Sane

On Tue, 2021-03-09 at 12:31 -0500, Steven Blake wrote:

> Their whole safety plan depends on the claim that classic RFC 3168 ECN
> is not deployed (except in fq_codel on the edge; who cares? they can
> patch their code). If that were the case, it would make more sense for
> them to try to move classic ECN to Historic and redefine ECT(0) to
> signal L4S traffic (à la DCTCP).

Actually, that is the ideal outcome. ECT(0) signals ECT-Capable, ECT(1)
and CE signal two levels of congestion. In other words, SCE everywhere.

Maybe that is an argument that you can throw at them: if it is safe to
ignore classic ECN, might as well move straight to SCE, with non-ECT
traffic shunted off to separate queue(s).
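
For clarity, the re-reading of the two ECN bits I have in mind (RFC 3168
names on the left; the right column is my sketch, not draft text):

# ECN field codepoints (RFC 3168 name -> proposed "SCE everywhere" reading)
codepoints = {
    0b00: ("Not-ECT", "not ECN-capable -> shunt to separate queue(s)"),
    0b10: ("ECT(0)",  "ECN-capable, no congestion"),
    0b01: ("ECT(1)",  "SCE: some congestion (fine-grained signal)"),
    0b11: ("CE",      "congestion experienced (hard signal)"),
}
for bits, (name, reading) in codepoints.items():
    print(f"{bits:02b}  {name:8s}{reading}")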


Regards,

// Steve






* Re: [Ecn-sane] IETF 110 quick summary
From: Rodney W. Grimes @ 2021-03-09 18:07 UTC (permalink / raw)
  To: Steven Blake; +Cc: Jonathan Morton, ECN-Sane

> On Tue, 2021-03-09 at 12:31 -0500, Steven Blake wrote:
> 
> > Their whole safety plan depends on the claim that classic RFC 3168 ECN
> > is not deployed (except in fq_codel on the edge; who cares? they can
> > patch their code). If that were the case, it would make more sense for
> > them to try to move classic ECN to Historic and redefine ECT(0) to
> > signal L4S traffic (à la DCTCP).
> 
> Actually, that is the ideal outcome. ECT(0) signals ECT-Capable, ECT(1)
> and CE signal two levels of congestion. In other words, SCE everywhere.
> 
> Maybe that is an argument that you can throw at them: if it is safe to
> ignore classic ECN, might as well move straight to SCE, with non-ECT
> traffic shunted off to separate queue(s).

Would you be willing to float that in front of them?  We have
discussed this internally among Jonathan, Pete and myself, and
it is a viable solution.  And IIRC our discussion concluded that
using ECT(0) to signal ECT or SCE treatment would be rather
low risk.

Right now any time we (SCE) try to float anything it's shot down without
any due consideration or discussion, sadly.

> Regards,
> // Steve

-- 
Rod Grimes                                                 rgrimes@freebsd.org


* Re: [Ecn-sane] IETF 110 quick summary
From: Pete Heist @ 2021-03-09 18:13 UTC (permalink / raw)
  To: Steven Blake; +Cc: ECN-Sane

On Tue, 2021-03-09 at 12:50 -0500, Steven Blake wrote:
> On Tue, 2021-03-09 at 12:31 -0500, Steven Blake wrote:
> 
> > Their whole safety plan depends on the claim that classic RFC 3168 ECN
> > is not deployed (except in fq_codel on the edge; who cares? they can
> > patch their code). If that were the case, it would make more sense for
> > them to try to move classic ECN to Historic and redefine ECT(0) to
> > signal L4S traffic (à la DCTCP).
> 
> Actually, that is the ideal outcome. ECT(0) signals ECT-Capable, ECT(1)
> and CE signal two levels of congestion. In other words, SCE everywhere.
> 
> Maybe that is an argument that you can throw at them: if it is safe to
> ignore classic ECN, might as well move straight to SCE, with non-ECT
> traffic shunted off to separate queue(s).

You've hit on what IMO is a serious inconsistency in section B.5 of the
L4S-ID draft, which at one point explored that option:

-----
B.5.  ECN capability alone

   This approach uses ECN capability alone as the L4S identifier.  It
   would only have been feasible if RFC 3168 ECN had not been widely
   deployed.  This was the case when the choice of L4S identifier was
   being made and this appendix was first written.  Since then, RFC 3168
   ECN has been widely deployed and L4S did not take this approach
   anyway.  So this approach is not discussed further, because it is no
   longer a feasible option.
----

On the one hand, the argument is that 3168 is *not* widely deployed
when it comes to safety with existing AQMs, and on the other hand, it
*is* widely deployed when it comes to selection of the identifier. I
think this finally needs bringing up, maybe tomorrow.

We had a conversation late last year around instead making a
discontinuous upgrade to ECN/SCE by redefining ECT(0) to be the
identifier, and I spent some time thinking about it. It's not without
issues, but I wouldn't mind hearing others' thoughts on it before I
pollute it with mine.

Pete

> Regards,
> 
> // Steve
> 
> 
> 
> 
> _______________________________________________
> Ecn-sane mailing list
> Ecn-sane@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/ecn-sane



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Ecn-sane] IETF 110 quick summary
  2021-03-09 17:50             ` Steven Blake
  2021-03-09 18:07               ` Rodney W. Grimes
  2021-03-09 18:13               ` Pete Heist
@ 2021-03-09 18:44               ` Holland, Jake
  2021-03-09 19:09                 ` Jonathan Morton
  2 siblings, 1 reply; 22+ messages in thread
From: Holland, Jake @ 2021-03-09 18:44 UTC (permalink / raw)
  To: Steven Blake, Jonathan Morton; +Cc: ECN-Sane

On 3/9/21, 9:50 AM, "Steven Blake" <slblake@petri-meat.com> wrote:
> Actually, that is the ideal outcome. ECT(0) signals ECT-Capable, ECT(1)
> and CE signal two levels of congestion. In other words, SCE everywhere.
>
> Maybe that is an argument that you can throw at them: if it is safe to
> ignore classic ECN, might as well move straight to SCE with non-ECT
> traffic shunted off to separate queue(s).

The L4S drafts address this somewhat already, I think.  The main argument
is probably best-articulated in Appendix B.2 of l4s-id:
https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-ecn-l4s-id-14#appendix-B.2

Summarizing (my recollection of) the live discussion: the main
reason it's rejected is that classic ECN traffic will not respond as
quickly as L4S, so in a shared queue with competing classic ECN
traffic you could not get lower latency than classic ECN offers.
That would impede adoption by providing no latency benefit, plus
some throughput penalty, for upgrading.

(Also of note: L4S is trying to target bigger devices upstream
of the home gateway, where flow-aware queuing is less practical, and
also where most of the congestion and buffering delay occurs for
those who are not throttling at their home gateway.)

This was one point where a lot of the people in tsvwg explicitly
expressed that you really do need a classifier to improve on classic
aqm latency, hence the preference for ECT(1)-as-input.  (I think even some
who do not agree it should be deployed due to safety concerns did
agree with this point.)

So I don't expect raising that point would be helpful.

-Jake



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Ecn-sane] IETF 110 quick summary
  2021-03-09 18:44               ` Holland, Jake
@ 2021-03-09 19:09                 ` Jonathan Morton
  2021-03-09 19:27                   ` Holland, Jake
  0 siblings, 1 reply; 22+ messages in thread
From: Jonathan Morton @ 2021-03-09 19:09 UTC (permalink / raw)
  To: Holland, Jake; +Cc: Steven Blake, ECN-Sane

> On 9 Mar, 2021, at 8:44 pm, Holland, Jake <jholland@akamai.com> wrote:
> 
> …classic ECN traffic will not respond as quickly as L4S…

I know it wasn't you making this claim, Jake, but I have to point out that it's completely false.  Classic ECN transports actually respond *more* quickly to a CE mark than L4S transports.

Let's walk through the processes.


RFC-3168 TCP:

A single CE mark is applied to a data segment.
The receiver immediately sends an ACK with ECE set, and keeps ECE set on all further ACKs until a CWR cancels it.
The sender gets the ECE, reduces the cwnd immediately, and sends the next data segment with CWR set to confirm it.
Proportional Rate Reduction may be used to spread out the reduction in actual in-flight data.  This takes at most one RTT.

From the queue's perspective, one RTT (the minimum possible) elapses before the arrival rate from the sender halves (due to PRR).  After two RTTs maximum, the in-flight data has reached a new, substantially lower value than the original.


L4S TCP (Prague):

A single CE mark is applied to a data segment.
The receiver updates the CE mark counter in the next ACK.
The sender sees the new counter value, and feeds it into a low-pass filter which operates on discrete time intervals.
When the filter is next processed, on average a single CE mark results in half a segment being removed from the cwnd.  Half the time, this results in no externally visible change to the data in flight.  The other half, it is a very slight response.

From the queue's perspective, one RTT plus (on average) half the filter window passes before any possible response reaches it; half the time there is no response anyway, and the other half only a single-segment reduction.  Meanwhile, at least one segment has been added to the cwnd in the RTT-plus time since the mark was applied (due to Reno-style growth).


I do not see how TCP Prague's response can be described as "faster" than that of standard TCP.
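
To put rough numbers on it, here's a toy model (my own sketch, not Prague's actual code; I assume a DCTCP-style alpha EWMA with gain g = 1/16 at a steady state of about one mark per 100-segment window):

    # Toy comparison of cwnd response to ONE CE mark (illustrative).
    def classic_response(cwnd):
        # RFC 3168: multiplicative decrease; PRR spreads it out,
        # but the full halving lands within about one RTT.
        return cwnd / 2

    def prague_response(cwnd, alpha, marked, acked, g=1.0 / 16):
        # DCTCP-style: fold this interval's mark fraction into the
        # EWMA, then shrink cwnd proportionally once per interval.
        alpha = (1 - g) * alpha + g * (marked / acked)
        return cwnd * (1 - alpha / 2), alpha

    cwnd = 100.0                        # segments
    print(classic_response(cwnd))       # 50.0 -- large and prompt
    cwnd2, _ = prague_response(cwnd, alpha=0.01, marked=1, acked=100)
    print(cwnd2)                        # 99.5 -- half a segment

The constants are assumptions, but the orders of magnitude are the point: a factor-of-two response versus a half-segment one.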

 - Jonathan Morton

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Ecn-sane] IETF 110 quick summary
  2021-03-09 19:09                 ` Jonathan Morton
@ 2021-03-09 19:27                   ` Holland, Jake
  2021-03-09 19:42                     ` Jonathan Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Holland, Jake @ 2021-03-09 19:27 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: Steven Blake, ECN-Sane

Sorry Jonathan, I think I didn't convey some context properly...

On 3/9/21, 11:09 AM, "Jonathan Morton" <chromatix99@gmail.com> wrote:
>> On 9 Mar, 2021, at 8:44 pm, Holland, Jake <jholland@akamai.com> wrote:
>> 
>> …classic ECN traffic will not respond as quickly as L4S…
>
>I know it wasn't you making this claim, Jake, but I have to point out that it's completely false.  Classic ECN transports actually respond *more* quickly to a CE mark than L4S transports.

Here I meant to talk about an SCE-style low-congestion signal (in
either 1->0 or 0->1 direction), which would be ignored by a classic
endpoint but which a high-fidelity endpoint would respond to.

So I'm not referring to a CE mark here, but rather an SCE mark, as
I thought Steve was proposing with this bit:

>> Maybe that is an argument that you can throw at them: if it is safe to
>> ignore classic ECN, might as well move straight to SCE with non-ECT
>> traffic shunted off to separate queue(s).

Sorry for any confusion there; I'm not in favor of talking past each
other, and I think we probably agree here if I've understood correctly.

What I was trying to say is that an SCE response (specifically
including an L4S-using-SCE response, though I think you had some
intriguing alternate ideas to reduce the effect) would be faster
than a classic response that ignores SCE and waits for a CE.

I do agree with your explanation that a classic CC responds faster to
a CE mark than TCP Prague, that's just not what I was trying to talk
about.

-Jake



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Ecn-sane] IETF 110 quick summary
  2021-03-09 19:27                   ` Holland, Jake
@ 2021-03-09 19:42                     ` Jonathan Morton
  0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Morton @ 2021-03-09 19:42 UTC (permalink / raw)
  To: Holland, Jake; +Cc: Steven Blake, ECN-Sane

> On 9 Mar, 2021, at 9:27 pm, Holland, Jake <jholland@akamai.com> wrote:
> 
>>> …classic ECN traffic will not respond as quickly as L4S…
>> 
>> I know it wasn't you making this claim, Jake, but I have to point out that it's completely false.  Classic ECN transports actually respond *more* quickly to a CE mark than L4S transports.
> 
> Here I meant to talk about an SCE-style low-congestion signal (in
> either 1->0 or 0->1 direction), which would be ignored by a classic
> endpoint but which a high-fidelity endpoint would respond to.
> 
> So I'm not referring to a CE mark here, but rather an SCE mark, as
> I thought Steve was proposing with this bit:
> 
>>> Maybe that is an argument that you can throw at them: if it is safe to
>>> ignore classic ECN, might as well move straight to SCE with non-ECT
>>> traffic shunted off to separate queue(s).
> 
> Sorry for any confusion there, I'm not in favor of talking past each
> other and I think we probably agree here if I've understood correctly.
> 
> What I was trying to say is that an SCE response (specifically
> including an L4S-using-SCE response, though I think you had some
> intriguing alternate ideas to reduce the effect) would be faster
> than a classic response that ignores SCE and waits for a CE.

Okay, that does make more sense.  I probably wouldn't use "faster" or "not as quickly" to describe that, however.  Such a description only makes sense if you presuppose a queue depth that rises monotonically over time.

AIMD and HFCC (high-fidelity congestion control) responses do tend to need different operating points to work efficiently.  HFCC can settle on a steady-state cwnd that is quite close to the true BDP.  AIMD needs the peak queue depth to be significantly higher, to accommodate the deep sawtooth without losing too much goodput.  So it entirely makes sense to set the thresholds for the two types of signalling accordingly.
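
For a rough sense of scale, here's a back-of-envelope sketch (mine; the link numbers are illustrative assumptions, not measurements):

    # 50 Mbit/s bottleneck, 20 ms base RTT, 1500-byte packets.
    rate_pps = 50e6 / (1500 * 8)      # ~4167 packets/s
    bdp = rate_pps * 0.020            # ~83 packets in flight needed

    # AIMD sawtooth: cwnd swings between W/2 and W.  Keeping the
    # link busy at the trough needs W/2 >= BDP, so the peak standing
    # queue is about one full BDP (~83 pkts, ~20 ms extra delay).
    aimd_peak_queue = bdp

    # HFCC: steady-state cwnd sits near BDP plus a shallow marking
    # threshold (say 1 ms), so the standing queue stays tiny.
    hfcc_queue = rate_pps * 0.001     # ~4 packets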

> I do agree with your explanation that a classic CC responds faster to
> a CE mark than TCP Prague, that's just not what I was trying to talk
> about.

Sure.  But the phrasing sounded so much like arguments that have indeed come from the L4S team - I'm sure you remember all the marketing BS that had to be cut out of their drafts, and there's still a lot of stuff there which I think is not supported (at best) by the data.

 - Jonathan Morton

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Ecn-sane] IETF 110 quick summary
  2021-03-09 18:13               ` Pete Heist
@ 2021-03-09 19:51                 ` Holland, Jake
  2021-03-09 20:53                   ` Pete Heist
  0 siblings, 1 reply; 22+ messages in thread
From: Holland, Jake @ 2021-03-09 19:51 UTC (permalink / raw)
  To: Pete Heist, Steven Blake; +Cc: ECN-Sane

On 3/9/21, 10:13 AM, "Pete Heist" <pete@heistp.net> wrote:
> On the one hand, the argument is that 3168 is *not* widely deployed
> when it comes to safety with existing AQMs, and on the other hand, it
> *is* widely deployed when it comes to selection of the identifier. I
> think this finally needs bringing up, maybe tomorrow.

I think they rephrased section B.2 to match up with this.

Although B.5 probably does need some editorial work, I think the
technical explanation is mostly the same as what's covered in B.2,
so bringing this up probably has limited utility.

I won't deny that there's some weird shifting of the purported
reasoning behind stable conclusions, but I'll suggest that IMHO
you're better off keeping the focus on the real crux of the issue,
which I think is correctly articulated as harm to bystanders by
deploying a new codepoint assignment for ECT(1) without first proving
it can be used effectively without harm by most traffic under the
prior meaning of that codepoint.

(I'm less sure about the tunnels, which seem to be considered both
so common that FQ can't address their latency and also ignorable wrt
harm from sharing classic 3168 and TCP Prague traffic.  Raising
this point might at least bring them around on the idea that tunnels
could be split by flows when it's useful, but probably also has
limited utility overall.)

> We had a conversation late last year around instead making a
> discontinuous upgrade to ECN/SCE by redefining ECT(0) to be the
> identifier, and I spent some time thinking about it. It's not without
> issues, but I wouldn't mind hearing others' thoughts on it before I
> pollute it with mine.

They did at least update the draft to speak to this point in
l4s-id B.3.  I think the biggest objection on their side was that it's
not a good classifier with chained aqms, and this problem gets worse
as deployment increases.

I still kinda like it as the least harmful, mostly only helpful
option (assuming endpoints that negotiate support will also do
better, RACK-like handling of reordering, and switches will stop
trying to prevent it).  While it doesn't provide a great classifier,
it at least provides a crappy one that doesn't hurt that much when
you're wrong.
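
To spell out why it's crappy but mostly benign (a sketch of mine,
not from the draft): during the transition, legacy RFC 3168 senders
also emit ECT(0), so the classifier can only guess.

    def classify(ecn_codepoint):
        # "ECT(0) as identifier": right for upgraded flows, wrong
        # for legacy 3168 flows that also set ECT(0).
        if ecn_codepoint == "ECT(0)":
            return "low-latency queue"
        return "classic queue"

The saving grace is that a legacy flow misclassified into the
low-latency queue still receives CE marks it understands, so the
failure mode is a throughput hit rather than starvation.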

-Jake



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Ecn-sane] IETF 110 quick summary
  2021-03-09 19:51                 ` Holland, Jake
@ 2021-03-09 20:53                   ` Pete Heist
  0 siblings, 0 replies; 22+ messages in thread
From: Pete Heist @ 2021-03-09 20:53 UTC (permalink / raw)
  To: Holland, Jake; +Cc: ECN-Sane

On Tue, 2021-03-09 at 19:51 +0000, Holland, Jake wrote:
> On 3/9/21, 10:13 AM, "Pete Heist" <pete@heistp.net> wrote:
> > On the one hand, the argument is that 3168 is *not* widely deployed
> > when it comes to safety with existing AQMs, and on the other hand,
> > it
> > *is* widely deployed when it comes to selection of the identifier.
> > I
> > think this finally needs bringing up, maybe tomorrow.
> 
> I think they rephrased section B.2 to match up with this.
> 
> Although B.5 probably does need some editorial work, I think the
> technical explanation is mostly the same as what's covered in B.2,
> so bringing this up probably has limited utility.

OK, I'll trust that. I think they mainly meant there that they didn't
want Apple devices polluting their greenfield L queue.

> I won't deny that there's some weird shifting of the purported
> reasoning behind stable conclusions, but I'll suggest that IMHO
> you're better off keeping the focus on the real crux of the issue,
> which I think is correctly articulated as harm to bystanders by
> deploying a new codepoint assignment for ECT(1) without first proving
> it can be used effectively without harm by most traffic under the
> prior meaning of that codepoint.
> 
> (I'm less sure about the tunnels, which seem to be considered both
> so common that FQ can't address their latency and also ignorable wrt
> harm from sharing classic 3168 and TCP Prague traffic.  Raising
> this point might at least bring them around on the idea that tunnels
> could be split by flows when it's useful, but probably also has
> limited utility overall.)

These are good points. It's true that when we've tried to present
arguments in the past that waver from these fundamental safety
issues, they've almost never landed, and they end up being better
left unsaid, or just written on the list.

> > We had a conversation late last year around instead making a
> > discontinuous upgrade to ECN/SCE by redefining ECT(0) to be the
> > identifier, and I spent some time thinking about it. It's not
> > without
> > issues, but I wouldn't mind hearing others' thoughts on it before I
> > pollute it with mine.
> 
> They did at least update the draft to speak to this point in
> l4s-id B.3.  I think the biggest objection on their side was that
> it's
> not a good classifier with chained aqms, and this problem gets worse
> as deployment increases.
> 
> I still kinda like it as the least harmful, mostly only helpful
> option (assuming endpoints who negotiate support will also do better
> RACK-like support for reordering and switches will stop trying to do
> it).  While it doesn't provide a great classifier, it at least
> provides a crappy one that doesn't hurt that much when you're wrong.

By now I think of that idea as B.Jake. While I understood their
loss-of-classification argument, it's a definite improvement on flow
starvation. :)

> -Jake
> 
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-03-09 20:53 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-08 23:47 [Ecn-sane] IETF 110 quick summary Pete Heist
2021-03-08 23:57 ` Dave Taht
2021-03-09  2:13   ` Holland, Jake
2021-03-09  4:06     ` Steven Blake
2021-03-09  9:57       ` Pete Heist
2021-03-09 13:53         ` Jonathan Morton
2021-03-09 14:27           ` Sebastian Moeller
2021-03-09 14:35             ` Dave Taht
2021-03-09 17:31           ` Steven Blake
2021-03-09 17:50             ` Steven Blake
2021-03-09 18:07               ` Rodney W. Grimes
2021-03-09 18:13               ` Pete Heist
2021-03-09 19:51                 ` Holland, Jake
2021-03-09 20:53                   ` Pete Heist
2021-03-09 18:44               ` Holland, Jake
2021-03-09 19:09                 ` Jonathan Morton
2021-03-09 19:27                   ` Holland, Jake
2021-03-09 19:42                     ` Jonathan Morton
2021-03-09  8:43     ` Pete Heist
2021-03-09 15:57       ` Holland, Jake
2021-03-09 11:06     ` Jonathan Morton
2021-03-09  8:21   ` Pete Heist
