[Cerowrt-devel] please kill the ECN thread from hell here and take it to aqm

Dave Taht dave.taht at gmail.com
Sun Mar 22 10:49:22 EDT 2015


On Sat, Mar 21, 2015 at 10:28 PM, Michael Welzl <michawe at ifi.uio.no> wrote:
>
>> On 21. mar. 2015, at 23.57, Dave Taht <dave.taht at gmail.com> wrote:
>>
>> On Fri, Mar 20, 2015 at 5:15 PM, Michael Welzl <michawe at ifi.uio.no> wrote:
>>
>>> I think it's about time we finally turn it [ecn] on in the real world.
>>
>> Please start with turning it on as fully as possible on *your* networks.
>>
>> Advocacy with *actual experience* I approve of, otherwise it's just
>> religion and a rathole.
>
> We, and others, have done many tests with it.

Try this as a new test.

Get up from your desk(s), find whatever dimly lit basement cubbyhole
your org's system administrators live in. Take one to lunch, feed
him/her sake and sushi, ask about their top 3 problems, and find a way
to slide in talk about the need for aqm and ecn. aqm and ecn are
probably not anywhere near their top 100, *but* with enough sake, you
might get one enthusiastic enough about these issues to go do a test
on a production network. If not, take a different sysadmin out to
lunch the next day, and try again.

If you can get a little interest on the fq/aqm/ecn fronts, hover over
a laptop, help parse (somehow) the confusing information available
from the cisco and juniper manuals, and get the sysadmin to turn aqm +
ecn on in a few places. (and then publish somewhere how to do it
"right").

And I don't mind at all if you can also convince them into using the
single sysctl needed to make pie, fq_codel, codel etc the default
qdisc in any modern linux, nor if they set ecn=1 also.

I keep this simple 2-line file on every debian/ubuntu linux system I have:

root@ranger:~#  cat /etc/sysctl.d/5-bufferbloat.conf

net.core.default_qdisc=fq_codel
net.ipv4.tcp_ecn=1
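
A quick way to check whether a given sysctl.d file carries those two lines is sketched below in Python. This is only an illustration: the function names `parse_sysctl_conf` and `missing_settings` are made up for this example, and the parser handles just plain `key=value` lines plus `#` and `;` comments, which is an assumption about how simple the file is.

```python
# Hypothetical helper names; only the two sysctl keys come from the mail above.
WANTED = {
    "net.core.default_qdisc": "fq_codel",
    "net.ipv4.tcp_ecn": "1",
}


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no '=' at all
            settings[key.strip()] = value.strip()
    return settings


def missing_settings(text, wanted=WANTED):
    """Return the wanted settings that are absent or set to another value."""
    found = parse_sysctl_conf(text)
    return {k: v for k, v in wanted.items() if found.get(k) != v}


EXAMPLE = "net.core.default_qdisc=fq_codel\nnet.ipv4.tcp_ecn=1\n"

if __name__ == "__main__":
    # An empty dict means both lines are present with the wanted values.
    print(missing_settings(EXAMPLE))
```

On a real machine one would feed it the contents of the file shown above rather than the `EXAMPLE` string.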

Note: It took me a truly astounding number of attempts, and alcohol, to
find one major provider to distribute those two lines to all their
machines. I have totally failed so far at finding a good conf for their
switches.

Your time would be better spent that way rather than writing
manifestos on mailing lists, and mine, on not reading them.

Furthermore, to get trustable data that other sysadmins might believe:

Ask for access to the mrtg/cacti loss statistics that any competent
sysadmin usually has around, and also help him/her deploy
smokeping (far too few have). And publish before/after data from that
on nanog, and places like that, not here.
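
As a rough idea of what a before/after comparison could look like once the latency samples are exported, here is a minimal Python sketch. The RTT numbers are invented purely to exercise the code, the function names are hypothetical, and how the samples get out of smokeping is left open.

```python
import statistics


def summarize(rtts_ms):
    """Median, nearest-rank 95th percentile, and max of RTT samples (ms)."""
    ordered = sorted(rtts_ms)
    # Nearest-rank percentile: rank = ceil(0.95 * n), done in integer math.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def compare(before_ms, after_ms):
    """Change in each summary statistic; negative means latency went down."""
    before, after = summarize(before_ms), summarize(after_ms)
    return {name: after[name] - before[name] for name in before}


# Made-up samples, NOT measurements from any real network.
BEFORE = [20, 22, 25, 30, 40, 60, 90, 150, 200, 400]
AFTER = [18, 19, 20, 20, 21, 22, 23, 25, 28, 35]

if __name__ == "__main__":
    print(compare(BEFORE, AFTER))
```

Reporting the tail (p95, max) next to the median matters here, since queueing delay tends to show up in the tail long before it moves the median.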

And see what happens.


>
>> I have only once been so tempted to shut a thread down on these email
>> lists. (the other time was a near-discussion of systemd)
>>
>> This thread started off usefully discussing the docsis 3.1 deployment
>> and other deployment issues and if I could invoke godwins law on ecn,
>> I would. Hell, let me try that. Only a nazi would inflict such a
>> controversial technology on others without comprehensively trying it
>> themselves on all their own traffic. For years.
>>
>> The ecn debate is a 21 year old bikeshed from hell.  There is no
>> comprehensive data from actual deployments.
>>
>> Start with getting some from yours! And from whoever else you can
>> convince to try it, at scale, and not in manifestos that have so far
>> as I can tell, a multiplicity of false premises and wishful thinking,
>> not backed by any operational experience with the actual code
>> available.
>
> The paper I mentioned in this thread used actual code on actual machines  :-)
>
>
>> Go ahead, convince your org(s) to deploy it, get everyone using your
>> network to use it on every operating system available, have meet ups
>> for every new student entering your uni to turn it on, have a black
>> hat take the existing aqm algs apart, and then write a document
>> describing those experiences. *Then* write the rfc.
>>
>> *I* deployed it. I gave my feedback already.
>>
>> My conclusions 3 years back were:
>>
>> 1) ECN is safe to deploy given the bottleneck links had fq + aqm
>> w/ecn, and the links were high enough bandwidth to not slow other
>> traffic. But at lower rates, it clears congestion fastest, and uses
>> less memory, to drop packets.
>>
>> 2) It might be safe in limited well controlled environments (e.g. in
>> data center, and especially in long RTT environments in space or
>> satellites), but a wide range of testing on actual traffic mixes on
>> things like DCTCP - and what happens when things like DCTCP
>> accidentally escape the datacenter needs to also be carefully
>> evaluated.
>>
>> 3) It is not safe to deploy on the wild and wooly internet with any of
>> the pure aqm algorithms currently available.
>
> You saw all these three things in your home by watching ECN traffic with wireshark, or what?

I have, or have had, multiple labs around the world in the last 4 years.

Most recently we ran an extensive series of tests of ecn vs non-ecn
traffic on toke's testbed, but since the first paper from that series
has not been publishable as yet, the natural followon papers
referencing that enormous data set are un-done.

I documented the other sites and places in a prior email.

The biggest real-world deployment I have actual fq_codel + ecn data
from thus far is archive.org, the 200th largest web site in the world.

The second is the enabled-by-default fq_codel with ecn implementation
in openwrt and the derivatives and the increasingly popular
sqm-scripts and ubnt edgerouter implementations.

What do you got?

In all cases ecn availability or not is a trivial tiny part of the
relevant statistics. In order of importance to low latency and high
throughput come fq, then aqm, then ecn.

> 1) might be something you can "experience". The figure in the paper I mentioned earlier in the thread shows that it's inconsistently sometimes better, sometimes worse if all you look at is the queue (as opposed to the actual end-to-end delay at the application layer, which includes head-of-line blocking).

You persist in not recognising that other forms of latency sensitive
traffic exist, like dns, gaming, arp, voip, videoconferencing, etc.


>> I have seen no data to come close to challenging these conclusions,
>> nor tests, nor deployments, and until that happens...
>
> Challenging conclusion 1: figures 13 and 14 in https://www.duo.uio.no/bitstream/handle/10852/37381/khademi-AQM_Kids_TR434.pdf?sequence=5&isAllowed=y
> Note that these diagrams show a distribution, not a single test somewhere. Many tests, using real equipment and the actual totally real code.
> In cases with only one flow for example, goodput was slightly lower with ECN, meaning: less packets, less congestion, less memory usage.

I have listed my many issues with that paper elsewhere on many other
threads, and asked that tests (particularly of cross traffic) be done,
and that the promise in the paper of a follow-up be followed up. It's
been several years, now. Where's the sequel?

Toke's paper at iccrg, stanford, and google also showed the issues
ARED was actually having, which decimated a few of the
conclusions therein.

> 2) is an assumption, talks about DCTCP and has nothing to do with the current discussion.

It is a testable assumption.

> 3) is also an assumption. Why "not safe to deploy"? Again, did you see this on your own home connection with wireshark, or how can you conclude that?

I saw that with a mark setpoint below the drop setpoint, ecn traffic
lost out to drop traffic, and with one above the drop setpoint, drop
traffic lost out to ecn traffic.

In roughly a zillion tests, and in about 60 different attempts at
finding a setpoint above or below in revisions to the codel algorithm.

What I settled on as a set of fixes for (fq)codel to be
comprehensively evaluated this summer was

A) overload protection based on a somewhat arbitrary choice of
fraction of packet limit (still a setpoint greater than the drop
point)
B) When there was a drop (for any reason), the very next packet is
marked if possible.

This seems to give harder to game results, although I have not put it
through a comprehensive set of attacks built around the isoburst and
isoping tools in gfiber's repository, and the suite of aqm attack
tools that so far as I know nobody has run besides me, here:

https://sites.google.com/site/cwzhangres/home/posts/aqmdossimulationplatform
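
One toy reading of fixes A and B above, in Python. This is a sketch under stated assumptions, not the actual (fq_)codel patch: the names `ToyQueueState` and `decide`, the `overload_fraction` parameter, and the choice to drop every packet while overloaded are all assumptions made for illustration, and the real codel control law is reduced to a single `aqm_signal` flag.

```python
from dataclasses import dataclass


@dataclass
class ToyQueueState:
    packet_limit: int          # hard queue limit, in packets
    overload_fraction: float   # hypothetical fraction of the limit (fix A)
    mark_next: bool = False    # set after any drop (fix B)


def decide(state, queue_len, ect, aqm_signal):
    """Return 'pass', 'mark', or 'drop' for one packet.

    ect:        the packet is ECN-capable
    aqm_signal: the AQM wants to signal congestion on this packet
    """
    # Fix B bookkeeping: the flag applies to the very next packet only.
    follow_up = state.mark_next
    state.mark_next = False

    # Fix A (assumed reading): past a fraction of the packet limit,
    # stop marking and drop, even for ECN-capable packets.
    overloaded = queue_len > state.packet_limit * state.overload_fraction

    if overloaded or (aqm_signal and not ect):
        state.mark_next = True   # any drop arms the follow-up mark
        return "drop"
    if ect and (aqm_signal or follow_up):
        return "mark"            # mark "if possible", i.e. only when ECT
    return "pass"
```

In this reading a drop of a non-ECN packet is followed by a mark on the next ECN-capable packet, so neither class of traffic sees a congestion signal that the other escapes.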

All this code has been publicly available for a long time now, but due
to so many changes in TCP in general -  toke's testbed, and mine, and
my stuff in the cloud, needed a major update in order to be relevant,
and I've always kind of hoped to get others to actually take what
we've made available over the years, like ns3 patches,
netperf-wrapper, linux itself, and these patches and beat it up in
ways that we've never imagined...

... and to also publish their own tools, and raw data, so that others
could inspect and run them.

Besides A and B there are a few more speculative things left to
explore on the fq_codel (cake) front, which include something more
byte-mode-like for cross traffic, a smarter/saner resumption portion
of the codel algorithm, and some thoughts towards ack thinning
particularly on the wifi front.

> In terms of packets successfully passing through the Internet, a number of recent measurement papers generally find that the situation has significantly improved. Perhaps the most recent one: http://www.ict-mplane.eu/sites/default/files/public/publications/311ecndeployment.pdf
>
> This is only one paper; lots of data does exist.

Get some sysadmins drunk enough to actually try it in production.

>
>
>> ...In addition to getting off the aqm mailing list, I am now sending
>> anything I get with the evil " ecn " in it directly to /dev/null  -
>> until prague or I get way better BP meds. I got way better things to
>> do.
>
> Apologies for taking the liberty to still answer your email  :-)

My principal complaint was the rathole a formerly useful thread on
these lists went into. One day I had 10 useful messages in that
thread, two days later I had 51, and I really don't have time right
now for endless arguments cycled repeatedly on things far more trivial
than *the need to solve bufferbloat by getting fq and aqm technologies
into deployable states*,

and my highest priority right now is to go get wifi fixed before it
starts melting down for more people than it already has.

Every day we waste on these mailing lists is several million more
machines with lousy wifi stacks, deploying.

I do apologize for the tenor of my last 2 mails, I was running really
low on sleep. That said, I am filtering out and unsubscribing from aqm
til prague. Got some coding to do.

>
> Cheers,
> Michael
>



-- 
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb


