[Bloat] quick review and rant of "Identifying and Handling Non Queue Building Flows in a Bottleneck Link"

Dave Taht dave at taht.net
Mon Oct 29 00:02:45 EDT 2018


Dear Greg:

I don't feel like commenting much on ietf matters these days
but, jeeze, I'm really tired of rehashed arguments based on bad papers
that aren't even cited in the document and that aren't talking about the
real problem.

I get that some magical solution to a non-problem is wanted here:

https://tools.ietf.org/id/draft-white-tsvwg-nqb-00.txt

but then I just read this part... and had to cringe and respond.

   Flow queueing approaches (such as fq_codel RFC 8290 [RFC8290]), on
   the other hand, achieve latency improvements by associating packets
   into "flow" queues and then prioritizing "sparse flows", i.e. packets
   that arrive to an empty flow queue.  Flow queueing does not attempt
   to differentiate between flows on the basis of value (importance or
   latency-sensitivity), it simply gives preference to sparse flows, and
   tries to guarantee that the non-sparse flows all get an equal share

Nope, that's not what it does. Please strike the sentence "it simply
gives preference... " and replace with something like "all flows
get as fully mixed as possible."

It's just slightly better at min/max fairness than most other fair
queuing mechanisms.

Better:

   Flow queueing approaches (such as fq_codel RFC 8290 [RFC8290]), on
   the other hand, achieve latency improvements by breaking flows back
   into individual packets, and then prioritizing "sparse flows", i.e. packets
   that arrive to an empty flow queue that empties every round.

   Flow queueing does not attempt to differentiate between flows on the
   basis of value (importance or latency-sensitivity), it simply
   interleaves flows as best as possible, giving a slight preference to
   sparse flows.
   
   As a result, fq mechanisms are appropriate for unmanaged environments
   and general internet traffic.

Editorial:

Are. Have. A full deployment is basically done now. fq_codel is
universally "on" in most Linuxes, with sch_fq and pacing filling in the
rest. An FQ technique is a near-universal default in the cloud now. It's
deployed in the tens of millions on commercial Linux APs. fq_codel is
the default in OSX wifi. It's now in FreeBSD, and there was a great
thread recently on how it all works in pfSense here:
https://forum.netgate.com/topic/112527/playing-with-fq_codel-in-2-4/709
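
Don't take my word for it; it's trivial to check what a given Linux box
is actually running (a sketch, with eth0 standing in for your
interface):

sysctl net.core.default_qdisc
tc -s qdisc show dev eth0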

Got any users for this other stuff yet?

   Downsides to this approach can include loss of low latency performance
   due to hash collisions (where a sparse flow shares a queue with a
   bulk data flow)

sch_fq has *no* buckets. It's *perfectly* fair to millions of TCP flows.

fq_codel's hash collision probability follows the birthday problem:
collisions only become likely once the number of concurrent flows
approaches sqrt(buckets), and the default of 1024 buckets (roughly 32
flows) has thus far been pretty good. When you combine that with the
probability of there being a fat flow to collide with (call it 3% of all
flows?), the real-world impact of these collisions is nearly
immeasurable against normal traffic.
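
And the table size is tunable anyway. If collisions still bother you,
something like this bumps it (eth0 again being just an example), at a
modest memory cost:

tc qdisc replace dev eth0 root fq_codel flows 4096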

Versus a single queue!!! Which imposes the queue length from one fat
flow on *all* flows. The probability of that happening is 100%.

Jeeze. Describing a "loss of low latency performance" this way is a 99%
lie versus what you are comparing against, and the reason I'm ranting
today is that I've encountered this lie a dozen times in a dozen
documents, as if repetition made it true!

NOW...

I NOTE HOWEVER: We happen to agree that one collision in a few hundred,
in the case of something you really, really care about, is too much. So
we added not only 8-way set associative hashing to sch_cake (collision
probability of roughly zero), but full support for diffserv
classification, and it's shipping in Linux 4.19 (it's been available for
two years in OpenWrt, ubnt, evenroute, and elsewhere). It also has great
support for DOCSIS framing.

I sure hope more people outside the ietf try it. Oh yeah, we also
defaulted sch_cake to per-host FQ (running simultaneously with the
per-flow FQ), which works even through NAT, and that does a great job of
giving low latency to all those IoT devices you might have... It keeps
torrents even more under control...

https://kernelnewbies.org/Linux_4.19#Better_networking_experience_with_the_CAKE_queue_management_algorithm
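
A hedged sketch of what that looks like on the egress side (the device
name and rate here are just examples; dual-dsthost is the analogous
keyword for the download direction):

tc qdisc replace dev eth0 root cake bandwidth 4mbit diffserv4 dual-srchost nat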

Anyway, observed use of diffserv markings is nearly nil, and Comcast
remarks all traffic it doesn't recognise to CS1, so we wash that clean
on inbound.

And most of the time, at real bandwidths, fq_codel more than suffices,
except when you need to toss off a one-liner for a shaper, like

tc qdisc add dev eth0 root cake docsis bandwidth 20mbit ack-filter

   "complexity in managing a large number of queues"

fq_codel is, um, a couple hundred lines of code. Trivial compared to
just about anything real. TCP is a few thousand lines, for example. It's
smaller than most device drivers. The core of the algorithm is 20 lines.

(cake is more complicated)

Please stop calling it complicated. The codebase for nearly any queuing
system is going to be within 10-20% of the others' in size...

THE REAL ELEPHANT IN THE ROOM is SHAPING.

... the 680ms of totally gratuitous buffering Comcast CMTSes have at
100mbit, and 280ms at 10mbit up. Having to shape outbound is not a
problem with any hardware we have deployed, even with CPUs designed in
the late 80s. Inbound shaping of the "Badwidth" being provided by ISPs
today is CPU intensive, easily 95% of the overall cost and complexity of
doing queuing "right". Can we have a discussion on fixing SHAPING
somewhere in the ietf? Can we get ISPs to, at the very least, buffer no
more than 85ms of data?
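
In the meantime, the inbound workaround looks roughly like this: an ifb
redirect plus a cake instance set a bit below the advertised rate (all
device names and numbers below are illustrative assumptions):

ip link add ifb4eth0 type ifb
ip link set ifb4eth0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall action mirred egress redirect dev ifb4eth0
tc qdisc add dev ifb4eth0 root cake bandwidth 90mbit besteffort ingress wash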

   "and
   the scheduling (typically DRR) that enforces that each non-sparse
   flow gets an equal fraction of link bandwidth causes problems with
   VPNs and other tunnels"

strike "non-sparse" and say "saturating". The problem is MOST VPN
traffic is not saturating either, and when it is, it's usually managed
by a lousy FIFO queue in the first place.

A few years back, fq_codel gained the ability to also hash the flows
inside a terminating tunnel, with really nice results for VoIP in
particular. I can't find the test right now, but we observed peak
latency and jitter in the 100ms range before doing that, and 2ms after.

Sure, in contrast to other saturating flows outside the tunnel, you
might get less bandwidth, but all the VPN bandwidth you do get is
goodwidth. For all traffic.

And it's per VPN tunnel. These days I see one VPN per user...
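
If you terminate the tunnel yourself, you can get much of the same
effect by hanging fq_codel off the tunnel device, so the inner flows get
mixed before encapsulation. A sketch, with tun0 standing in for whatever
your tunnel interface is called:

tc qdisc replace dev tun0 root fq_codel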

Did I rant already that the vast majority of flows are non-saturating?

I really wish that, just once, one day, the l4s effort would actually
try real traffic in a real scenario, with web, VoIP, DNS, gaming. I've
reached the point where I just call crazy conclusions based on crazy
assumptions "big buck bunny scenarios".

Start with an hour's packet capture of your home gateway's normal
traffic. Count the number of distinct flows... or your office gateway's...
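
Something like this is enough to get the count (tcpdump and tshark
assumed; eth0 and the field list are merely illustrative):

timeout 3600 tcpdump -ni eth0 -w hour.pcap
tshark -r hour.pcap -T fields -e ip.src -e ip.dst -e ip.proto \
  -e tcp.srcport -e tcp.dstport -e udp.srcport -e udp.dstport | sort -u | wc -l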

  , exhibits poor behavior with less-aggressive
   CA algos, e.g.  LEDBAT,  and exhibits poor behavior with RMCAT CA
   algos.

I would not call it poor. I would call it undesirable based on the
intent of the algorithm's designers. HOWEVER:

LEDBAT advocates think that imposing 100ms of queuing delay on all flows
is *ok*. To me, that's *poor*. In fact - *nuts*.

https://perso.telecom-paristech.fr/drossi/paper/rossi14comnet-b.pdf

I really wish I hadn't been outnumbered on writing that paper, and
could release a sequel to it based on real-world results...

'cause fq_codel users run bittorrent all day long and don't notice it's
there. Every freaking day. I'm running 10 torrents right now. Never
notice. I knew when we wrote that paper that we'd obsolete LPCC's
brain-damaged 100ms-induced-latency concept as a "good" method...
*entirely*, and we did, and *WE DON'T CARE*. You shouldn't either.

Try running torrents all day on cable without an inbound fq_codel
shaper. Just try it. Choose.
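
If you want numbers rather than anecdote, run a latency-under-load test
while the torrents are going. A sketch, assuming you have flent and a
netperf server of your own to point it at (the hostname below is a
placeholder):

flent rrul -p all_scaled -l 60 -H netperf.example.com -t "with torrents" -o rrul.png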

* "Exhibits poor behavior with rmcat CA algos."

The relevant paper for this was terrible: it was based on an artificial
benchmark that didn't model real traffic, it was run at 2Mbit to boot,
and it was written by a very biased observer.

Videoconferencing traffic works GREAT with fq_codel at higher bandwidths
against all forms of real traffic. 20mbit down/4mbit up and above.

2Mbit is just always going to suck for videoconferencing. Above that...

To quote VJ: "low rate videoconferencing never gets dropped"

Which is really the measured case, in real traffic, on real networks,
doing videoconferencing. Try it.

"   In effect the network element is making a decision as to what
   constitutes a flow, and then forcing all such flows to take equal
   bandwidth at every instant."

Still wouldn't say it that way. It's maximizing entropy that matters.
Microbursts get ripped out. Jitter vanishes. Only fat flows see "equal
bandwidth".

My biggest bugaboo with the l4s thing is that somehow y'all think that
traffic will arrive already automagically dispersed in statistically
random ways, and that does not happen. Ever. Traffic is naturally bursty
and needs to be broken up.

Anyway, this rant is somewhat misdirected. I do hope some magical way
of identifying sparser flows appears, other than FQ, which already does
so beautifully, but until then I really wish more people would repeat
less bullshit, go actually use the stuff they are dissing, and
understand it more deeply.

It certainly is possible to do better entropy e2e than we do.

There have certainly been wonderful things with fq + pacing of late, and
I look forward to things like the new etf (earliest txtime first)
scheduler and Intel hardware support for timer wheels being widely
deployed.
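
For completeness, turning on sch_fq (and thus pacing for locally
generated TCP) is also just a line or two, eth0 again being an example:

sysctl -w net.core.default_qdisc=fq
tc qdisc replace dev eth0 root fq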







