[Bloat] quick review and rant of "Identifying and Handling Non Queue Building Flows in a Bottleneck Link"

Dave Taht dave at taht.net
Mon Nov 12 17:19:50 EST 2018


Greg White <g.white at CableLabs.com> writes:

> Hi Dave,
>
> Thanks for the constructive review comments on the text of the draft,
> including pointers to references that could be included.  I'll take
> those comments on in the next revision.

Thank you for filtering out the late-night noisy part of my post.

I hit save on this one... please forgive me for making these points
strenuously, one last time, in this forum. After this, I plan to just go
back to work, making running code run everywhere I can.

>
> As I know you are aware, the majority of bottleneck links are not
> linux machines (running tc and feeding packets to a continuous
> dedicated link)

I imagine there are ~2B home/business/edge routers in the world. I do
not know how many of those run a Linux- or BSD-derived operating system
or hardware. A recent 5G test was also purely Linux-based.

The other bottlenecks (CMTS/DSLAM/eNodeB) seem to be running something
else entirely.

2b ain't scratch.

There's still DPDK and other software solutions left to chase here.

>, so comparing lines of kernel code as a metric for
> complexity isn't the most useful.

What would you like as a metric for complexity?

In pure computer science we can characterize things as O(n), O(n log n),
or a variety of other complexity bounds. QFQ has a good analysis that
way, but that sort of analysis does tend to assume that stuff that's
already done in hw has a cost, when it doesn't.

This is a serious question, in that you are seeking some method to
identify "sparse" packets cheaper than DRR or SFQ, without clearly
identifying what that cost budget is. For another complexity analysis,
see the HHF thing I'm linking to later here.

I've racked my brain trying to come up with alternatives to hashes
and TCAMs - like bloom filters - and all of them have more complexity
and cost than hashes and FQ. In hardware or software.

People do regexps in hardware now. Those are *gnarly*.

It would be great to nail your complexity requirement to a
wall. Example:

"We want something other than hash + FQ that can run in a single clock
cycle on a Xilinx FPGA at 1GHz to pull sparser flows out of the mix and
insert them ahead of other bulk flows."

That makes the research simpler, I think. That's a hardware spec.

hw bloom filters would be my best guess at something that might be able
to identify flows.

https://www.eecs.harvard.edu/~dbrooks/lyon_islped09.pdf
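For concreteness, here's a minimal software sketch of the bloom-filter
idea - flagging flows observed queue-building so that unflagged flows
can be treated as sparse. The sizes, hash choice, and 5-tuple keys are
my illustrative assumptions, not a hardware design:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into an m-bit array.
    False positives are possible; false negatives are not."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: bytes):
        # Derive k probe positions by salting one hash; hw would use
        # k independent cheap hash functions instead.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key: bytes):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(key))

# Mark a "queue-building" flow 5-tuple; flows never added stay
# (with high probability) absent.
bf = BloomFilter()
bf.add(b"10.0.0.1:443->10.0.0.2:50000/tcp")
print(b"10.0.0.1:443->10.0.0.2:50000/tcp" in bf)  # True
print(b"10.0.0.9:53->10.0.0.2:40000/udp" in bf)   # almost surely False
```

The catch, as the linked paper explores, is that the bit array fills up
over time, so a real design needs aging or rotation - more state and
more complexity than a plain hash into FQ queues.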

However, it is easier to identify heavy hitters. The HHF algorithm, for
example, was a *good try*:

https://lwn.net/Articles/577208/

but in terms of complexity and cost, well, I've not got around to
benchmarking that one or looking into a hardware implementation. In my
world, lacking TCAMs, route table lookups, firewalls, NAT and shaping
far outweigh the cost of the base fq_codel algorithm.

shaping is *easy* in hw.

> In the bottleneck links that I am
> concerned about (and other similar links), it is advantageous to
> tightly couple the queuing and scheduling mechanisms with the
> hardware-optimized discontinuous MAC layer

It would be good if you clarified your targets in the draft.

Secondly, we did "tightly couple queuing and scheduling mechanisms with
the hardware-optimized discontinuous MAC layer" in our wifi work.

https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf

I'd hoped that that would beat a path forward for 5G, cable and GPON
network designs.

>, and also take into account
> engineering tradeoffs that are different than those that apply in a
> linux machine.

I am painfully aware of these; however, my points about engineering
tradeoffs do seem to slide off of some people.

On the high end, on gear I'm familiar with today:

* we don't know how to get past 100Gbit without multi-queueing
* most (all) of the 10Gbit+ hardware I'm aware of has at least 64 HW
  queues
* hashing is done in hw

after that...

* 1024-way DRR or SFQ requires a 10-bit AND gate (mask) wired to the
  output of the hasher that looks up where to stick the packet.
* timestamps are "free"
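That hash-plus-mask step can be sketched in a few lines (illustrative
only - crc32 stands in for whatever the hw hasher actually computes):

```python
import zlib

NUM_QUEUES = 1024  # 2^10 DRR/SFQ queues

def queue_index(src_ip, src_port, dst_ip, dst_port, proto):
    """Hash the flow 5-tuple, then AND with a 10-bit mask to pick a
    queue - mirroring a hardware hasher feeding a mask."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    return zlib.crc32(key) & (NUM_QUEUES - 1)

# Packets of the same flow always land in the same queue.
q1 = queue_index("10.0.0.1", 443, "10.0.0.2", 50000, "tcp")
q2 = queue_index("10.0.0.1", 443, "10.0.0.2", 50000, "tcp")
assert q1 == q2 and 0 <= q1 < NUM_QUEUES
```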

The DRR stuff has existed in netfpga for years.

> The point about bittorrent and RMCAT wasn't about latency, it was
> about throughput.  I have come to share the view that others have
> expressed that it is a valuable property of the internet that
> congestion control algorithm developers are free to innovate.

I certainly think cc devs are free to innovate.

Thing is, 99.99999% of the world doesn't care about CC. Take
freeswitch's VoIP stuff - it doesn't do any CC, it just blasts packets.
Plenty of applications just blast packets. The TCP-friendly world exists
only in the IETF. Outside, it's gnarly.

Of the remainder of the population, most CC devs want theirs to be the
one true algorithm, and that's the source of endless contention - like
watching BBR duke it out against cubic, which is horrible to behold.

Lowering the allowable buffering bar to something reasonable and then
innovating with an FQ'd environment is perfectly doable.

Using FQ actually *opens up the possibilities* to do whatever CC you
like, because whatever you do does not impact anyone else (much), and
neither are you impacted by them.

Want to grab more bandwidth? Go ahead, try, until you get delay. Too
much delay and you drop packets. Don't want to drop packets? Use
ecn... and eat as much delay as you want.

Want to use less bandwidth and not be a burden? That's totally doable
also (in my case I just reverted reno to IW2 and cut the growth rate to 40%).

Diffserv - were it only transported e2e - is how to better express
your intent.

.. as it is, cake's diffserv works pretty well on egress from your network.

>FQ in
> the bottleneck link diminishes this ability in some important ways.
> Some applications (cloud backup, mass uploads like the icloud photo
> library updates, bittorrent, etc) might not be interested in per-flow
> fairness with TCP, but would prefer to be less aggressive and thus
> take a smaller fraction of the link when there are competing TCP flows

And that happens automatically: less greedy flows get a smaller
fraction of the link.
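A toy DRR loop shows why: with a per-round quantum, a backlogged bulk
flow can only drain its quantum per round, while a light flow clears
its whole small backlog immediately. The packet sizes and counts here
are made up for illustration:

```python
from collections import deque

def drr(queues, quantum, rounds):
    """Minimal deficit round robin over per-flow packet-size queues.
    Each backlogged queue earns `quantum` bytes of deficit per round and
    sends packets while its deficit covers them."""
    deficits = [0] * len(queues)
    sent = [0] * len(queues)
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0  # idle flows don't bank credit
                continue
            deficits[i] += quantum
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()
                deficits[i] -= pkt
                sent[i] += pkt
    return sent

greedy = deque([1500] * 100)   # bulk flow with a deep backlog
sparse = deque([100] * 5)      # light flow offering little
print(drr([greedy, sparse], quantum=1500, rounds=5))  # [7500, 500]
```

The greedy flow is capped at one quantum per round; the sparse flow got
everything it asked for and then stopped competing.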

> (note, I am not arguing that the *current* ledbat protocol is the
> right one).  Other applications (RMCAT) might in the future aim to be
> per-flow fair, but need slightly longer to adapt than FQ allows.

"but need slightly longer to adapt than FQ allows"

I don't see your point here. Videoconferencing slightly below the fair
share never gets a drop or sees delay. Higher rates see a bit of
delay.

Videoconferencing is very good at dealing with drops and jitter
far far worse than what FQ or FQ_codel can inflict.

In terms of never dropping... I do hope to fiddle with ECN in NADA or
google's CC someday. Still, for videoconferencing, the minimum reaction
time is about a frame, 16ms - not hundreds of ms. Certainly encoders
today have difficulty shifting rates faster than every few hundred ms,
but keeping videoconferencing delays below 150ms total, e2e, is
required for a good experience - and I'd really love to see 16ms on
that side, and for that matter a return to scan-lines to get rid of the
16ms we currently lose per frame.
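Back-of-envelope, taking the 60fps frame time and the common 150ms
interactive budget as the working assumptions:

```python
# One video frame at 60 fps - the minimum per-frame reaction time:
frame_ms = 1000 / 60           # ~16.7 ms
# A common end-to-end budget for conversational video:
budget_ms = 150
# Frames' worth of slack in the whole e2e budget:
print(frame_ms, budget_ms / frame_ms)  # ~16.7 ms, ~9 frames
```

Nine frame-times have to cover capture, encode, network, decode and
display, which is why hundreds of ms of queue at any one hop blows the
budget on its own.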

The LOLA project had it right.

> Some folks are just uncomfortable with the departure from the
> "end-to-end" principal with fq

The authors of that paper, notably Dave Reed, have embraced this
departure from the e2e principle. The author of RED never shared it.

> (i.e. a middlebox gets to determine
> what constitutes a "flow", then enforces that all fat "flows" get
> equal bandwidth at every instant).

But that, in general, is not my concern. Microbursts, clumping at
bottlenecks, DOS attacks, and applications that care not one whit about
CC in the first place are the largest part of the problem.

I'm unfond of middleboxes myself; however, I do not think re-ordering
packets *fairly* is unfair to anything.
 
> Fq_codel is a pragmatic solution
> for some systems, and it gives a real undeniable benefit in common
> cases with current applications/mixes, but it isn't perfect.

I wish, very much, that the AQM working group had merely come up with a
recommendation for "no more than a continental BDP's worth of buffering"
at the head end. It's the multiple ISPs and designs that still have
> 200ms of buffering built into them that make me the most nuts.

Can anyone here make a spirited and robust defence of > 100ms of buffering
at the bottleneck links along the edge? Of 500ms? Of 2 seconds?

We rarely even get close to 100ms worth of buffering with today's paced
TCPs. Videoconferencing needs a frame. What else?
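To put numbers on that, a rough BDP and buffer-delay calculation (the
20Mbit rate and 100ms continental RTT are illustrative choices):

```python
def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight for one RTT at line rate."""
    return rate_bps * rtt_s / 8

def buffer_delay_ms(buffer_bytes, rate_bps):
    """How long a full buffer of this size delays packets at a given rate."""
    return buffer_bytes * 8 / rate_bps * 1000

# A continental 100ms RTT at a 20 Mbit/s uplink:
print(bdp_bytes(20e6, 0.1))              # ~250 KB is one BDP
# A 5 MB buffer at that same rate holds:
print(buffer_delay_ms(5_000_000, 20e6))  # ~2000 ms of queue
```

One BDP of buffering at the head end is a few hundred KB here; the
multi-MB buffers still shipping translate directly into seconds of
induced delay.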

The perfect is the enemy of the good. I'm perfectly happy to have cut
the internet's observable buffering to nearly 0 for most flows, and to
far less than 100ms for all others. I'm happy to see innovations like
BBR, which - plus FQ of any sort - works great.

There may well be an achievable target even further below what we
currently achieve, but given a choice between chasing that and allowing
things to persist...

Well... I went off and fixed wifi. Good wifi will keep ISPs relevant
relative to 5G for a few more years at least... and the uptake is
near total on the hardware and software I've been able to influence,
with universal acclaim from the users.

> Perhaps
> sch_cake is better in some cases, but it makes a lot more decisions on
> behalf of the user that might be ok now, but may not be in the future.

Certainly we put the control back in the users' hands here. I did rather
wish with cake that someone here would go and play with the diffserv
stuff (which allows a flow to express, directly, what it wants), or the
ECN implementation, which allows a flow to respond with more or less
latency as it chooses, without loss, with their CC algo or application of choice.

... and get back to us.

I do also note a subtle point. In many parts of the DSL world, at least,
users get to choose how their ISP shapes their traffic. I wish we could
have devices and ISP links, universally, that handed users back that
control, and took the ISPs and CC devs out of it.

> In light of all of the above, I'm open to considering other
> alternatives.  So, I don't see this as a non-problem, and I don't see
> the potential solutions as being in any way magical.

What solutions do you see? I'm stumped. The best idea I have is bloom
filters, in hw, and even that would be far less effective for the vast
majority of real traffic and real applications than fq_codel has been.

Can you point at anything that might work? Especially anything that can
also handle all the bad non-cc'd behavior on the internet today?

> -Greg
>
>
> On 11/2/18, 2:13 AM, "Dave Taht" <dave.taht at gmail.com> wrote:
>
>     On Thu, Nov 1, 2018 at 6:25 AM Toke Høiland-Jørgensen <toke at toke.dk> wrote:
>     >
>     > Greg White <g.white at CableLabs.com> writes:
>     >
>     > > Hi Toke, thanks for the pointer to your paper, I had not seen it
>     > > before.
>     >
>     > You're welcome :)
>     >
>     > > I agree that could be a candidate algorithm. It is certainly simple.
>     > > It may not be the only (or perhaps even the best) solution for the
>     > > dual-queue case though. I'm thinking in the context of an L4S TCP
>     > > flow, which can respond quickly to "new" ECN markings and achieve link
>     > > saturation with ultra low (but not zero) queuing delay. A good
>     > > property for a queue protection algorithm would be that these L4S
>     > > flows could be consistently placed in the NQB queue. I think that the
>     > > simple approach you mentioned would result in many L4S flows being
>     > > deemed Queue Building.
>     >
>     > Yes, I think you are right (depending on traffic mix, of course). It
>     > might be possible to tweak it to work better, though. E.g., by changing
>     > the threshold (moving flows to QB if they end up with more than X ms of
>     > queue). This would only work if you start out all flows at NQB, with the
>     > associated aggressive marking behaviour; so I'm not sure if a normal TCP
>     > flow would ever manage to get to QB state before getting clobbered by
>     > the NQB markings...
>     >
>     > > Additionally, we've observed applications that send variable sized
>     > > "messages" at a fixed rate (e.g. 20 messages/second) where the message
>     > > sometimes exceeds the MTU and results in two closely spaced (possibly
>     > > back-to-back) packets. This is a flow that I think should be
>     > > considered to be NQB, but would get flagged as QB by the simple
>     > > approach. You described this case in your paper, where you state that
>     > > the first Q bytes of each burst will be treated as NQB (the first
>     > > packet in the case we're talking about here), but the rest will be
>     > > treated as QB. Assuming that message latency is important for these
>     > > sorts of applications, this is equivalent to saying that the entire
>     > > burst is considered as QB. In the fq_codel case, the message latency
>     > > would be something like Q(n-1)(N+1)/R (assuming no other sparse flow
>     > > arrivals), something like 1.3ms using the example values in your paper
>     > > (plus n=2, N=10) which may be ok. In the dual-queue case it is a
>     > > bigger deal, because the remaining packets would be put at the end of
>     > > the QB queue, which could have a latency of 10 or 20 ms.
>     >
>     > Sure, it's by no means a perfect mechanism. But it makes up for that by
>     > it's simplicity, IMO. And it does work really well for *a lot* of
>     > today's latency-sensitive traffic.
>     >
>     > (In your case of two-MTU messages, you could tune the quantum to allow
>     > those; but of course you can construct examples that won't work).
>     
>     sch_fq has a default quantum of 2 MTU, with an initial burst of 10.
>     There's all sorts of interesting work inside that to "right-size" the
>     ongoing gso offloads and a major new advance over there on calculating
>     rtts properly is described here:
>     
>     https://lwn.net/Articles/766564/
>     
>     ...
>     
>     there was at one point, a fq_pie implementation that used the rbtree
>     in sch_fq to
>     achieve perfect fairness.
>     
>     we often tune fq_codel's quantum as low as 300, at low rates.
>     
>     > > So, a queue protection function that provides a bit more (but still
>     > > limited) allowance for a flow to have packets in queue would likely
>     > > work better in the dual-queue case.
>     
>     For inbound shaping sch_cake defaults to 2mtu at higher rates.
>     
>     This kind of opens a question in that, what is a typical target
>     bandwidth for l4s applications?
>     
>     the videoconferencing paper I dissed in my earlier rant focused only
>     at 2mbits (and drew the conclusion that sfq was the best option for
>     the cc algo's 300ms desired range)
>     
>     I have generally been focused on bandwidths in the 4mbit-40gbit range.
>     
>     >
>     > Yeah, that's tricky, especially if you want it to be very accurate in
>     > its distinction; which I sort of gather that you do, right?
>     
>     In part, that is why I would like the language in the problem
>     statement clarified. To give a concrete counterexample,
>     think upon the ultimate choice of a crc32 algorithm and the parameters
>     that drove that choice.
>     
>     If an accuracy of identifying "sparse flows" or NQB flows can be
>     specified, then all sorts of algorithms can be thrown into the fray.
>     AI. Bloom filters. rbtrees. regexes. cuckoo++ hashes... and boring
>     old fq tech. :)
>     
>     > -Toke
>     > _______________________________________________
>     > Bloat mailing list
>     > Bloat at lists.bufferbloat.net
>     > https://lists.bufferbloat.net/listinfo/bloat
>     
>     
>     
>     -- 
>     
>     Dave Täht
>     CTO, TekLibre, LLC
>     http://www.teklibre.com
>     Tel: 1-831-205-9740
>     


