[Cake] Dropping dropped

Fri Feb 15 15:45:24 EST 2019

I still regard inbound shaping as our biggest deployment problem,
especially on cheap hardware.

Some days I want to go back to revisiting the ideas in the "bobbie"
shaper, other days...

In terms of speeding up cake:

* At higher speeds (e.g. > 200 Mbit) cake tends to bottleneck on a
single CPU, in softirq. An LWN article just went by about a proposed
set of improvements for that:
https://lwn.net/SubscriberLink/779738/771e8f7050c26ade/

* Hardware multiqueue is more and more common (the APU2 has 4 queues).
fq_codel is inherently parallel and could take advantage of hardware
multiqueue, if there were a better way to express it. What happens
nowadays, when running at line rate, is that you get the "mq"
scheduler with 4 fq_codel instances. With 64 hardware queues,
increasingly common at >10GigE, that means 64k fq_codel queues
(64 x 1024 by default), which I tend to think is excessive. I'd love
a divisor in the mq -> subqdisc code so that we would have, oh, 32
queues per hw queue in this case.

Worse, there's no way to attach a global shaped instance to that
hardware: using cake, for example, forces all those hardware queues
(even across CPUs) into one. The ingress mirred code is also a
problem here. A "cake-mq" seemed feasible (basically you just turn the
shaper tracking into an atomic operation in three places), but the
overlying qdisc architecture for sch_mq -> subqdiscs has to be
extended or bypassed somehow. (There's no way for sch_mq to
automagically pass sub-qdisc options to the next qdisc, and there's no
reason to have sch_mq

* I really liked the ingress "skb list" rework, but I'm not sure how
to get that from A to B.

* And I have a long-standing dream of being able to kill off mirred
entirely and just be able to write

tc qdisc add dev eth0 ingress cake bandwidth X

*  native codel is 32 bit, cake is 64 bit. I

* Hashing three times, as cake does, is expensive. Computing partial
hashes once and combining them into the final hash would be faster.

* 8 way set associative is slower than 4 way, and 4 way is almost
indistinguishable from 8. Even direct mapping

* The cake blue code is rarely triggered and inline
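
To make the partial-hash point above concrete, here is a minimal
sketch in Python. It is not cake's actual code: zlib.crc32 and the
boost-style mixing step are stand-ins picked for illustration, and all
function names are hypothetical. The idea shown is to hash the source
address, destination address, and ports/protocol once each, then mix
the three partial results into a flow hash, rather than running a full
hash three times over overlapping data.

```python
import zlib

MASK = 0xFFFFFFFF  # keep everything in 32 bits


def partial_hash(data: bytes) -> int:
    """Hash one piece of the flow key (stand-in hash: CRC32)."""
    return zlib.crc32(data) & MASK


def combine(a: int, b: int) -> int:
    """Mix two 32-bit partial hashes (boost-style hash_combine)."""
    return (a ^ ((b + 0x9E3779B9 + (a << 6) + (a >> 2)) & MASK)) & MASK


def host_and_flow_hashes(src: bytes, dst: bytes, ports_proto: bytes):
    """Return (src_host_hash, dst_host_hash, flow_hash).

    Each byte of the key is hashed exactly once; the flow hash is
    derived from the partial hashes instead of a third full pass.
    """
    src_h = partial_hash(src)
    dst_h = partial_hash(dst)
    rest_h = partial_hash(ports_proto)
    flow_h = combine(combine(src_h, dst_h), rest_h)
    return src_h, dst_h, flow_h


# Two flows between the same pair of hosts, differing only in source port.
src = bytes([192, 0, 2, 1])
dst = bytes([198, 51, 100, 7])
flow_a = (40000).to_bytes(2, "big") + (443).to_bytes(2, "big") + bytes([6])
flow_b = (40001).to_bytes(2, "big") + (443).to_bytes(2, "big") + bytes([6])

hashes_a = host_and_flow_hashes(src, dst, flow_a)
hashes_b = host_and_flow_hashes(src, dst, flow_b)
print(hashes_a, hashes_b)
```

Both flows land on the same host hashes but get different flow hashes,
which is the property the per-host and per-flow isolation needs.
Whether this is actually cheaper in the kernel depends on the real
hash function and key layout, which this sketch does not model.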

I really did want cake to be faster than htb+fq_codel. A while back I
started a project to basically resurrect "early cake" (which WAS 40%
faster than htb+fq_codel) and add in *only* the idea of an atomic
builtin hw-mq shaper, but I haven't got back to it.

https://github.com/dtaht/fq_codel_fast

With everything I ripped out in that, it was about 5% less CPU to start with.
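
For readers wondering what an atomic shared shaper across hardware
queues might look like, here is a rough sketch in Python. It is an
illustration under assumptions, not code from cake or fq_codel_fast:
the class and function names are hypothetical, and since Python has no
kernel-style atomics, a threading.Lock stands in for the atomic
update. Each per-queue worker reserves its transmit slot by advancing
one shared "next transmit time", so the aggregate rate is enforced
even though the queues run independently.

```python
import threading


class SharedShaper:
    """One shaper clock shared by several per-hardware-queue workers."""

    def __init__(self, rate_bytes_per_sec: int, now_ns: int = 0):
        self.rate = rate_bytes_per_sec
        self.next_tx_ns = now_ns
        self._lock = threading.Lock()  # stand-in for an atomic operation

    def reserve(self, pkt_len: int, now_ns: int) -> int:
        """Claim a transmit slot; return the time the packet may leave."""
        cost_ns = pkt_len * 1_000_000_000 // self.rate
        with self._lock:
            start = max(self.next_tx_ns, now_ns)
            self.next_tx_ns = start + cost_ns
        return start


def simulate(num_queues: int, pkts_per_queue: int, pkt_len: int, rate: int):
    """Run one thread per queue, all backlogged at time 0."""
    shaper = SharedShaper(rate)
    starts = [[] for _ in range(num_queues)]

    def worker(q: int) -> None:
        for _ in range(pkts_per_queue):
            starts[q].append(shaper.reserve(pkt_len, now_ns=0))

    threads = [threading.Thread(target=worker, args=(q,))
               for q in range(num_queues)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shaper.next_tx_ns, [s for per_q in starts for s in per_q]


# 4 queues, 1000 packets of 1500 bytes each, shaped to 1 Gbit/s overall.
final_ns, all_starts = simulate(4, 1000, 1500, 125_000_000)
print(final_ns, len(set(all_starts)))
```

No two packets get the same slot, and the total time reserved matches
the configured rate regardless of which queue sent what. A real
version would also have to deal with the qdisc layering issues
described above, which this sketch ignores entirely.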

I can't tell you how many times I've looked over

https://elixir.bootlin.com/linux/latest/source/net/sched/sch_mqprio.c

hoping that enlightenment would strike and there was a clean way to get
rid of that layer of abstraction.

But coming up with how to run more stuff in parallel was beyond my rcu-foo.