[Bloat] CAKE in openwrt high CPU

Tue Sep 1 17:09:10 EDT 2020

Jonathan Morton <chromatix99 at gmail.com> writes:

>> On 1 Sep, 2020, at 9:45 pm, Toke Høiland-Jørgensen via Bloat <bloat at lists.bufferbloat.net> wrote:
>> 
>> CAKE takes the global qdisc lock.
>
> Presumably this is a default mechanism because CAKE doesn't handle any
> locking itself.
>
> Obviously it would need to be replaced with at least a lock over
> CAKE's complete data structures, taking the lock on each entry point
> and releasing it at each return point, and I assume there is a flag we
> can set to indicate we do so. Finer-grained locking might be possible,
> but CAKE is fairly complex so that might be hard to implement. Locking
> per CAKE instance would at least allow running ingress and egress on
> different CPUs.

What you're describing here is basically the existing qdisc root lock.
It is per instance of the qdisc, and it is held only while enqueueing
and dequeueing packets from that qdisc. So it is possible today to run
the ingress and egress instances of CAKE on different CPUs. All you have
to do is schedule the packets to be processed on different CPUs in the
different directions - which usually means messing with RPS settings for
the NIC, and as I remarked to Sebastian, for many OpenWrt SoCs this is
not really supported...
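
To illustrate the per-instance point, here is a minimal user-space sketch
in C. It is not kernel code: struct toy_qdisc, toy_lock() and friends are
made-up names, and a C11 atomic_flag spinlock stands in for the root lock.
In the kernel the caller takes the lock around enqueue()/dequeue(); here it
is folded into the functions for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_QLEN 8

/* Hypothetical stand-in for one qdisc instance (not kernel code). */
struct toy_qdisc {
    atomic_flag root_lock;  /* one lock per instance, not a global one */
    int pkts[TOY_QLEN];     /* small ring buffer of "packets" */
    size_t head, len;
};

static void toy_qdisc_init(struct toy_qdisc *q)
{
    atomic_flag_clear(&q->root_lock);
    q->head = 0;
    q->len = 0;
}

static void toy_lock(struct toy_qdisc *q)
{
    while (atomic_flag_test_and_set_explicit(&q->root_lock,
                                             memory_order_acquire))
        ; /* spin until this instance's lock is free */
}

static void toy_unlock(struct toy_qdisc *q)
{
    atomic_flag_clear_explicit(&q->root_lock, memory_order_release);
}

/* Returns 0 on success, -1 if the queue is full. */
static int toy_enqueue(struct toy_qdisc *q, int pkt)
{
    int ret = -1;

    assert(q != NULL);
    toy_lock(q);
    if (q->len < TOY_QLEN) {
        q->pkts[(q->head + q->len) % TOY_QLEN] = pkt;
        q->len++;
        ret = 0;
    }
    toy_unlock(q);
    return ret;
}

/* Returns 0 and stores the packet in *pkt, or -1 if the queue is empty. */
static int toy_dequeue(struct toy_qdisc *q, int *pkt)
{
    int ret = -1;

    assert(q != NULL && pkt != NULL);
    toy_lock(q);
    if (q->len > 0) {
        *pkt = q->pkts[q->head];
        q->head = (q->head + 1) % TOY_QLEN;
        q->len--;
        ret = 0;
    }
    toy_unlock(q);
    return ret;
}
```

With one toy_qdisc for ingress and one for egress, a CPU holding the egress
lock never delays an enqueue on the ingress instance; contention only
arises when two CPUs hit the same instance.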

To make CAKE truly take advantage of multiple CPUs, there are two
options:

1. Make it aware of multiple hardware queues. To do this, we would need
   to implement the 'attach()' method in the Qdisc_ops struct (see sch_mq
   for an example). The idea here would be to create stub child qdiscs
   with a separate struct Qdisc_ops implementing enqueue() and
   dequeue(). These would be called separately for each hardware queue,
   with their separate locks held at the time; and with proper XPS
   steering, each hardware queue can be serviced by a separate CPU.

2. Set the TCQ_F_NOLOCK flag in the qdisc flags; this will cause the existing
   enqueue() and dequeue() functions to be called without the root lock
   being held, and the qdisc is responsible for dealing with that
   itself.
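
As a rough user-space model of what option 2 changes (a sketch only, not
the kernel's actual transmit path): the caller checks a flag and either
takes the root lock around enqueue or calls it bare. TOY_F_NOLOCK,
struct toy_sched and toy_xmit() are hypothetical names standing in for
TCQ_F_NOLOCK, the qdisc and the code that calls into it; the counters
exist only to make the behaviour observable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_F_NOLOCK 0x1u  /* hypothetical stand-in for TCQ_F_NOLOCK */

struct toy_sched {
    unsigned int flags;
    atomic_flag root_lock;         /* taken by the caller unless NOLOCK */
    atomic_int lock_acquisitions;  /* instrumentation for this sketch */
    atomic_int backlog;            /* atomic, so lockless callers are safe */
};

static void toy_sched_init(struct toy_sched *q, unsigned int flags)
{
    q->flags = flags;
    atomic_flag_clear(&q->root_lock);
    atomic_init(&q->lock_acquisitions, 0);
    atomic_init(&q->backlog, 0);
}

/* The "qdisc" side: must tolerate concurrent callers by itself. */
static void toy_sched_enqueue(struct toy_sched *q)
{
    atomic_fetch_add(&q->backlog, 1);
}

/* The "stack" side: decides whether to serialise callers. */
static void toy_xmit(struct toy_sched *q)
{
    assert(q != NULL);

    if (q->flags & TOY_F_NOLOCK) {
        /* No root lock: concurrency is the qdisc's problem. */
        toy_sched_enqueue(q);
        return;
    }

    while (atomic_flag_test_and_set_explicit(&q->root_lock,
                                             memory_order_acquire))
        ; /* spin: only one CPU at a time gets into the qdisc */
    atomic_fetch_add(&q->lock_acquisitions, 1);
    toy_sched_enqueue(q);
    atomic_flag_clear_explicit(&q->root_lock, memory_order_release);
}
```

The point of the model is that with the flag set, every field touched by
enqueue and dequeue has to be safe under concurrent access, which is
trivial for a single counter and much less so for CAKE's flow tables.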

Of course in either case, the trick is to get the CAKE data structures
to play nice with concurrent access from multiple CPUs. For option 1.
above, we could just duplicate all the flow queues for each netdev queue
and take the hit in wasted space - or we could partition the data
structure, either statically at init, or dynamically as each flow
becomes active. But at a minimum there would need to be some way for the
shaper to enforce the maximum rate. Maybe a granular lock or an atomic
is good enough for this, though?
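
For the "atomic" variant, something along these lines is what I have in
mind; a minimal user-space sketch using C11 atomics, not CAKE's actual
shaper. It assumes a single shared "earliest next send time" that every
per-queue instance must claim with a compare-and-swap before sending.
struct toy_shaper and toy_shaper_try_send() are hypothetical names, and the
schedule always restarts from 'now', which is a simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shaper state shared by all per-queue instances. */
struct toy_shaper {
    _Atomic uint64_t time_next_ns;  /* earliest time the next packet may go */
    uint64_t ns_per_byte;           /* serialisation cost at the shaped rate */
};

static void toy_shaper_init(struct toy_shaper *s, uint64_t rate_bytes_per_sec)
{
    assert(s != NULL);
    assert(rate_bytes_per_sec > 0 && rate_bytes_per_sec <= 1000000000ULL);
    atomic_init(&s->time_next_ns, 0);
    s->ns_per_byte = 1000000000ULL / rate_bytes_per_sec;
}

/*
 * Try to claim the link for a packet of 'len' bytes at time 'now_ns'.
 * Returns true if the caller may send now, false if it has to wait.
 */
static bool toy_shaper_try_send(struct toy_shaper *s, uint64_t now_ns,
                                uint32_t len)
{
    uint64_t next = atomic_load(&s->time_next_ns);

    for (;;) {
        if (next > now_ns)
            return false;  /* another queue already used up this slot */

        /* Restart the schedule from 'now' so idle time is not banked. */
        uint64_t new_next = now_ns + (uint64_t)len * s->ns_per_byte;

        if (atomic_compare_exchange_weak(&s->time_next_ns, &next, new_next))
            return true;
        /* CAS failed: 'next' now holds the fresh value, so re-check. */
    }
}
```

Only one caller can win the compare-and-swap for a given slot, so the
aggregate rate stays bounded without a lock around the flow queues
themselves. Whether the contention on that one cache line is acceptable is
exactly the kind of thing that would need measuring.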

Note also that for 2. there's an ongoing issue[0] with packets getting
stuck which is still unresolved, as far as I can tell - so not sure if
this is the right way to go. However, apart from this, the benefit of 2.
is that CAKE could *potentially* process packets on multiple CPUs
without relying on hardware multi-Q. I'm not quite sure if the stack
will actually process packets on more than one CPU without multiple
hardware queues, though.

Either way, I suppose some experimentation would be needed to find the
best solution.

-Toke

[0] https://lore.kernel.org/netdev/CACS=qq+a0H=e8yLFu95aE7Hr0bQ9ytCBBn2rFx82oJnPpkBpvg@mail.gmail.com/