[Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

Mon Mar 6 08:30:23 EST 2017

On Fri, Mar 3, 2017 at 12:21 AM, Dave Taht <dave.taht at gmail.com> wrote:

> As this is devolving into a cake specific discussion, removing the
> lede mailing list.
>
> On Thu, Mar 2, 2017 at 9:49 PM, Jonathan Morton <chromatix99 at gmail.com>
> wrote:
> >
> >> On 3 Mar, 2017, at 07:00, Eric Luehrsen <ericluehrsen at hotmail.com>
> wrote:
> >>
> >> That's not what I was going for. Agree, it would not be good to depend
> >> on an inferior hash. You mentioned divide as a "cost." So I was
> >> proposing a thought around a "benefit" estimate. If hash collisions are
> >> not as important (or are they), then what is "benefit / cost?"
> >
> > The computational cost of one divide is not the only consideration I
> have in mind.
> >
> > Cake’s set-associative hash is fundamentally predicated on the number of
> hash buckets *not* being prime, as it requires further decomposing the hash
> into a major and minor part when a collision is detected.  The minor part
> is then iterated to try to locate a matching or free bucket.
> >
> > This is considerably easier to do and reason about when everything is a
> power of two.  Then, modulus is a masking operation, and divide is a shift,
> either of which can be done in one cycle flat.
> >
> > AFAIK, however, the main CPU cost of the hash function in Cake is not
> the hash itself, but the packet dissection required to obtain the data it
> operates on.  This is something a profile would shed more light on.
>
> Tried. Mips wasn't a good target.
>
> The jhash3 setup cost is bad, but I agree flow dissection can be
> deeply expensive. As well as the other 42+ functions a packet needs to
> traverse to get from ingress to egress.
>
> But staying on hashing:
>
> One thing that landed 4.10? 4.11? was fq_codel relying on a skb->hash
> if one already existed (injected already by tcp, or by hardware, or
> the tunneling tool). we only need to compute a partial hash on the
> smaller subset of keys in that case (if we can rely on the skb->hash
> which we cannot do in the nat case)
>
> Another thing I did, long ago, was read the (60s-era!) liturature
> about set-associative cpu cache architectures... and...
>
> In all of these cases I really, really wanted to just punt all this
> extra work to hardware in ingress - computing 3 hashes can be easily
> done in parallel there and appended to the packet as it completes.
>
> I have been working quite a bit more with the arm architecture of
> late, and the "perf" profiler over there is vastly better than the
> mips one we've had.
>
> (and aarch64 is *nice*. So is NEON)
>
> - but I hadn't got around to dinking with cake there until yesterday.
>
> One thing I'm noticing is that even the gigE capable arms have weak or
> non-existent L2 caches, and generally struggle to get past 700Mbits
> bidirectionally on the network.
>
> some quick tests of pfifo vs cake on the "lime-2" (armv7 dual core) are
> here:
>
> http://www.taht.net/~d/lime-2/
>
> The rrul tests were not particularly pleasing. [1]
>
> ...
>
> A second thing on my mind is to be able to take advantage of A) more cores
>
> ... and B) hardware that increasingly has 4 or more lanes in it.
>
> 1)  Presently fq_codel (and cake's) behavior there when set as a
> default qdisc is sub-optimal - if you have 64 hardware queues you end
> up with 64 instances, each with 1024 queues. While this might be
> awesome from a FQ perspective I really don't think the aqm will be as
> good. Or maybe it might be - what happens with 64000 queues at
> 100Mbit?
>
> 2) It's currently impossible to shape network traffic across cores.
> I'd like to imagine that with a single atomic exchange or sloppily
> shared values shaping would be feasible.
>
>
When you need to worry about multithreading, many times perfect is very
much the enemy of good. Depending on how quickly you need to make the
network react, you could do something along the lines of a "shared pool" of
bandwidth. Each core gets a split of the bandwidth, any unused bandwidth
can be added to the pool, and cores that want more bandwidth can take
bandwidth from the pool.

You could treat it like task stealing, except each core can generate tokens
that represent a quantum of bandwidth that is only valid for some interval.
If a core suddenly needs bandwidth, it can attempt to "take back" from its
publicly shared pool. If other cores have already borrowed, it can attempt
to borrow from another core. If it can't find any spare bandwidth, it just
waits for some interval related to how long a quantum is valid, and assumes
it's safe.

Or something.. I don't know, it's 7am and I just woke up.

> (also softirq is a single thread, I believe)
>
> 3) mq and mqprio are commonly deployed on the high end for this.
>
> So I've thought about doing up another version - call it - I dunno -
> smq - "smart multi-queue" - and seeing how far we could get.
>
> >  - Jonathan Morton
> >
> > _______________________________________________
> > Cake mailing list
> > Cake at lists.bufferbloat.net
> > https://lists.bufferbloat.net/listinfo/cake
>
>
>
> [1] If you are on this list and are not using flent, tough. I'm not
> going through the trouble of generating graphs myself anymore.
>
> --
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> http://blog.cerowrt.org
> _______________________________________________
> Cake mailing list
> Cake at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cake
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.bufferbloat.net/pipermail/cake/attachments/20170306/aa917d57/attachment-0001.html>