[Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51
dave.taht at gmail.com
Fri Mar 3 01:21:56 EST 2017
As this is devolving into a cake specific discussion, removing the
lede mailing list.
On Thu, Mar 2, 2017 at 9:49 PM, Jonathan Morton <chromatix99 at gmail.com> wrote:
>> On 3 Mar, 2017, at 07:00, Eric Luehrsen <ericluehrsen at hotmail.com> wrote:
>> That's not what I was going for. Agree, it would not be good to depend
>> on an inferior hash. You mentioned divide as a "cost." So I was
>> proposing a thought around a "benefit" estimate. If hash collisions are
>> not as important (or are they), then what is "benefit / cost?"
> The computational cost of one divide is not the only consideration I have in mind.
> Cake’s set-associative hash is fundamentally predicated on the number of hash buckets *not* being prime, as it requires further decomposing the hash into a major and minor part when a collision is detected. The minor part is then iterated to try to locate a matching or free bucket.
> This is considerably easier to do and reason about when everything is a power of two. Then, modulus is a masking operation, and divide is a shift, either of which can be done in one cycle flat.
> AFAIK, however, the main CPU cost of the hash function in Cake is not the hash itself, but the packet dissection required to obtain the data it operates on. This is something a profile would shed more light on.
Tried. MIPS wasn't a good target.
The jhash3 setup cost is bad, but I agree flow dissection can be
deeply expensive. As well as the other 42+ functions a packet needs to
traverse to get from ingress to egress.
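For anyone following along, the power-of-two decomposition Jonathan
describes above can be sketched roughly as below. This is a toy model
in Python, not Cake's actual code: the names NUM_BUCKETS, SET_WAYS,
split_hash and find_bucket, and the sizes chosen, are assumptions made
purely for illustration.

```python
NUM_BUCKETS = 1024  # assumed table size; must be a power of two
SET_WAYS = 8        # assumed associativity; must be a power of two


def split_hash(flow_hash):
    """Split a hash into (major, minor) using masks only, no divide."""
    reduced = flow_hash & (NUM_BUCKETS - 1)  # modulus by a power of two
    minor = reduced & (SET_WAYS - 1)         # preferred way within the set
    major = reduced - minor                  # index of the set's first bucket
    return major, minor


def find_bucket(tags, flow_hash):
    """Return the bucket index for flow_hash, claiming a free one if needed.

    tags[i] holds the full hash of the flow that owns bucket i, or None.
    Returns None when every way in the set is taken by other flows.
    """
    major, minor = split_hash(flow_hash)
    # Pass 1: iterate the minor part looking for a bucket we already own.
    for i in range(SET_WAYS):
        idx = major + ((minor + i) & (SET_WAYS - 1))
        if tags[idx] == flow_hash:
            return idx
    # Pass 2: no match, so iterate again looking for a free bucket.
    for i in range(SET_WAYS):
        idx = major + ((minor + i) & (SET_WAYS - 1))
        if tags[idx] is None:
            tags[idx] = flow_hash
            return idx
    return None


tags = [None] * NUM_BUCKETS
first = find_bucket(tags, 0x12345)
second = find_bucket(tags, 0x12345 + NUM_BUCKETS)  # same bucket, other flow
```

With a prime bucket count the two masks would have to become a real
modulus and divide, and the "set" boundaries would no longer line up
neatly.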
But staying on hashing:
One thing that landed in 4.10? 4.11? was fq_codel relying on skb->hash
if one already existed (injected already by tcp, or by hardware, or
the tunneling tool). We only need to compute a partial hash on the
smaller subset of keys in that case (if we can rely on the skb->hash,
which we cannot do in the nat case).
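A rough model of that idea, again as a Python sketch rather than kernel
code. Everything here is hypothetical: the Packet class, its
cached_hash field (standing in for skb->hash), the natted flag, the
classify function, and the use of crc32 as a stand-in for the real hash
are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    src: str
    dst: str
    sport: int
    dport: int
    proto: int
    cached_hash: Optional[int] = None  # stand-in for skb->hash; None = unset
    natted: bool = False               # True if addresses were rewritten


def _h(*fields):
    """Stand-in hash over the given keys (crc32, not jhash)."""
    return zlib.crc32("|".join(map(str, fields)).encode())


def classify(pkt):
    """Return (flow_hash, src_host_hash, dst_host_hash, reused_cached)."""
    # The per-host hashes cover a small subset of keys and are always computed.
    src_host = _h(pkt.src)
    dst_host = _h(pkt.dst)
    # Reuse the precomputed flow hash only when it exists and can be trusted.
    if pkt.cached_hash is not None and not pkt.natted:
        return pkt.cached_hash, src_host, dst_host, True
    # Otherwise fall back to hashing the full 5-tuple ourselves.
    flow = _h(pkt.src, pkt.dst, pkt.sport, pkt.dport, pkt.proto)
    return flow, src_host, dst_host, False


pkt = Packet("10.0.0.2", "192.0.2.1", 40000, 443, 6)
baseline = classify(pkt)   # no cached hash: full computation
pkt.cached_hash = 0xABCDEF
reused = classify(pkt)     # cached hash present and trusted
pkt.natted = True
fallback = classify(pkt)   # nat case: cached hash ignored
```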
Another thing I did, long ago, was read the (60s-era!) literature
about set-associative cpu cache architectures... and...
In all of these cases I really, really wanted to just punt all this
extra work to hardware on ingress - computing 3 hashes can be easily
done in parallel there and appended to the packet as it completes.
I have been working quite a bit more with the arm architecture of
late, and the "perf" profiler over there is vastly better than the
mips one we've had.
(and aarch64 is *nice*. So is NEON)
- but I hadn't got around to dinking with cake there until yesterday.
One thing I'm noticing is that even the gigE capable arms have weak or
non-existent L2 caches, and generally struggle to get past 700 Mbit/s
bidirectionally on the network.
some quick tests of pfifo vs cake on the "lime-2" (armv7 dual core) are here:
The rrul tests were not particularly pleasing. 
A second thing on my mind is to be able to take advantage of A) more cores
... and B) hardware that increasingly has 4 or more lanes in it.
1) Presently fq_codel's (and cake's) behavior there when set as a
default qdisc is sub-optimal - if you have 64 hardware queues you end
up with 64 instances, each with 1024 queues. While this might be
awesome from a FQ perspective I really don't think the aqm will be as
good. Or maybe it might be - what happens with 64000 queues at
2) It's currently impossible to shape network traffic across cores.
I'd like to imagine that with a single atomic exchange or sloppily
shared values, shaping would be feasible.
(also softirq is a single thread, I believe)
3) mq and mqprio are commonly deployed on the high end for this.
So I've thought about doing up another version - call it - I dunno -
smq - "smart multi-queue" - and seeing how far we could get.
> - Jonathan Morton
 If you are on this list and are not using flent, tough. I'm not
going through the trouble of generating graphs myself anymore.
Let's go make home routers and wifi faster! With better software!