<div dir="ltr"><div dir="ltr">On Wed, Mar 25, 2020 at 12:18 PM Dave Taht <<a href="mailto:dave.taht@gmail.com">dave.taht@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">On Wed, Mar 25, 2020 at 8:58 AM Aaron Wood <<a href="mailto:woody77@gmail.com" target="_blank">woody77@gmail.com</a>> wrote:<br>
><br>
> One other thought I've had with this, is that the apu2 is multi-core, and the i210 is multi-queue.<br>
><br>
> Cake/htb aren't, iirc, setup to run on multiple cores (as the rate limiters then don't talk to each other). But with the correct tuple hashing in the i210, I _should_ be able to split things and do two cores at 500Mbps each (with lots of compute left over).<br>
><br>
> Obviously, that puts a limit on single-connection rates, but as the number of connections climb, they should more or less even out (I remember Dave Taht showing the oddities that happen with say 4 streams and 2 cores, where it's common to end up with 3 streams on the same core). But assuming that the hashing function results in even sharing of streams, it should be fairly balanced (after plotting some binomial distributions with higher "n" values). Still not perfect, especially since streams aren't likely to all be elephants.<br>
<br>
> We live with imperfect per core tcp flow behavior already.

Do you think this idea would make it worse, or better? (I couldn't tell from your comment how, exactly, you meant that.)

OTOH, any gains I'd get over 500Mbps would just be gravy: my current router can't do even that much downstream on a single core, and what I have now is ~400Mbps and not pretty. So even if the sharing is uneven, or all the fat streams land on one core (unlikely once there are 4+ streams), I think I'd still see overall gains (in my situation; others might not).
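As a back-of-the-envelope check on that "unlikely once there are 4+ streams" claim (just an illustration, assuming a uniform hash over two queues):

from math import comb

def prob_lopsided(n_flows: int, k: int) -> float:
    """Chance that the busier of two queues gets at least k of n flows,
    assuming each flow hashes to one of two queues uniformly at random."""
    total = 2 ** n_flows
    bad = sum(comb(n_flows, i) for i in range(n_flows + 1)
              if max(i, n_flows - i) >= k)
    return bad / total

# Dave's 4-streams/2-cores case: 3+ streams on one core 62.5% of the time.
print(prob_lopsided(4, 3))                    # 0.625
# A 75/25-or-worse split gets much rarer as the flow count grows.
for n in (8, 16, 32):
    print(n, prob_lopsided(n, (3 * n) // 4))  # ~0.29, ~0.08, ~0.005

So 4 streams on 2 cores being lopsided most of the time matches what Dave showed, but it washes out pretty quickly as n grows.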
> What I wanted to happen was the "list" ingress improvement to become
> more generally available (I can't find the lwn link at the moment).
> It has. I thought that then we could express a syntax of tc qdisc add
> dev eth0 ingress cake-mq bandwidth whatever, and it would rock.
>
> I figured getting rid of the cost of the existing ifb and tc mirred,
> and having a fast path preserving each hardware queue, then using
> rcu to do a sloppy allocate atomic lock for shaped bandwidth and merge
> every ms or so, might then be low-cost enough. Certainly folding
> everything into a single queue has a cost!

Sharing the tracked state between the cake-mq "threads" and updating it every so often?

Or doing the rate limiting on one core and the fq'ing on another? (I don't think this is what you meant?)
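If it's the former, here's roughly what I'm picturing, as a userspace toy (made-up numbers, and obviously nothing like the actual cake/mq code): each queue spends from a private slice of the total rate on the fast path, and a merge step every millisecond or so reapportions the slices based on what each queue actually used.

import threading

class SloppySharedShaper:
    """Toy model of per-queue shapers sharing one rate budget.

    The fast path touches only per-queue state; a merge step run every
    millisecond or so reapportions the total rate by recent demand.
    Purely illustrative, not how cake or mq actually work.
    """

    def __init__(self, rate_bps: float, n_queues: int):
        self.rate_bps = rate_bps
        self.n_queues = n_queues
        self.slice_bps = [rate_bps / n_queues] * n_queues  # per-queue rate slice
        self.tokens = [0.0] * n_queues                     # spendable bytes per queue
        self.used = [0.0] * n_queues                       # bytes sent since last merge
        self.merge_lock = threading.Lock()                 # only the merge path locks

    def try_send(self, q: int, nbytes: int, dt: float) -> bool:
        """Fast path for queue q: refill its private bucket for the dt seconds
        since its last call, then spend if possible. No cross-queue state."""
        burst_cap = self.slice_bps[q] / 8 * 0.010          # allow ~10 ms of burst
        self.tokens[q] = min(self.tokens[q] + self.slice_bps[q] / 8 * dt, burst_cap)
        if self.tokens[q] >= nbytes:
            self.tokens[q] -= nbytes
            self.used[q] += nbytes
            return True
        return False

    def merge(self) -> None:
        """Slow path, run every ~1 ms: hand busier queues a bigger slice."""
        with self.merge_lock:
            demand = [u + 1.0 for u in self.used]          # idle queues keep a sliver
            total = sum(demand)
            for q in range(self.n_queues):
                self.slice_bps[q] = self.rate_bps * demand[q] / total
                self.used[q] = 0.0

Whether the kernel equivalent of that merge can be made cheap enough (with RCU or otherwise) seems like the real question.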
> I was (before money ran out) prototyping adding a shared shaper to mq
> at one point (no rcu, just ...). There have been so many other things
> tossed around (bpf?).
>
> As for load balancing better, google "RSS++", if you must.

A few years ago (before my current job ate all my brain cycles), I was toying around with taking the ideas from OpenFlow/Open vSwitch and RSS and using them to parallelize tasks like this:

- have N worker threads (say N = real cores, or real cores - 1, or some such), each fed by RSS / RPS / multiqueue, etc.
- have a single controller thread (like the OpenFlow "controller")

Each worker publishes state/observations to the controller, as well as forwarding "decisions to make" to it, while the controller publishes each worker's operating parameters back to it individually.

The workers then just move packets as fast as they can, using their simple rules, with no shared state between the workers and no need to access global tables like connection tracking (e.g. NAT tables mapping NAT'd tuples to LAN address tuples).

The controller deals with the decisions and with balancing the parameters (such as dynamic configuration of the policer to keep things "fair").

I never got much farther than sketches on paper and laying out how I'd do it in a heavily multi-threaded userspace app (workers would use select() to receive the control messages in-band, instead of needing to do shared-memory access).

I was also hoping that it would generalize to the hardware packet accelerators, but I think that to really take advantage of them, they would need to be able to implement this kind of controller/worker split themselves.

And I never seem to have the time to stand up a rough framework for this and try it out...
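The rough shape I had in mind, as a toy sketch (fake packet source, made-up JSON message format, and a trivial balancing policy; none of this has seen real traffic):

import json
import select
import socket
import threading
import time

def worker(idx: int, ctrl: socket.socket, pkts: socket.socket) -> None:
    """Move "packets" using only locally held parameters; pick up new
    parameters from the controller in-band via the same select() call."""
    params = {"weight": 1.0}            # whatever the controller last pushed
    seen_bytes = 0                      # local observation to report upstream
    last_report = time.monotonic()
    while True:
        readable, _, _ = select.select([ctrl, pkts], [], [], 0.01)
        if ctrl in readable:
            msg = ctrl.recv(4096)
            if not msg:
                return                  # controller went away
            params = json.loads(msg)
        if pkts in readable:
            data = pkts.recv(2048)
            if not data:
                return
            seen_bytes += len(data)     # "forward" the packet per params (elided)
        now = time.monotonic()
        if now - last_report > 0.1:     # publish observations ~10x/sec
            ctrl.send(json.dumps({"worker": idx, "bytes": seen_bytes,
                                  "weight": params["weight"]}).encode())
            last_report = now

def controller(ctrl_socks: list) -> None:
    """Single decision-maker: collect observations, push per-worker parameters."""
    totals = {}
    deadline = time.monotonic() + 2.0   # let the toy run for a couple of seconds
    while time.monotonic() < deadline:
        readable, _, _ = select.select(ctrl_socks, [], [], 0.1)
        for s in readable:
            obs = json.loads(s.recv(4096))
            totals[obs["worker"]] = obs["bytes"]
        grand = sum(totals.values()) or 1
        for i, s in enumerate(ctrl_socks):
            # toy policy: busier workers get a larger share of the "budget"
            s.send(json.dumps({"weight": 0.5 + totals.get(i, 0) / grand}).encode())

if __name__ == "__main__":
    n = 2
    ctrl_pairs = [socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM) for _ in range(n)]
    pkt_pairs = [socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM) for _ in range(n)]
    for i in range(n):
        threading.Thread(target=worker, args=(i, ctrl_pairs[i][1], pkt_pairs[i][1]),
                         daemon=True).start()
    threading.Thread(target=controller, args=([p[0] for p in ctrl_pairs],),
                     daemon=True).start()
    for _ in range(150):                # stand-in for an RSS-fed packet source
        for i in range(n):
            pkt_pairs[i][0].send(b"x" * 1000)
        time.sleep(0.01)

The point is just the shape of it: the workers never touch shared tables, and the only coordination is the in-band observation/parameter messages, which is also what I'd hope could map onto hardware offload eventually.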