[Bloat] Comcast upped service levels -> WNDR3800 can't cope...

Jonathan Morton chromatix99 at gmail.com
Mon Sep 1 14:06:56 EDT 2014


On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:

> On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton <chromatix99 at gmail.com> wrote:
>> 
>> On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:
>> 
>>> Could I get you to also try HFSC?
>> 
>> Once I got a kernel running that included it, and figured out how to make it do what I wanted...
>> 
>> ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.
> 
> If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think)
> (unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly),
> where the others buffer up an extra packet until they can be delivered.

It's also hilariously opaque to configure, which is probably why nobody uses it - the RED problem again - and the top link when I Googled for best practice on it gushes enthusiastically about Linux 2.2!  The idea of manually specifying an "average packet size" in particular feels intuitively wrong to me.  Still, I might be able to try it later on.
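For reference, the kind of incantation I'd expect to need - adapted from the LARTC-era examples, so treat it as an untested sketch with placeholder device and rates rather than a working recipe - goes something like this, with 'avpkt' and 'allot' being exactly the magic numbers I'm complaining about:

  tc qdisc add dev eth0 root handle 1: cbq bandwidth 1000mbit avpkt 1500
  tc class add dev eth0 parent 1: classid 1:1 cbq bandwidth 1000mbit \
      rate 400mbit allot 1514 avpkt 1500 bounded
  tc qdisc add dev eth0 parent 1:1 fq_codel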

Most class-based shapers are probably more complex to set up than simple needs warrant.  I have to issue three separate 'tc' invocations for a minimal configuration of each of them, repeating several items of data between them.  They scale up reasonably well to complex situations, but such situations are relatively rare.
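For comparison, a minimal HTB-plus-fq_codel setup (device name and rate here are just placeholders) already takes three lines, with the device, handle and classid all repeated by hand:

  tc qdisc add dev eth0 root handle 1: htb default 1
  tc class add dev eth0 parent 1: classid 1:1 htb rate 400mbit
  tc qdisc add dev eth0 parent 1:1 fq_codel

Get the 'default' classid wrong, for instance, and HTB quietly sends traffic past the shaper via its direct queue.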

> In my quest for absolutely minimal latency I'd love to be rid of that
> last extra non-in-the-fq_codel-qdisc packet... either with a "peek"
> operation or with a running estimate.

I suspect that something like fq_codel which included its own shaper (with the knobs set sensibly by default) would gain more traction via ease of use - and might even answer your wish.
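Just to sketch what I mean - the qdisc name and its knob below are entirely hypothetical, nothing like it exists in tc today - configuration could collapse to a single line:

  tc qdisc add dev eth0 root shaped_fq_codel bandwidth 400mbit

One invocation, one rate, nothing to keep consistent with anything else.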

> It would be cool to be able to program the ethernet hardware itself to
> return completion interrupts at a given transmit rate (so you could
> program the hardware to be any bandwidth not just 10/100/1000). Some
> hardware so far as I know supports this with a "pacing" feature.

Is there a summary of hardware features like this anywhere?  It'd be nice to see what us GEM and RTL proles are missing out on.  :-)

>> Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.
> 
> You will see it bound by the softirq thread, but, what, exactly,
> inside that, is kind of unknown. (I presently lack time to build up
> profilable kernels on these low end arches. )

When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook only has to run the server end of netperf), the bandwidth maxed out at about 300Mbps each way, and the softirq was bouncing around 60% CPU.  I'm pretty sure most of that is shoving stuff across the PCI bus (even though it's internal to the northbridge), or at least waiting for it to go there.  I'm happy to assume that the rest was mostly kernel-userspace interface overhead to the netserver instances.

But this doesn't really answer the question of why the WNDR's ceiling is so much lower with shaping than without.  The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around.  Even when I turn the shaping knob up to a value quite close to the hardware's unshaped capability (e.g. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%.  I estimate that shaping costs only about 500 clocks per packet *unless* it saturates the PCI bus.
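(Back-of-envelope: 400Mbps of 1500-byte packets is roughly 33,000 packets per second, so 500 clocks each comes to under 20MHz of CPU - which on a G4 clocked in the gigahertz range really is lost in the noise.)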

It's possible, however, that we're not really looking at a CPU limitation, but a timer problem.  The PowerBook is a "proper" desktop computer with hardware to match (modulo its age).  If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?
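That should be easy enough to check on the device itself - something along these lines, assuming the kernel exposes /proc/config.gz and /proc/timer_list (busybox zcat and grep should cope):

  # was the kernel built with high-resolution timer support?
  zcat /proc/config.gz | grep HIGH_RES_TIMERS
  # and is high-resolution mode actually active per CPU?
  grep hres_active /proc/timer_list

If hres_active reads 0, hrtimers are falling back to the tick period - typically a few milliseconds - which would make a mess of shaping at these rates.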

 - Jonathan Morton



