[Cerowrt-devel] [Bloat] Comcast upped service levels -> WNDR3800 can't cope...

Dave Taht dave.taht at gmail.com
Mon Sep 1 14:32:18 EDT 2014

On Mon, Sep 1, 2014 at 11:06 AM, Jonathan Morton <chromatix99 at gmail.com> wrote:
> On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:
>> On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton <chromatix99 at gmail.com> wrote:
>>> On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:
>>>> Could I get you to also try HFSC?
>>> Once I got a kernel running that included it, and figured out how to make it do what I wanted...
>>> ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.
>> If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think)
>> (unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly),
>> where the others buffer up an extra packet until they can be delivered.
> It's also hilariously opaque to configure, which is probably why nobody uses it - the RED problem again - and the top link when I Googled for best practice on it gushes enthusiastically about Linux 2.2!  The idea of manually specifying an "average packet size" in particular feels intuitively wrong to me.  Still, I might be able to try it later on.

I felt a ewma of egress packet sizes would be a better estimator, yes.
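For reference, a minimal CBQ setup looks something like this (a sketch along the lines of the classic LARTC examples; the device name and rates are placeholders). The static `avpkt` parameter is the manually specified average packet size being objected to above:

```shell
# Root CBQ qdisc: "bandwidth" is the physical link rate, and "avpkt" is
# the static average-packet-size estimate CBQ uses in its rate math.
tc qdisc add dev eth0 root handle 1: cbq bandwidth 100mbit avpkt 1000 cell 8

# A bounded class shaped to 50mbit; note that avpkt (and bandwidth)
# must be repeated here.
tc class add dev eth0 parent 1: classid 1:1 cbq bandwidth 100mbit \
    rate 50mbit allot 1514 prio 5 avpkt 1000 bounded
```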

> Most class-based shapers are probably more complex to set up for simple needs than they need to be.  I have to issue three separate 'tc' invocations for a minimal configuration of each of them, repeating several items of data between them.
> They scale up reasonably well to complex situations, but such uses are relatively rare.
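To make the three-invocation pattern concrete, here is a minimal HTB + fq_codel shaper of the general shape simple.qos sets up (a sketch; the device and rate are placeholders, and the rate information ends up repeated between the qdisc and class layers):

```shell
# 1: root shaper qdisc, defaulting all traffic into class 1:1
tc qdisc add dev eth0 root handle 1: htb default 1
# 2: a single class carrying the actual rate limit
tc class add dev eth0 parent 1: classid 1:1 htb rate 50mbit ceil 50mbit
# 3: the FQ/AQM leaf qdisc attached under that class
tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel
```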

>> In my quest for absolutely minimal latency I'd love to be rid of that
>> last extra non-in-the-fq_codel-qdisc packet... either with a "peek"
>> operation or with a running estimate.
> I suspect that something like fq_codel which included its own shaper (with the knobs set sensibly by default) would gain more traction via ease of use - and might even answer your wish.

I agree that a simpler-to-use qdisc would be good. I'd like something
that preserves multiple (3-4) service classes (as pfifo_fast and
sch_fq do) using DRR, deals with diffserv, and could be invoked with a
command line like:

tc qdisc add dev eth0 cake bandwidth 50mbit diffservmap std

I had started on that (basically porting cerowrt's "simple.qos" code
into C with a simple lookup table for diffserv) many moons ago, but
the contents of the yurtlab, that code included, were stolen - and I
was (and remain) completely stuck on how to do soft rate limiting more
sanely, particularly in asymmetric scenarios.

("cake" stood for "Common Applications Kept Enhanced". fq_codel is not
a drop-in replacement for pfifo_fast, due to its classless nature.
sch_fq comes closer, but it's more server-oriented. QFQ with 4
weighted bands + fq_codel can be made to do the levels-of-service
stuff fairly straightforwardly at line rate, but the tc "filter" code
tends to get rather long to handle all the diffserv classes...)
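A QFQ-based sketch of that idea might begin like this (hypothetical handles and weights; only two of the many DSCP matches are shown, which is exactly why the filter list gets long):

```shell
# Four weighted QFQ classes, each with an fq_codel leaf.
tc qdisc add dev eth0 root handle 1: qfq
for i in 1 2 3 4; do
    tc class add dev eth0 parent 1: classid 1:$i qfq weight $((i * 10))
    tc qdisc add dev eth0 parent 1:$i fq_codel
done

# Then one u32 filter per diffserv codepoint; the 0xfc mask ignores
# the ECN bits. Two examples: EF (DSCP 46) -> 1:4, CS1 (DSCP 8) -> 1:1.
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dsfield 0xb8 0xfc flowid 1:4
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dsfield 0x20 0xfc flowid 1:1
```

Repeating that filter stanza for every codepoint in every diffserv class is the part that balloons.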

So... we keep polishing the SQM system, and I keep tracking progress
on how diffserv classification will be done in the future (in IETF
groups like rmcat and dart); figuring out how to deal better with
aggregating MACs in general is what keeps me awake nights, more than
finishing cake...

We'll get there, eventually.

>> It would be cool to be able to program the ethernet hardware itself to
>> return completion interrupts at a given transmit rate (so you could
>> program the hardware to be any bandwidth not just 10/100/1000). Some
>> hardware so far as I know supports this with a "pacing" feature.
> Is there a summary of hardware features like this anywhere?  It'd be nice to see what us GEM and RTL proles are missing out on.  :-)

I'd like one. There are certain third-party firmwares, Octeon's in
particular, where it seems possible to add more features to the
firmware co-processor.

>>> Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.
>> You will see it bound by the softirq thread, but, what, exactly,
>> inside that, is kind of unknown. (I presently lack time to build up
>> profilable kernels on these low end arches. )
> When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook only has to run the server end of netperf), the bandwidth maxed out at about 300Mbps each way, and the softirq was bouncing around 60% CPU.  I'm pretty sure most of that is shoving stuff across the PCI bus (even though it's internal to the northbridge), or at least waiting for it to go there.  I'm happy to assume that the rest was mostly kernel-userspace interface overhead to the netserver instances.

perf and the older oprofile are our friends here.

> But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without.  The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around.  Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%.  I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.
> It's possible, however, that we're not really looking at a CPU limitation, but a timer problem.  The PowerBook is a "proper" desktop computer with hardware to match (modulo its age).  If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?

Both good questions worth further exploration.

>  - Jonathan Morton

Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
