[Bloat] Comcast upped service levels -> WNDR3800 can't cope...

Dave Taht dave.taht at gmail.com
Mon Sep 1 13:01:45 EDT 2014


On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton <chromatix99 at gmail.com> wrote:
>
> On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:
>
>> Could I get you to also try HFSC?
>
> Once I got a kernel running that included it, and figured out how to make it do what I wanted...
>
> ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.

If you are feeling really inspired, try cbq. :) One thing I sort of
like about cbq is that (I think, unlike htb presently) it operates off
an estimated size for the next packet (which isn't dynamic, sadly),
whereas the others dequeue an extra packet and buffer it until it can
be delivered.

In my quest for absolutely minimal latency I'd love to be rid of that
last extra packet sitting outside the fq_codel qdisc... either with a
"peek" operation or with a running estimate. I think this would (along
with killing the maxpacket check in codel) allow for a faster system
with less tuning (in particular, no tweaks below 2.5Mbit) across the
entire operational range of ethernet.
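
To make the idea concrete, here's a rough userspace sketch (all names
invented, nothing to do with the actual kernel code) of a token-bucket
dequeue path that peeks at the head-of-line packet instead of pulling
it out early and holding it where the AQM can no longer touch it:

/* Rough userspace sketch of the "peek" idea; names are invented and
 * this is not kernel code. */
#include <stdint.h>
#include <stdio.h>

struct pkt { uint32_t len; };

struct queue {
    struct pkt pkts[256];
    int head, tail;
};

static struct pkt *queue_peek(struct queue *q)
{
    return q->head == q->tail ? NULL : &q->pkts[q->head];
}

static struct pkt *queue_dequeue(struct queue *q)
{
    struct pkt *p = queue_peek(q);
    if (p)
        q->head = (q->head + 1) % 256;
    return p;
}

struct shaper {
    int64_t tokens;       /* byte credit, refilled at the shaped rate */
    struct queue *inner;  /* stands in for the fq_codel-ish inner queue */
};

/* Peek first: the packet stays inside the inner queue (still visible to
 * the AQM) until there is enough credit to transmit it immediately.
 * Dequeue-then-wait is what strands one packet outside the AQM. */
static struct pkt *shaper_dequeue(struct shaper *s)
{
    struct pkt *p = queue_peek(s->inner);

    if (!p || s->tokens < (int64_t)p->len)
        return NULL;      /* not enough credit yet; leave it queued */

    s->tokens -= p->len;
    return queue_dequeue(s->inner);
}

int main(void)
{
    struct queue q = { .pkts = { { 1500 } }, .head = 0, .tail = 1 };
    struct shaper s = { .tokens = 1000, .inner = &q };

    printf("first try: %s\n", shaper_dequeue(&s) ? "sent" : "held");
    s.tokens += 1000;     /* credit arrives later */
    printf("second try: %s\n", shaper_dequeue(&s) ? "sent" : "held");
    return 0;
}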

There would also need to be some support for what I call "GRO
slicing", where a large aggregated receive is split back into
individual packets so that a drop decision can be made on each one.
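
Roughly what I mean, as a hedged userspace sketch (the MTU constant
and the drop hook are just stand-ins, not the real kernel path):

/* Sketch of "GRO slicing": a large aggregated receive is split back
 * into MTU-sized segments so a per-packet drop decision (stubbed out
 * here) applies to one segment, not the whole aggregate. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MTU 1500

/* Stand-in for whatever the AQM (e.g. codel) would decide. */
static bool should_drop(size_t seg_index)
{
    (void)seg_index;
    return false;
}

/* Slice an aggregate of 'total' bytes into MTU-sized segments,
 * consulting the drop decision for each one; returns bytes kept. */
static size_t gro_slice(size_t total)
{
    size_t kept = 0, idx = 0;

    for (size_t off = 0; off < total; off += MTU, idx++) {
        size_t seg = total - off < MTU ? total - off : MTU;

        if (!should_drop(idx))
            kept += seg;  /* forward this segment */
        /* else: only this one segment is lost, not the aggregate */
    }
    return kept;
}

int main(void)
{
    /* A 64KB aggregate sliced into ~43 full segments plus a remainder. */
    printf("kept %zu of %d bytes\n", gro_slice(65536), 65536);
    return 0;
}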

It would be cool to be able to program the ethernet hardware itself to
return completion interrupts at a given transmit rate (so you could
program the hardware to any bandwidth, not just 10/100/1000). Some
hardware, so far as I know, supports this with a "pacing" feature.

This doesn't help on inbound rate limiting, unfortunately, just egress.
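
The closest software analogue today, as far as I know, is per-socket
pacing via SO_MAX_PACING_RATE with the fq qdisc (linux 3.13+); a rough
sketch of that, not the hardware feature described above, with minimal
error handling:

/* Software analogue only, not hardware completion-interrupt pacing:
 * cap a socket's pacing rate with SO_MAX_PACING_RATE (honored by the
 * fq qdisc on Linux 3.13+). */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47  /* asm-generic/socket.h value */
#endif

static int set_pacing_rate(int fd, uint32_t bytes_per_sec)
{
    return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                      &bytes_per_sec, sizeof(bytes_per_sec));
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* e.g. emulate a 2.5Mbit link for this one socket */
    if (fd < 0 || set_pacing_rate(fd, 2500000 / 8) < 0)
        perror("pacing");
    return 0;
}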

> Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.

You will see it bound by the softirq thread, but what exactly inside
that is eating the cycles is kind of unknown. (I presently lack the
time to build profilable kernels on these low-end arches.)

> Something about TBF causes more overhead - it goes through periods of lower CPU use similar to the other shapers, but then spends periods at considerably higher CPU load, all without changing the overall throughput.

> The flip side of this is that TBF might be producing a smoother stream of packets.  The receiving computer (which is fast enough to notice such things) reports a substantially larger number of recv() calls are required to take in the data from TBF than from anything else - averaging about 4.4KB rather than 9KB or so.  But at these data rates, it probably matters little.

Well, htb has various tuning options (see quantum and burst) that
alter its behavior along the lines of what you're seeing from tbf.

>
> FWIW, apparently Apple's variant of the GEM chipset doesn't support jumbo frames.  This does, however, mean that I'm definitely working with an MTU of 1500, similar to what would be sent over the Internet.
>
> These tests were all run using nttcp.  I wanted to finally try out RRUL, but the wrappers fail to install via pip on my Gentoo boxes.  I'll need to investigate further before I can make pretty graphs like everyone else.
>
>  - Jonathan Morton
>



-- 
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article


