[Cerowrt-devel] [Bloat] Comcast upped service levels -> WNDR3800 can't cope...

Tue Sep 2 05:27:06 EDT 2014

On 2 Sep, 2014, at 1:14 am, Aaron Wood wrote:

>> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small.  The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC.  It has to do that anyway to route them, and without shaping there'd be more of them to handle.  The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact.  It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.
> 
> In an ideal case, yes.  But is that how this gets managed?  (I have no idea, I'm certainly not a kernel developer).

It would be monumentally stupid to integrate two GigE MACs onto an SoC, and then to call it a "network processor", without adequate DMA support.  I don't think Atheros are that stupid.

Here's a more detailed datasheet:
	http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SW6/DSASW00118777.pdf

"Another memory factor is the ability to support multiple I/O operations in parallel via the WNPU's various ports. The on-chip SRAM in AR7100 WNPUs has 5 ports that enable simultaneous access to and from five sources: the two gigabit Ethernet ports, the PCI port, the USB 2.0 port and the MIPS processor."

It's a reasonable question, however, whether the driver uses that support properly.  Mainline Linux kernel code seems to support the SoC but not the Ethernet; if it were just a minor variant of some other Atheros hardware, I'd have expected to see it integrated into one of the existing drivers.  Or maybe it is, and my greps just aren't showing it.

At minimum, however, there are MMIO ranges reported for each MAC during OpenWRT's boot sequence.  That's where the ring buffers are.  The most the CPU has to do is read each packet from RAM and write it into those buffers, or vice versa for receive - I think that's what my PowerBook has to do.  Ideally, a bog-standard DMA engine would take over that simple duty.  Either way, that's something that has to happen whether it's shaped or not, so it's unlikely to be our problem.

The same goes for the wireless MACs, incidentally.  These are standard ath9k mini-PCI cards, and the drivers *are* in mainline.  There shouldn't be any surprises with them.

> If the packet data is getting moved about from buffer to buffer (for instance to do the htb calculations?) could that substantially change the processing load?

The qdiscs only deal with packet and socket headers, not the full packet data.  Even then, they largely pass pointers around, inserting the headers into linked lists rather than copying them into arrays.  I believe a lot of attention has been directed at cache-friendliness in this area, and the MIPS caches are of a conventional type.

>> Which brings me back to the timers, and other items of black magic.
> 
> Which would point to under-utilizing the processor core, while still having high load? (I'm not seeing that, I'm curious if that would be the case).

It probably wouldn't manifest as high system load.  Rather, poor timer resolution or latency would show up as excessive delays between packets, during which the CPU is idle.  The packet egress times may turn out to be quantised - that would be a smoking gun, if detectable.
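
One way to look for that smoking gun is sketched below: a minimal check, assuming egress timestamps have already been captured somehow (for example from a packet trace on the wire) and converted to integer microseconds. The function name, the tick length and the tolerance are all hypothetical choices for illustration, not anything taken from the kernel or from CeroWrt.

```python
def quantised_fraction(timestamps_us, tick_us, tolerance=0.05):
    """Fraction of inter-packet gaps lying close to a whole multiple of tick_us.

    timestamps_us: sorted egress times in integer microseconds (assumed input).
    tick_us:       suspected timer period in microseconds (a guess to be tested).
    tolerance:     allowed distance from a multiple, as a fraction of one tick.
    """
    gaps = [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]
    if not gaps:
        return 0.0
    hits = 0
    for gap in gaps:
        multiples = gap / tick_us
        nearest = round(multiples)
        # Back-to-back packets (nearest == 0) are bursts, not timer waits.
        if nearest >= 1 and abs(multiples - nearest) <= tolerance:
            hits += 1
    return hits / len(gaps)


if __name__ == "__main__":
    # Made-up trace: most gaps sit on a 1000 us grid, one is a 10 us burst gap.
    trace = [0, 1000, 2000, 4000, 4010, 5000]
    print(quantised_fraction(trace, tick_us=1000))
```

If the gaps were unrelated to the timer, only roughly 2 x tolerance of them would be expected to land near a multiple by chance, so a fraction well above that for some plausible tick length would suggest quantised egress.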

>> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link.  I assume that's not a factor here.
> 
> That's the usual suspicion.  But these are RF-chamber, short-range lab setups where the radios are running at full speed in perfect environments...

Sure.  But even turbocharged 'n' gear tops out at 450Mbps signalling, and much less than that is available even theoretically for TCP/IP throughput.  My point is that you're probably not running *your* tests over wireless.

> What this makes me realize is that I should go instrument the cpu stats with each of the various operating modes:
> 
> * no shaping, anywhere
> * egress shaping
> * egress and ingress shaping at various limited levels:
>     * 10Mbps
>     * 20Mbps
>     * 50Mbps
>     * 100Mbps

Smaller increments at the high end of the range may prove to be useful.  I would expect the CPU usage to climb nonlinearly (busy-waiting) if there's a bottleneck in a peripheral device, such as the PCI bus.  The way the kernel classifies that usage may also be revealing.
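
To see how the kernel classifies the usage, the aggregate "cpu" line of /proc/stat can be sampled before and after a test run and the deltas compared. The following is a minimal sketch under that assumption; the helper names are hypothetical, and if Python isn't available on the router the two /proc/stat snapshots can be copied off and parsed elsewhere. On a router, forwarding and shaping work would typically be expected to show up as softirq time rather than user time.

```python
import time

# Order of the first seven counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(stat_text):
    """Extract the aggregate CPU counters (in clock ticks) from /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_breakdown(before, after):
    """Percentage of elapsed CPU time spent in each class between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * value / total for name, value in deltas.items()}


def sample(interval_s=1.0, path="/proc/stat"):
    """Take two snapshots interval_s apart and return the percentage breakdown."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_breakdown(before, after)
```

Running sample() once per shaping mode and rate while a transfer is in progress would give a table of user/system/softirq/idle percentages to compare across the modes listed above.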

> Heck, what about running HTB simply from a 1ms timer instead of from a data driven timer?

That might be what's already happening.  We have to figure that out before we can work out a solution.
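
For a sense of scale, here is the arithmetic for a shaper that only releases packets on a fixed timer tick. This is a simplified model for illustration, not a description of how HTB is actually implemented; the function names and the 1500-byte packet size are assumptions.

```python
def bytes_per_tick(rate_bps, tick_us):
    """Bytes a shaper must release per timer tick to sustain rate_bps."""
    # bits/s * microseconds / (8 bits per byte * 1e6 us per second)
    return rate_bps * tick_us // 8_000_000


def packets_per_tick(rate_bps, tick_us=1000, packet_bytes=1500):
    """Average number of full-size packets released per tick (may be < 1)."""
    return bytes_per_tick(rate_bps, tick_us) / packet_bytes


if __name__ == "__main__":
    for mbps in (10, 20, 50, 100):
        rate = mbps * 1_000_000
        print(mbps, "Mbps:", bytes_per_tick(rate, 1000), "bytes per 1 ms tick,",
              round(packets_per_tick(rate), 2), "packets")
```

Under this model, 100Mbps with a 1ms tick means 12500 bytes per tick, so packets would leave in bursts of about eight full-size frames followed by idle time, while at 10Mbps a tick carries less than one full-size packet.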

 - Jonathan Morton