Luckily, I don't mind being wrong (or even _way_ off the mark).

I don't think that's it.

First, a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which). The desktop version (7457A) used external cache. By the end of its run, the G4 was considered to be *crippled* by its FSB, since it never adopted high-performance signalling techniques or moved the memory controller on-die; the G5 (970) was said to move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB. The only reason the G5 never made it into a PowerBook was that it wasn't battery-friendly in the slightest.

And the specs on the G4 that I'd dug up were desktop specs.

But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old.

More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x 32-bit, and I can push a steady 30MB/sec in both directions simultaneously, which corresponds in total to about half the PCI bus's theoretical capacity. (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM.
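To make that arithmetic explicit, here is a quick back-of-the-envelope sketch. The helper name is just for illustration, the clock is taken as a round 33MHz as above, and these are theoretical peaks that ignore arbitration and protocol overhead.

```python
def peak_mb_per_s(clock_mhz, width_bits, transfers_per_clock=1):
    """Theoretical peak of a parallel bus in MB/s (10^6 bytes per second)."""
    return clock_mhz * transfers_per_clock * width_bits / 8

# Standard PCI: 33MHz x 32-bit, shared by both directions.
pci = peak_mb_per_s(33, 32)

# Measured on the PowerBook: a steady 30MB/sec each way, simultaneously.
measured = 30 + 30
fraction = measured / pci

# A 16-bit single-data-rate SDRAM at 66MHz, the slowest JEDEC grade.
sdram16 = peak_mb_per_s(66, 16)

print(f"PCI peak:          {pci:.0f} MB/s")
print(f"Measured total:    {measured} MB/s ({fraction:.0%} of PCI peak)")
print(f"16-bit SDRAM peak: {sdram16:.0f} MB/s")
```

So the measured 60MB/sec is a bit under half of PCI's peak, and a 16-bit RAM at 66MHz has the same theoretical peak as the whole PCI bus.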

The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy. Further, a little digging suggests that the memory bus should be 32-bit wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed. For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time.

The kernel boot messages report 170MHz DDR operation, for 340MHz data-rates.

But I don't think it's 32-bit; I think it's running two banks of 64MB chips in x16 mode. That's based on my experience with other, similar chips. The AR7161 datasheet (https://wikidevi.com/files/Atheros/specsheets/AR7161.pdf) notes that it's DDR1, but doesn't give the bus width.

But even with an 8-bit bus, it sounds like it could move packets pretty well, so I don't think memory bandwidth is the limit (2Gbps vs. 8Gbps).
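For reference, the peak figures for the candidate bus widths can be worked out as below. This is only a rough sketch: 340 MT/s is the data rate from the boot messages, the 8/16/32-bit widths are the guesses being discussed, and the results are theoretical peaks, not sustained rates.

```python
def peak_gbit_per_s(transfers_mt_s, width_bits):
    """Theoretical peak in Gbit/s: million transfers/s times bits per transfer."""
    return transfers_mt_s * width_bits / 1000

DATA_RATE_MT_S = 340  # 170MHz DDR, per the kernel boot messages

# Bus widths under discussion; which one the WNDR really uses is not confirmed.
peaks = {width: peak_gbit_per_s(DATA_RATE_MT_S, width) for width in (8, 16, 32)}

for width, gbps in peaks.items():
    print(f"{width:2d}-bit: {gbps:5.2f} Gbit/s peak ({gbps / 8:.2f} GB/s)")
```

Keep in mind that a routed packet crosses RAM at least twice (DMA in, DMA out), so the usable forwarding rate is at most half of these numbers.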

So that takes care of the argument for simply moving the payload around. In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off.

For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.

In an ideal case, yes. But is that how this gets managed? (I have no idea; I'm certainly not a kernel developer.)

If the packet data is getting moved about from buffer to buffer (for instance, to do the HTB calculations), could that substantially change the processing load?

And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did. I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi. (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.) Atheros publicity materials indicate that they increased the I-cache to 64KB for performance reasons, but saw no need to increase the D-cache at the same time.

But they also have a core that's designed to do little to no processing of the data: just DMA from one side to the other while validating the firewall rules. So the D-cache may be sufficient for that, without having the capacity to do anything else.

Which brings me back to the timers, and other items of black magic.

Which would point to under-utilizing the processor core while still having high load? (I'm not seeing that, but I'm curious whether that would be the case.)

Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.

That's the usual suspicion. But these are RF-chamber, short-range lab setups where the radios are running at full speed in perfect environments...

======

What this makes me realize is that I should go instrument the CPU stats in each of the various operating modes:

* no shaping, anywhere

* egress shaping

* egress and ingress shaping at various limited levels:

    * 10Mbps
    * 20Mbps
    * 50Mbps
    * 100Mbps
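
One way to collect those numbers is to sample the aggregate "cpu" line in /proc/stat before and after each test run and look at where the time went. The following is a minimal sketch, not a finished tool: the function names are made up for illustration, it assumes a Linux /proc/stat, and it assumes Python is available on the box (on a stock router image it probably isn't, in which case the same fields can be read by hand). The softirq column is the one I'd watch, on the assumption that most packet forwarding work is accounted there.

```python
import time

# Column order of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels report fewer columns; treat missing ones as zero.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")

def usage_percent(before, after):
    """Share of elapsed CPU time spent in each category, in percent."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}

def sample(interval_s=5.0, path="/proc/stat"):
    """Read the counters twice, interval_s apart, and return the percentages."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return usage_percent(before, after)
```

Calling sample() once per mode while the transfer is running, and comparing the idle and softirq percentages across modes, should show whether shaping is what eats the CPU.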

-Aaron