Luckily, I don't mind being wrong (or even _way_ off the mark).

I don't think that's it.

> First a nitpick: the PowerBook version of the late-model G4 (7447A)
> doesn't have the external L3 cache interface, so it only has the 256KB or
> 512KB internal L2 cache (I forget which).  The desktop version (7457A) used
> external cache.  The G4 was considered to be *crippled* by its FSB by the
> end of its run, since it never adopted high-performance signalling
> techniques, nor moved the memory controller on-die; it was quoted that the
> G5 (970) could move data using *single-byte* operations faster than the
> *peak* throughput of the G4's FSB.  The only reason the G5 never made it
> into a PowerBook was because it wasn't battery-friendly in the slightest.

And the specs on the G4 that I'd dug up were desktop specs.

> But that makes little difference to your argument - compared to a cheap
> CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware,
> even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have
> more bandwidth than my PowerBook's PCI bus.  Standard PCI is 33MHz x
> 32-bit, and I can push a steady 30MB/sec in both directions simultaneously,
> which corresponds in total to about half the PCI bus's theoretical
> capacity.  (The GEM reports 66MHz capability, but it shares the bus with an
> IDE controller which doesn't, so I assume it is stuck at 33MHz.)  A 16-bit
> RAM should be able to match PCI if it runs at 66MHz, which is the lower
> limit of JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which
> implies at least 200MHz unless the integrator was colossally stingy.
> Further, a little digging suggests that the memory bus should be 32-bit
> wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz,
> half the CPU core speed.  For an embedded SoC, that's really not too bad -
> it should be able to sustain 1GB/sec, in one direction at a time.

The kernel boot messages report 170MHz DDR operation, for 340MHz data-rates.

But I don't think it's 32-bit; I think it's running two banks of 64MB
chips in x16 mode.  That's based on my experience with other, similar
chips.  The AR7161 datasheet here:
https://wikidevi.com/files/Atheros/specsheets/AR7161.pdf says it's DDR1,
but doesn't give the bus width.

But even if it had only an 8-bit bus, it sounds like it would still be able
to move packets pretty well, so I don't think memory bandwidth is the limit
(2Gbps vs. 8Gbps).
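
To put rough numbers on that, here is a back-of-the-envelope sketch of peak
DDR throughput at the 340MHz data rate the kernel reports, for each bus width
under discussion.  The function name is my own, and these are theoretical
peaks that ignore refresh, bank conflicts and bus turnaround, so real figures
will be lower.

```python
def peak_gbps(transfers_mhz: float, bus_width_bits: int) -> float:
    """Theoretical peak throughput in Gbps: transfers per second x bits per transfer."""
    return transfers_mhz * 1e6 * bus_width_bits / 1e9


# 170MHz DDR clock, two transfers per clock (from the kernel boot messages)
DATA_RATE_MHZ = 340

for width in (8, 16, 32):
    gbps = peak_gbps(DATA_RATE_MHZ, width)
    print(f"{width:2d}-bit bus: {gbps:5.2f} Gbps ({gbps / 8:.2f} GB/s)")
```

Keep in mind that a forwarded packet crosses the memory bus at least twice
(DMA in, DMA out), so the usable forwarding rate is at best half of whichever
peak applies.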

> So that takes care of the argument for simply moving the payload around.
> In any case, the WNDR demonstrably *can* cope with the available bandwidth
> if the shaping is turned off.
>
> For the purposes of shaping, the CPU shouldn't need to touch the majority
> of the payload - only the headers, which are relatively small.  The bulk of
> the payload should DMA from one NIC to RAM, then DMA back out of RAM to the
> other NIC.  It has to do that anyway to route them, and without shaping
> there'd be more of them to handle.  The difference might be in the data
> structures used by the shaper itself, but I think those are also reasonably
> compact.  It doesn't even have to touch userspace, since it's not acting as
> the endpoint as my PowerBook was during my tests.

In an ideal case, yes.  But is that how this actually gets managed?  (I have
no idea; I'm certainly not a kernel developer.)

If the packet data is getting moved about from buffer to buffer (for
instance, to do the HTB calculations), could that substantially change the
processing load?

> And while the MIPS 24K core is old, it's also been die-shrunk over the
> intervening years, so it runs a lot faster than it originally did.  I very
> much doubt that it's as refined as my G4, but it could probably hold its
> own relative to a comparable ARM SoC such as the Raspberry Pi.
> (Unfortunately, the latter doesn't have the I/O capacity to do high-speed
> networking - USB only.)  Atheros publicity materials indicate that they
> increased the I-cache to 64KB for performance reasons, but saw no need to
> increase the D-cache at the same time.

But they also have a core that's designed to do little to no processing of
the data: just DMA from one side to the other while validating the
firewall rules.  So the D-cache may be sufficient for that, without leaving
the capacity to do anything else.

> Which brings me back to the timers, and other items of black magic.

Which would point to the processor core being under-utilized while still
showing high load?  (I'm not seeing that; I'm curious whether that would be
the case.)

> Incidentally, transfer speed benchmarks involving wireless will certainly
> be limited by the wireless link.  I assume that's not a factor here.

That's the usual suspicion.  But these are RF-chamber, short-range lab
setups where the radios are running at full speed in perfect environments...


What this makes me realize is that I should go instrument the CPU stats
under each of the various operating modes:

* no shaping, anywhere
* egress shaping
* egress and ingress shaping at various limited levels:
    * 10Mbps
    * 20Mbps
    * 50Mbps
    * 100Mbps
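
As a starting point for that, here is a minimal sketch of how the CPU
breakdown could be sampled from /proc/stat during each run.  The function
names and the 10-second default interval are my own choices, not anything from
an existing tool.  It is in Python only to show the arithmetic; on the router
itself the same calculation could be done with whatever is at hand.  I would
expect the softirq column to be the interesting one, since as far as I
understand it that is where the kernel does most of its packet handling.

```python
import time

# Leading columns of the aggregate "cpu" line in /proc/stat, in jiffies.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def usage_percent(before: dict, after: dict) -> dict:
    """Share of elapsed jiffies spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values()) or 1  # avoid dividing by zero on identical samples
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu() -> dict:
    """Read the current counters; the first line of /proc/stat is the aggregate."""
    with open("/proc/stat") as f:
        return parse_cpu_line(f.readline())


def sample(interval_s: float = 10.0) -> dict:
    """Measure the CPU breakdown over one interval while a test is running."""
    before = read_cpu()
    time.sleep(interval_s)
    return usage_percent(before, read_cpu())
```

Calling `sample()` once per operating mode while the transfer is in progress,
and recording the result next to the measured throughput, should show whether
the shaped runs are spending their time in softirq, in system time, or sitting
idle.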

