I don't think that's it.
First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which). The desktop version (7457A) used external cache. The G4 was considered to be *crippled* by its FSB by the end of its run, since it never adopted high-performance signalling techniques, nor moved the memory controller on-die; it was quoted that the G5 (970) could move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB. The only reason the G5 never made it into a PowerBook was that it wasn't battery-friendly in the slightest.
But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old.
More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x 32-bit, or about 133MB/sec peak, and I can push a steady 30MB/sec in both directions simultaneously; that 60MB/sec total is a little under half the PCI bus's theoretical capacity. (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM.
The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy. Further, a little digging suggests that the memory bus should be 32-bit wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed. For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time.
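For anyone who wants to check the arithmetic, here is a rough sketch of the peak-bandwidth figures used above. It assumes the quoted clock rates are taken at face value, counts one transfer per clock unless stated otherwise, and ignores all protocol and refresh overhead; the function name `peak_mb_per_s` is just my own label.

```python
def peak_mb_per_s(clock_mhz, width_bits, transfers_per_clock=1):
    """Theoretical peak bandwidth in MB/s (10^6 bytes per second).

    Assumes an ideal bus: no arbitration, refresh, or turnaround overhead.
    """
    return clock_mhz * (width_bits // 8) * transfers_per_clock

# Standard PCI: nominal 33MHz x 32-bit. The true clock is 33.33MHz,
# which is where the commonly quoted 133MB/s figure comes from.
pci = peak_mb_per_s(33, 32)

# Hypothetical worst case for the router: 16-bit SDRAM at 66MHz.
sdram_16 = peak_mb_per_s(66, 16)

# The configuration suggested above: 32-bit bus at 340MHz.
# Assumption: 340MHz is the bus clock. If it is DDR, double it.
wndr_sdr = peak_mb_per_s(340, 32)
wndr_ddr = peak_mb_per_s(340, 32, transfers_per_clock=2)

# Measured on the PowerBook: 30MB/s each way, simultaneously.
measured_total = 30 * 2
pci_utilisation = measured_total / pci

print(f"PCI peak:            {pci} MB/s")
print(f"16-bit @ 66MHz:      {sdram_16} MB/s")
print(f"32-bit @ 340MHz SDR: {wndr_sdr} MB/s")
print(f"32-bit @ 340MHz DDR: {wndr_ddr} MB/s")
print(f"PCI utilisation:     {pci_utilisation:.0%}")
```

Even counting only one transfer per clock, the 32-bit case comes out above the 1GB/sec I quoted, so that figure is on the conservative side.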
So that takes care of the argument for simply moving the payload around. In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off.
For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.
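To put a number on "relatively small", here is a back-of-envelope sketch. It assumes full-size TCP/IPv4 packets over plain Ethernet with no VLAN tag and no IP or TCP options, which is my choice of example rather than anything measured on the WNDR; smaller packets would push the fraction up.

```python
# Header sizes in bytes, minimum (option-free) case.
ETH_HEADER = 14   # destination MAC, source MAC, EtherType
IPV4_HEADER = 20  # IPv4 header without options
TCP_HEADER = 20   # TCP header without options
MTU = 1500        # standard Ethernet MTU (IP header + everything above it)

def header_fraction(mtu=MTU):
    """Fraction of a full-size frame that the CPU must inspect for shaping."""
    frame_bytes = ETH_HEADER + mtu
    header_bytes = ETH_HEADER + IPV4_HEADER + TCP_HEADER
    return header_bytes / frame_bytes

print(f"Headers are {header_fraction():.1%} of a full-size frame")
```

So for bulk transfers the CPU only needs to look at a few percent of the bytes; the rest can stay where the DMA engine put it.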
And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did. I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi. (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.) Atheros publicity materials indicate that they increased the I-cache to 64KB for performance reasons, but saw no need to increase the D-cache at the same time.
Which brings me back to the timers, and other items of black magic.
Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.