Luckily, I don't mind being wrong (or even _way_ off the mark). I don't
think that's it.

>
> First a nitpick: the PowerBook version of the late-model G4 (7447A)
> doesn't have the external L3 cache interface, so it only has the 256KB or
> 512KB internal L2 cache (I forget which). The desktop version (7457A) used
> external cache. The G4 was considered to be *crippled* by its FSB by the
> end of its run, since it never adopted high-performance signalling
> techniques, nor moved the memory controller on-die; it was quoted that the
> G5 (970) could move data using *single-byte* operations faster than the
> *peak* throughput of the G4's FSB. The only reason the G5 never made it
> into a PowerBook was because it wasn't battery-friendly in the slightest.
>

And the specs on the G4 that I'd dug up were desktop specs.

> But that makes little difference to your argument - compared to a cheap
> CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware,
> even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have
> more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x
> 32-bit, and I can push a steady 30MB/sec in both directions simultaneously,
> which corresponds in total to about half the PCI bus's theoretical
> capacity. (The GEM reports 66MHz capability, but it shares the bus with an
> IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit
> RAM should be able to match PCI if it runs at 66MHz, which is the lower
> limit of JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which
> implies at least 200MHz unless the integrator was colossally stingy.
> Further, a little digging suggests that the memory bus should be 32-bit
> wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz,
> half the CPU core speed.
> For an embedded SoC, that's really not too bad -
> it should be able to sustain 1GB/sec, in one direction at a time.
>

The kernel boot messages report 170MHz DDR operation, for 340MHz
data-rates. But, I don't think it's 32-bit, I think it's running two
banks of 64MB chips in x16 mode. That's based on my experiences with
other, similar chips.

The AR7161 datasheet here:
https://wikidevi.com/files/Atheros/specsheets/AR7161.pdf
notes it's DDR1, but not the bus width. But even if it had an 8-bit bus,
it sounds like it would have the ability to move packets pretty well, so
that's not the case (2Gbps vs. 8Gbps)

> So that takes care of the argument for simply moving the payload around.
> In any case, the WNDR demonstrably *can* cope with the available bandwidth
> if the shaping is turned off.
>
> For the purposes of shaping, the CPU shouldn't need to touch the majority
> of the payload - only the headers, which are relatively small. The bulk of
> the payload should DMA from one NIC to RAM, then DMA back out of RAM to the
> other NIC. It has to do that anyway to route them, and without shaping
> there'd be more of them to handle. The difference might be in the data
> structures used by the shaper itself, but I think those are also reasonably
> compact. It doesn't even have to touch userspace, since it's not acting as
> the endpoint as my PowerBook was during my tests.
>

In an ideal case, yes. But is that how this gets managed? (I have no
idea, I'm certainly not a kernel developer). If the packet data is
getting moved about from buffer to buffer (for instance to do the htb
calculations?) could that substantially change the processing load?

> And while the MIPS 24K core is old, it's also been die-shrunk over the
> intervening years, so it runs a lot faster than it originally did. I very
> much doubt that it's as refined as my G4, but it could probably hold its
> own relative to a comparable ARM SoC such as the Raspberry Pi.
> (Unfortunately, the latter doesn't have the I/O capacity to do high-speed
> networking - USB only.) Atheros publicity materials indicate that they
> increased the I-cache to 64KB for performance reasons, but saw no need to
> increase the D-cache at the same time.
>

But, they also have a core that's designed to do little to no processing
of the data, just DMA from one side to the other, while validating the
firewall rules... So it may be sufficient d-cache for that, without
having the capacity to do anything else.

> Which brings me back to the timers, and other items of black magic.
>

Which would point to under-utilizing the processor core, while still
having high load? (I'm not seeing that, I'm curious if that would be the
case).

>
> Incidentally, transfer speed benchmarks involving wireless will certainly
> be limited by the wireless link. I assume that's not a factor here.
>

That's the usual suspicion. But these are RF-chamber, short-range lab
setups where the radios are running at full speed in perfect
environments...

======

What this makes me realize is that I should go instrument the cpu stats
with each of the various operating modes:

* no shaping, anywhere
* egress shaping
* egress and ingress shaping at various limited levels:

    * 10Mbps
    * 20Mbps
    * 50Mbps
    * 100Mbps

-Aaron
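
For the per-mode instrumentation listed above, here is a minimal sketch of
one way the CPU breakdown could be captured. It assumes the numbers come
from the aggregate "cpu" line of /proc/stat and that a Python interpreter
is available on the box (otherwise the same arithmetic could be redone in
shell). The function names and mode labels are made up for illustration.
The softirq column is kept separate because forwarding and shaping work in
the kernel usually shows up largely there rather than as user or system
time.

```python
import time

# Column order of the aggregate "cpu" line in /proc/stat (values in jiffies).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_shares(before, after):
    """Percentage of elapsed jiffies spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu(path="/proc/stat"):
    """Read the current aggregate counters (first line of /proc/stat)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def measure(label, seconds=30, path="/proc/stat"):
    """Sample before and after a test window and print the breakdown (hypothetical helper)."""
    before = read_cpu(path)
    time.sleep(seconds)  # the throughput test runs during this window
    shares = cpu_shares(before, read_cpu(path))
    print(label, " ".join("%s=%.1f%%" % (k, shares[k]) for k in FIELDS))
    return shares
```

Calling measure("no shaping"), measure("egress+ingress 10Mbps") and so on,
once per mode while the transfer is running, should give comparable
numbers; idle and softirq are probably the two columns to watch.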