[Cerowrt-devel] [Bloat] Comcast upped service levels -> WNDR3800 can't cope...

Tue Sep 2 04:55:19 EDT 2014

Hi Jonathan, hi List,

On Sep 1, 2014, at 23:43 , Jonathan Morton <chromatix99 at gmail.com> wrote:

> 
> On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:
> 
>>>> But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without.  The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around.  Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%.  I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.
>>>> 
>>>> It's possible, however, that we're not really looking at a CPU limitation, but a timer problem.  The PowerBook is a "proper" desktop computer with hardware to match (modulo its age).  If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?
> 
>>> Both good questions worth further exploration.
> 
>> Doing some napkin math and some spec reading, I think that the memory bus is a likely factor.  The G4 had a fairly impressive memory bus for the day (64-bit?).  The WNDR3800 appears to be used in an x16 configuration (based on the numbers on the memory parts).  It may have *just* enough bw to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC).
>> 
>> The typical way I've seen a home router being benchmarked for the "marketing numbers" is to flow tcp data to/from a wifi client to a wired client.  Single socket is used, for a uni-directional stream of data.  So long as they can hit peak rates (peak MCS), it will get marked as good for "up to 900Mbps!!" or whatever they want to say.
>> 
>> The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB): the various buffers for fq_codel and htb may stay in L2 on the G4, but there simply isn't room in the AR7161 for that, which puts further pressure on the bus.
> 
> I don't think that's it.
> 
> First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which).  The desktop version (7457A) used external cache.  The G4 was considered to be *crippled* by its FSB by the end of its run, since it never adopted high-performance signalling techniques, nor moved the memory controller on-die; it was quoted that the G5 (970) could move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB.  The only reason the G5 never made it into a PowerBook was because it wasn't battery-friendly in the slightest.
> 
> But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old.
> 
> More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus.  Standard PCI is 33MHz x 32-bit, and I can push a steady 30MB/sec in both directions simultaneously, which corresponds in total to about half the PCI bus's theoretical capacity.  (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.)  A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM.
> 
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy.  Further, a little digging suggests that the memory bus should be 32-bit wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed.  For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time.
> 
> So that takes care of the argument for simply moving the payload around.  In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off.

	That makes me wonder: couldn’t some reasonable batching help here? We know we can shape roughly 50Mbps combined with no batching (which corresponds to batching at packet size, I guess), so get out the envelope, turn it around and go:
50*1000*1000 / (1500*8) = 4166.67, so that would be around 4000 packets per second (with small packets it might be more). So what about batching up enough packets to take at least 250µs to transfer, pushing the whole bunch to the tx queue, and then sleeping for 250µs? (Since I have not looked at the code, something like this might already be happening. I guess the idea is to figure out the highest sustainable shaping frequency and just make sure we do not attempt to make the shaping decisions more often. Sure, batching will introduce some latency, but if my silly numbers are roughly right, I would be quite happy to accept 1/4 ms of added latency if the wndr3[7|8]00 would still last a bit into the future ;) ) Heck, what about running HTB simply from a 1ms timer instead of from a data-driven timer?
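
For anyone who wants to replay the envelope numbers, here is a rough sketch in Python. It is only arithmetic, not anything taken from the HTB code; the function names are made up for illustration, and the 1500-byte full-size packet is the same assumption as in the estimate above.

```python
# Back-of-the-envelope batch sizing for a timer-driven shaper.
# Hypothetical helper names; nothing here comes from the kernel sources.

FULL_PACKET_BYTES = 1500  # assumed full-size packet, as in the estimate above


def packets_per_second(rate_bps, packet_bytes=FULL_PACKET_BYTES):
    """Packets per second needed to carry rate_bps with packets of this size."""
    return rate_bps / (packet_bytes * 8)


def batch_bytes(rate_bps, tick_us):
    """Bytes that may be released per timer tick to hold the shaped rate."""
    return rate_bps * tick_us / 8_000_000


if __name__ == "__main__":
    rate = 50_000_000  # the roughly 50Mbps combined rate mentioned above
    print(f"{packets_per_second(rate):.0f} full-size packets/s")
    for tick_us in (250, 1000):
        b = batch_bytes(rate, tick_us)
        print(f"{tick_us} us tick: {b:.0f} bytes, "
              f"about {b / FULL_PACKET_BYTES:.1f} full-size packets")
```

Note that at 50Mbps a 250µs tick holds only about one full-size packet, so by this arithmetic batching would start to pay off only at higher rates or with a longer tick such as 1ms.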
	Now, since HTB was quite costly for the machines of the time when it first appeared on the scene, I wonder whether there are any mechanisms for mitigating its computational demand somewhere in the code base…

Best Regards
	Sebastian

> 
> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small.  The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC.  It has to do that anyway to route them, and without shaping there'd be more of them to handle.  The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact.  It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.
> 
> And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did.  I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi.  (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.)  Atheros publicity materials indicate that they increased the I-cache to 64KB for performance reasons, but saw no need to increase the D-cache at the same time.
> 
> Which brings me back to the timers, and other items of black magic.
> 
> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link.  I assume that's not a factor here.
> 
> - Jonathan Morton