From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ie0-x230.google.com (mail-ie0-x230.google.com [IPv6:2607:f8b0:4001:c03::230])
    (using TLSv1 with cipher RC4-SHA (128/128 bits))
    (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK))
    by huchra.bufferbloat.net (Postfix) with ESMTPS id 78D7021F2D8;
    Mon, 1 Sep 2014 15:14:43 -0700 (PDT)
Received: by mail-ie0-f176.google.com with SMTP id x19so6687484ier.21
    for ; Mon, 01 Sep 2014 15:14:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
    d=gmail.com; s=20120113;
    h=mime-version:in-reply-to:references:date:message-id:subject:from:to
    :cc:content-type;
    bh=0dox5FIw0hYETjApo/GEgOYFId/zD/Na9q9pU9fMyMM=;
    b=uuAMuE91MSPTbu8vL1yW987zSEN95e1417ih7Krv96hLltdVPVXLyPu8p30qqkPhad
    t9n7dBUJ5xTxpC7ozkOcZZEGe4tzNHt8wwPYkcfb1EAnV6upLBZPlFJzJ/1q8x3v06Dk
    OPEVNZAolkoun4uvL2HCG+4e5PrWyWxzEomV1RibxFHkSQlSgTq1by6TkIUV2elVQte/
    f/10ouqoeb0Oq4f2yZZLXj4fUDxy3WqhlnK9AUizbEXuB4FWSaltBIExG1SyPjrrAQ+E
    oFlNtxalcvdadAXxxLS8tsmdqcyJw+k3qJkdxg6fkwElBLWf0eVzbyyxULC9HGVJ+awh
    UUyw==
MIME-Version: 1.0
X-Received: by 10.50.126.100 with SMTP id mx4mr24027508igb.1.1409609682861;
    Mon, 01 Sep 2014 15:14:42 -0700 (PDT)
Received: by 10.64.243.196 with HTTP; Mon, 1 Sep 2014 15:14:42 -0700 (PDT)
In-Reply-To:
References: <87ppfijfjc.fsf@toke.dk> <4FF4917C-1B6D-4D5F-81B6-5FC177F12BFC@gmail.com> <4DA71387-6720-4A2F-B462-2E1295604C21@gmail.com> <0DB9E121-7073-4DE9-B7E2-73A41BCBA1D1@gmail.com>
Date: Mon, 1 Sep 2014 15:14:42 -0700
Message-ID:
From: Aaron Wood
To: Jonathan Morton
Content-Type: multipart/alternative; boundary=047d7b3a96046b46810502085469
Cc: "cerowrt-devel@lists.bufferbloat.net" , bloat
Subject: Re: [Bloat] Comcast upped service levels -> WNDR3800 can't cope...
X-BeenThere: bloat@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: General list for discussing Bufferbloat
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 01 Sep 2014 22:14:43 -0000

--047d7b3a96046b46810502085469
Content-Type: text/plain; charset=UTF-8

Luckily, I don't mind being wrong (or even _way_ off the mark).

I don't think that's it.
>
> First a nitpick: the PowerBook version of the late-model G4 (7447A)
> doesn't have the external L3 cache interface, so it only has the 256KB or
> 512KB internal L2 cache (I forget which). The desktop version (7457A) used
> external cache. The G4 was considered to be *crippled* by its FSB by the
> end of its run, since it never adopted high-performance signalling
> techniques, nor moved the memory controller on-die; it was quoted that the
> G5 (970) could move data using *single-byte* operations faster than the
> *peak* throughput of the G4's FSB. The only reason the G5 never made it
> into a PowerBook was because it wasn't battery-friendly in the slightest.
>

And the specs on the G4 that I'd dug up were desktop specs.

> But that makes little difference to your argument - compared to a cheap
> CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware,
> even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have
> more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x
> 32-bit, and I can push a steady 30MB/sec in both directions simultaneously,
> which corresponds in total to about half the PCI bus's theoretical
> capacity. (The GEM reports 66MHz capability, but it shares the bus with an
> IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit
> RAM should be able to match PCI if it runs at 66MHz, which is the lower
> limit of JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which
> implies at least 200MHz unless the integrator was colossally stingy.
> Further, a little digging suggests that the memory bus should be 32-bit
> wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz,
> half the CPU core speed. For an embedded SoC, that's really not too bad -
> it should be able to sustain 1GB/sec, in one direction at a time.
>

The kernel boot messages report 170MHz DDR operation, for 340MHz
data-rates.

But, I don't think it's 32-bit, I think it's running two banks of 64MB
chips in x16 mode. That's based on my experiences with other, similar
chips. The AR7161 datasheet here:
https://wikidevi.com/files/Atheros/specsheets/AR7161.pdf notes it's
DDR1, but not the bus width.

But even if it had an 8-bit bus, it sounds like it would have the
ability to move packets pretty well, so that's not the case (2Gbps vs.
8Gbps)

> So that takes care of the argument for simply moving the payload around.
> In any case, the WNDR demonstrably *can* cope with the available bandwidth
> if the shaping is turned off.
>
> For the purposes of shaping, the CPU shouldn't need to touch the majority
> of the payload - only the headers, which are relatively small. The bulk of
> the payload should DMA from one NIC to RAM, then DMA back out of RAM to the
> other NIC. It has to do that anyway to route them, and without shaping
> there'd be more of them to handle. The difference might be in the data
> structures used by the shaper itself, but I think those are also reasonably
> compact. It doesn't even have to touch userspace, since it's not acting as
> the endpoint as my PowerBook was during my tests.
>

In an ideal case, yes. But is that how this gets managed? (I have no
idea, I'm certainly not a kernel developer).

If the packet data is getting moved about from buffer to buffer (for
instance to do the htb calculations?) could that substantially change
the processing load?
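
For reference, the bus-width figures traded above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the 340 MT/s DDR data rate from the boot messages, takes peak bandwidth as bus width times transfer rate, and ignores refresh, bank conflicts and read/write turnaround, so real sustained numbers will be lower. The function name is made up for illustration.

```python
def peak_bandwidth_gbps(bus_width_bits, transfers_per_sec):
    """Theoretical peak bandwidth in Gbit/s: one bus-width word per transfer."""
    return bus_width_bits * transfers_per_sec / 1e9


# 170 MHz DDR clock -> 340 million transfers per second (per the boot log).
DDR_RATE = 340e6

# The three bus widths that came up in the thread.
for width in (8, 16, 32):
    gbps = peak_bandwidth_gbps(width, DDR_RATE)
    print(f"{width:2d}-bit DDR @ 340 MT/s: {gbps:5.2f} Gbit/s ({gbps / 8:.2f} GB/s)")

# Standard PCI for comparison: 32 bits wide at 33 MHz.
pci = peak_bandwidth_gbps(32, 33e6)
print(f"PCI 32-bit @ 33 MHz:    {pci:5.2f} Gbit/s ({pci * 1000 / 8:.0f} MB/s)")
```

Under these assumptions an 8-bit bus peaks at roughly 2.7 Gbit/s and a 32-bit bus at roughly 10.9 Gbit/s, which is the same ballpark as the 2Gbps vs. 8Gbps estimate, and every case is above the PCI figure.
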
> And while the MIPS 24K core is old, it's also been die-shrunk over the
> intervening years, so it runs a lot faster than it originally did. I very
> much doubt that it's as refined as my G4, but it could probably hold its
> own relative to a comparable ARM SoC such as the Raspberry Pi.
> (Unfortunately, the latter doesn't have the I/O capacity to do high-speed
> networking - USB only.) Atheros publicity materials indicate that they
> increased the I-cache to 64KB for performance reasons, but saw no need to
> increase the D-cache at the same time.
>

But, they also have a core that's designed to do little to no
processing of the data, just DMA from one side to the other, while
validating the firewall rules... So it may be sufficient d-cache for
that, without having the capacity to do anything else.

> Which brings me back to the timers, and other items of black magic.
>

Which would point to under-utilizing the processor core, while still
having high load? (I'm not seeing that, I'm curious if that would be
the case).

>
> Incidentally, transfer speed benchmarks involving wireless will certainly
> be limited by the wireless link. I assume that's not a factor here.
>

That's the usual suspicion. But these are RF-chamber, short-range lab
setups where the radios are running at full speed in perfect
environments...

======

What this makes me realize is that I should go instrument the cpu stats
with each of the various operating modes:

* no shaping, anywhere
* egress shaping
* egress and ingress shaping at various limited levels:
    * 10Mbps
    * 20Mbps
    * 50Mbps
    * 100Mbps

-Aaron

--047d7b3a96046b46810502085469
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--047d7b3a96046b46810502085469--
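
One way to collect the CPU stats for each of the operating modes listed above is to sample /proc/stat before and after a test run and compare the counters. What follows is a minimal sketch, assuming a Linux /proc/stat with the usual aggregate "cpu" line. The helper names are hypothetical, and on the router itself the same arithmetic may have to be done in shell or awk if Python is not installed.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first seven counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:8])))
    raise ValueError("no aggregate cpu line found")


def cpu_shares(before, after):
    """Percentage of elapsed jiffies spent in each state between two samples."""
    delta = {key: after[key] - before[key] for key in before}
    total = sum(delta.values()) or 1  # avoid dividing by zero on identical samples
    return {key: 100.0 * value / total for key, value in delta.items()}


def read_sample(path="/proc/stat"):
    """Take one sample from the running system (not called in this sketch)."""
    with open(path) as handle:
        return parse_cpu_line(handle.read())
```

Calling `read_sample()` once before and once after each run (no shaping, egress only, egress plus ingress at each rate) and passing the pair to `cpu_shares()` gives a per-mode breakdown. The softirq share is probably the one to watch, since packet forwarding and qdisc work run largely in softirq context.
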