From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <woody77@gmail.com>
Received: from mail-ie0-x230.google.com (mail-ie0-x230.google.com
	[IPv6:2607:f8b0:4001:c03::230])
	(using TLSv1 with cipher RC4-SHA (128/128 bits))
	(Client CN "smtp.gmail.com",
	Issuer "Google Internet Authority G2" (verified OK))
	by huchra.bufferbloat.net (Postfix) with ESMTPS id 78D7021F2D8;
	Mon,  1 Sep 2014 15:14:43 -0700 (PDT)
Received: by mail-ie0-f176.google.com with SMTP id x19so6687484ier.21
	for <multiple recipients>; Mon, 01 Sep 2014 15:14:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=0dox5FIw0hYETjApo/GEgOYFId/zD/Na9q9pU9fMyMM=;
	b=uuAMuE91MSPTbu8vL1yW987zSEN95e1417ih7Krv96hLltdVPVXLyPu8p30qqkPhad
	t9n7dBUJ5xTxpC7ozkOcZZEGe4tzNHt8wwPYkcfb1EAnV6upLBZPlFJzJ/1q8x3v06Dk
	OPEVNZAolkoun4uvL2HCG+4e5PrWyWxzEomV1RibxFHkSQlSgTq1by6TkIUV2elVQte/
	f/10ouqoeb0Oq4f2yZZLXj4fUDxy3WqhlnK9AUizbEXuB4FWSaltBIExG1SyPjrrAQ+E
	oFlNtxalcvdadAXxxLS8tsmdqcyJw+k3qJkdxg6fkwElBLWf0eVzbyyxULC9HGVJ+awh
	UUyw==
MIME-Version: 1.0
X-Received: by 10.50.126.100 with SMTP id mx4mr24027508igb.1.1409609682861;
	Mon, 01 Sep 2014 15:14:42 -0700 (PDT)
Received: by 10.64.243.196 with HTTP; Mon, 1 Sep 2014 15:14:42 -0700 (PDT)
In-Reply-To: <D473BC2C-B0AD-425D-B27B-5145BCF7FDE3@gmail.com>
References: <CALQXh-Pkqh6Wq9xc9ky0ruztu74ziUoiaiG0LPr+DVLtDqz4mQ@mail.gmail.com>
	<CAA93jw7q6gfS78NW8NoKH5v-azXeYW2oKWp_batvE2VM9fjrYQ@mail.gmail.com>
	<A68CC928-9C56-416C-937C-5F7E7D8DA3AD@gmail.com>
	<87ppfijfjc.fsf@toke.dk>
	<4FF4917C-1B6D-4D5F-81B6-5FC177F12BFC@gmail.com>
	<4DA71387-6720-4A2F-B462-2E1295604C21@gmail.com>
	<CAA93jw4-3cPpDUKvHBp0q_waLq6QAreRUCzu1mYBQ7Xg0rPYGA@mail.gmail.com>
	<0DB9E121-7073-4DE9-B7E2-73A41BCBA1D1@gmail.com>
	<CAA93jw6OUEXsOVQsUs+wWbSAmTQUhEukAXPd2=WrEo+_Fpin-g@mail.gmail.com>
	<DB33CAC6-C233-4CA5-ABD0-86216E440541@gmail.com>
	<CAA93jw4NuUKegPScS2fyqgZZas+gX0Z85n8ZD+_0K0no7MAxSA@mail.gmail.com>
	<CALQXh-Pjno3KWyQEbOEYtx8oagMPr03ojezxN6c27qLqXD12Vw@mail.gmail.com>
	<D473BC2C-B0AD-425D-B27B-5145BCF7FDE3@gmail.com>
Date: Mon, 1 Sep 2014 15:14:42 -0700
Message-ID: <CALQXh-M74PmcS0KVA2H-8ZE_GEM6vQ9dDPEH_-yhFCGekk246A@mail.gmail.com>
From: Aaron Wood <woody77@gmail.com>
To: Jonathan Morton <chromatix99@gmail.com>
Content-Type: multipart/alternative; boundary=047d7b3a96046b46810502085469
Cc: "cerowrt-devel@lists.bufferbloat.net"
	<cerowrt-devel@lists.bufferbloat.net>, bloat <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] Comcast upped service levels -> WNDR3800 can't cope...
X-BeenThere: bloat@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: General list for discussing Bufferbloat <bloat.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/bloat>
List-Post: <mailto:bloat@lists.bufferbloat.net>
List-Help: <mailto:bloat-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Mon, 01 Sep 2014 22:14:43 -0000

--047d7b3a96046b46810502085469
Content-Type: text/plain; charset=UTF-8

Luckily, I don't mind being wrong (or even _way_ off the mark).


> I don't think that's it.
>
> First a nitpick: the PowerBook version of the late-model G4 (7447A)
> doesn't have the external L3 cache interface, so it only has the 256KB or
> 512KB internal L2 cache (I forget which).  The desktop version (7457A) used
> external cache.  The G4 was considered to be *crippled* by its FSB by the
> end of its run, since it never adopted high-performance signalling
> techniques, nor moved the memory controller on-die; it was quoted that the
> G5 (970) could move data using *single-byte* operations faster than the
> *peak* throughput of the G4's FSB.  The only reason the G5 never made it
> into a PowerBook was because it wasn't battery-friendly in the slightest.
>

And the specs on the G4 that I'd dug up were desktop specs.


> But that makes little difference to your argument - compared to a cheap
> CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware,
> even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have
> more bandwidth than my PowerBook's PCI bus.  Standard PCI is 33MHz x
> 32-bit, and I can push a steady 30MB/sec in both directions simultaneously,
> which corresponds in total to about half the PCI bus's theoretical
> capacity.  (The GEM reports 66MHz capability, but it shares the bus with an
> IDE controller which doesn't, so I assume it is stuck at 33MHz.)  A 16-bit
> RAM should be able to match PCI if it runs at 66MHz, which is the lower
> limit of JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which
> implies at least 200MHz unless the integrator was colossally stingy.
> Further, a little digging suggests that the memory bus should be 32-bit
> wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz,
> half the CPU core speed.  For an embedded SoC, that's really not too bad -
> it should be able to sustain 1GB/sec, in one direction at a time.
>

The kernel boot messages report 170MHz DDR operation, for 340MHz data-rates.

But I don't think it's 32-bit; I think it's running two banks of 64MB
chips in x16 mode.  That's based on my experience with other, similar
chips.  The AR7161 datasheet here:
https://wikidevi.com/files/Atheros/specsheets/AR7161.pdf notes that it's
DDR1, but doesn't give the bus width.

But even if it had an 8-bit bus, it sounds like it would still be able to
move packets pretty well (2Gbps vs. 8Gbps), so I don't think memory
bandwidth is the limit.
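
A quick back-of-the-envelope check of the bus numbers in this thread, as a
Python sketch.  It computes theoretical peak rates only (clock x transfers
per clock x width).  The helper name is made up for illustration, and the
8/16/32-bit widths are the candidates being debated, not confirmed values.
The 2Gbps and 8Gbps figures are rounder, more conservative numbers than
these peaks.

```python
def bus_bandwidth_mbytes(clock_mhz, width_bits, transfers_per_clock=1):
    """Theoretical peak bandwidth in MB/s (10^6 bytes per second)."""
    return clock_mhz * transfers_per_clock * width_bits / 8


# Standard PCI: 33MHz x 32-bit, one transfer per clock.
pci = bus_bandwidth_mbytes(33, 32)
print(f"PCI 33MHz x 32-bit:      {pci:.0f} MB/s")

# 16-bit SDRAM at 66MHz, the case said to match PCI.
sdram16 = bus_bandwidth_mbytes(66, 16)
print(f"SDRAM 66MHz x 16-bit:    {sdram16:.0f} MB/s")

# DDR at a 170MHz clock (two transfers per clock), for each width
# that has been suggested.  The width itself is the open question.
for width in (8, 16, 32):
    mb = bus_bandwidth_mbytes(170, width, transfers_per_clock=2)
    gbps = mb * 8 / 1000
    print(f"DDR 170MHz x {width:2d}-bit:     {mb:.0f} MB/s = {gbps:.2f} Gbps")
```

Sustained rates will be lower than these peaks, since each forwarded
packet crosses the memory bus at least twice (DMA in, DMA out).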



> So that takes care of the argument for simply moving the payload around.
> In any case, the WNDR demonstrably *can* cope with the available bandwidth
> if the shaping is turned off.
>
> For the purposes of shaping, the CPU shouldn't need to touch the majority
> of the payload - only the headers, which are relatively small.  The bulk of
> the payload should DMA from one NIC to RAM, then DMA back out of RAM to the
> other NIC.  It has to do that anyway to route them, and without shaping
> there'd be more of them to handle.  The difference might be in the data
> structures used by the shaper itself, but I think those are also reasonably
> compact.  It doesn't even have to touch userspace, since it's not acting as
> the endpoint as my PowerBook was during my tests.
>

In an ideal case, yes.  But is that how this gets managed?  (I have no
idea, I'm certainly not a kernel developer).

If the packet data is getting moved about from buffer to buffer (for
instance to do the htb calculations?) could that substantially change the
processing load?
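
To illustrate why a shaper in principle only needs the packet length and
not the payload, here is a minimal single-class token bucket in Python.
This is a sketch of the general technique, not the kernel's HTB code (which
adds a class hierarchy, borrowing, queues and timers).  The class and
parameter names are invented for the example.

```python
class TokenBucket:
    """Single-rate token bucket; tokens are counted in bytes."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.burst = burst_bytes        # maximum bucket depth in bytes
        self.tokens = float(burst_bytes)
        self.last = 0.0                 # timestamp of the previous update

    def allow(self, pkt_len, now):
        """Return True if a packet of pkt_len bytes may be sent at time now."""
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        # The decision uses only the length; the payload is never read.
        if pkt_len <= self.tokens:
            self.tokens -= pkt_len
            return True
        return False


if __name__ == "__main__":
    tb = TokenBucket(rate_bps=8_000_000, burst_bytes=3000)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, tb.allow(1500, now=t))
```

Whether the real qdisc path copies packet data between buffers is exactly
the open question; the sketch only shows that the rate calculation itself
does not require it.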



> And while the MIPS 24K core is old, it's also been die-shrunk over the
> intervening years, so it runs a lot faster than it originally did.  I very
> much doubt that it's as refined as my G4, but it could probably hold its
> own relative to a comparable ARM SoC such as the Raspberry Pi.
> (Unfortunately, the latter doesn't have the I/O capacity to do high-speed
> networking - USB only.)  Atheros publicity materials indicate that they
> increased the I-cache to 64KB for performance reasons, but saw no need to
> increase the D-cache at the same time.
>

But, they also have a core that's designed to do little to no processing of
the data, just DMA from one side to the other, while validating the
firewall rules...  So it may be sufficient d-cache for that, without having
the capacity to do anything else.


> Which brings me back to the timers, and other items of black magic.
>

Which would point to under-utilizing the processor core, while still having
high load? (I'm not seeing that, but I'm curious whether that would be the
case.)


>
> Incidentally, transfer speed benchmarks involving wireless will certainly
> be limited by the wireless link.  I assume that's not a factor here.
>

That's the usual suspicion.  But these are RF-chamber, short-range lab
setups where the radios are running at full speed in perfect environments...

======

What this makes me realize is that I should go instrument the CPU stats
in each of the various operating modes:

* no shaping, anywhere
* egress shaping
* egress and ingress shaping at various limited levels:
    * 10Mbps
    * 20Mbps
    * 50Mbps
    * 100Mbps
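
One possible way to collect those CPU stats, sketched in Python: sample
the aggregate "cpu" line of /proc/stat before and after a test run and
report the share of time in each state.  The function names are made up
for the example, and it assumes Python is available on the box (or that
the /proc/stat snapshots are captured on the router and processed
elsewhere).  On Linux, packet forwarding and qdisc work typically shows up
under softirq.

```python
import time

# Column order of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu  ...' line into a dict of jiffy counters."""
    parts = line.split()
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def read_cpu(path="/proc/stat"):
    """Read the aggregate CPU counters from /proc/stat."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return parse_cpu_line(line)
    raise RuntimeError("no aggregate cpu line found")


def percentages(before, after):
    """Percentage of elapsed CPU time spent in each state between samples."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values()) or 1   # avoid dividing by zero
    return {k: 100.0 * v / total for k, v in deltas.items()}


def sample(interval=5.0):
    """Take two samples 'interval' seconds apart and return the percentages."""
    before = read_cpu()
    time.sleep(interval)
    after = read_cpu()
    return percentages(before, after)
```

Running sample() once per mode while the same load test is in progress
would give one row of percentages per shaping configuration to compare.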

-Aaron

--047d7b3a96046b46810502085469--