From: Sebastian Moeller
Date: Tue, 2 Sep 2014 10:55:19 +0200
To: Jonathan Morton
Cc: "cerowrt-devel@lists.bufferbloat.net", bloat
Subject: Re: [Cerowrt-devel] [Bloat] Comcast upped service levels -> WNDR3800 can't cope...
Message-Id: <88DDD48D-8854-4E6E-BE02-F761C13C9201@gmx.de>

Hi Jonathan, hi List,

On Sep 1, 2014, at 23:43, Jonathan Morton wrote:

> On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:
>
>>>> But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without.
>>>> The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn the shaping knob up to a value quite close to the hardware's unshaped capability (e.g. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.
>>>>
>>>> It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a "proper" desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?
>
>>> Both good questions worth further exploration.
>
>> Doing some napkin math and some spec reading, I think the memory bus is a likely factor. The G4 had a fairly impressive memory bus for its day (64-bit?). The WNDR3800 appears to use its RAM in an x16 configuration (based on the numbers on the memory parts). It may have *just* enough bandwidth to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC).
>>
>> The typical way I've seen a home router benchmarked for the "marketing numbers" is to flow TCP data to/from a wifi client to a wired client. A single socket is used, for a uni-directional stream of data. So long as they can hit peak rates (peak MCS), it gets marked as good for "up to 900Mbps!!" or whatever they want to say.
>>
>> The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB): the various buffers for fq_codel and HTB may stay in L2 on the G4, but there simply isn't room for that in the AR7161, which puts further pressure on the bus.
>
> I don't think that's it.
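[Editor's note: the 500-clocks-per-packet estimate above is easy to sanity-check with quick arithmetic. The ~1.5 GHz core clock assumed for the late PowerBook G4 is my assumption, not stated in the thread.]

```python
# Rough sanity check of the shaping-cost estimate above.
# Assumption (not from the thread): a late PowerBook G4 runs at ~1.5 GHz.

CPU_HZ   = 1.5e9          # assumed G4 core clock
RATE_BPS = 400e6          # 400 Mbit/s one-way, as in Jonathan's test
PKT_BITS = 1500 * 8       # full-size Ethernet payload

pkts_per_sec = RATE_BPS / PKT_BITS        # ~33,333 packets/s at MTU size
clocks_avail = CPU_HZ / pkts_per_sec      # clock budget per packet
print(f"{pkts_per_sec:.0f} pkt/s, {clocks_avail:.0f} clocks available per packet")
print(f"500 clocks is ~{500 / clocks_avail:.1%} of that budget")
```

At the assumed clock rate, 500 clocks is on the order of 1% of the per-packet budget, which is consistent with the shaping overhead "disappearing" on the G4.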
>
> First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which). The desktop version (7457A) used external cache. The G4 was considered to be *crippled* by its FSB by the end of its run, since it never adopted high-performance signalling techniques, nor moved the memory controller on-die; it was quoted that the G5 (970) could move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB. The only reason the G5 never made it into a PowerBook was that it wasn't battery-friendly in the slightest.
>
> But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x 32-bit, and I can push a steady 30MB/sec in both directions simultaneously, which corresponds in total to about half the PCI bus's theoretical capacity. (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of the JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy. Further, a little digging suggests that the memory bus should be 32 bits wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed. For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time.
>
> So that takes care of the argument for simply moving the payload around.
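[Editor's note: the bus figures quoted above check out arithmetically; a quick sketch using only numbers from the paragraph itself:]

```python
# Verifying the bus-bandwidth claims above.
MB = 1e6

pci_bw   = 33e6 * 4      # 33 MHz x 32-bit PCI: ~132 MB/s theoretical
measured = 30 * MB * 2   # 30 MB/s each way, both directions at once
print(f"PCI theoretical: {pci_bw / MB:.0f} MB/s, measured total: "
      f"{measured / MB:.0f} MB/s ({measured / pci_bw:.0%} of theoretical)")

sdram_16 = 66e6 * 2      # 16-bit SDRAM at 66 MHz: ~132 MB/s
print(f"16-bit 66 MHz SDRAM: {sdram_16 / MB:.0f} MB/s -- matches PCI")
```

60 MB/s total is roughly 45% of the 132 MB/s theoretical PCI ceiling ("about half", as stated), and a 16-bit RAM at 66 MHz does indeed match 32-bit 33 MHz PCI byte for byte.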
> In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off.

	That makes me wonder: couldn't some reasonable batching help here? We know we can shape roughly 50Mbps combined with no batching (which corresponds to batching at packet size, I guess), so get out the envelope, turn it around and go: 50*1000*1000 / (1500*8) = 4166.67, so that would be around 4000 packets per second (well, with small packets it might be more). So what about batching up enough packets to take at least 250µs to transfer, pushing the whole bunch to the tx queue, and then sleeping for 250µs? (I have not looked at the code, and something like this might already be happening; the idea is to figure out the highest sustainable shaping frequency and make sure we do not attempt to make the shaping decisions more often than that. Sure, batching will introduce some latency, but if my silly numbers are roughly right, I would be quite happy to accept 1/4 ms of added latency if the wndr3[7|8]00 would still last a bit into the future ;) ) Heck, what about running HTB simply from a 1ms timer instead of from a data-driven timer?

	Now, since HTB was quite costly for the machines of the time when it appeared on the scene, I wonder whether there are any computation-demand mitigation mechanisms somewhere in the code base…

Best Regards
	Sebastian

> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint, as my PowerBook was during my tests.
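[Editor's note: Sebastian's envelope math on batching can be sketched as follows. The 50 Mbit/s rate and 250 µs tick are the figures from the mail; the fixed-tick batching scheme itself is his hypothetical, not anything in the HTB code.]

```python
# Sketch of the proposed idea: run the shaper from a fixed timer and
# release whole batches, instead of re-arming a timer per packet.

RATE_BPS = 50_000_000   # 50 Mbit/s combined shaped rate, from the mail
TICK_S   = 250e-6       # proposed shaping interval (250 µs)
MTU      = 1500         # bytes

def batch_budget(rate_bps: float, tick_s: float) -> float:
    """Bytes the shaper may release per timer tick at the given rate."""
    return rate_bps * tick_s / 8

budget = batch_budget(RATE_BPS, TICK_S)
print(f"budget per tick: {budget:.1f} bytes (~{budget / MTU:.1f} MTU packets)")
print(f"worst-case added latency: one tick = {TICK_S * 1e6:.0f} µs")
```

Note that at 50 Mbit/s a 250 µs tick's budget is barely one full-size packet, so the savings from batching would come mainly from small packets or from higher target rates, and the worst-case added latency is bounded by the tick interval, matching the 1/4 ms figure in the mail.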
>
> And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did. I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi. (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.) Atheros publicity materials indicate that they increased the I-cache to 64KB for performance reasons, but saw no need to increase the D-cache at the same time.
>
> Which brings me back to the timers, and other items of black magic.
>
> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.
>
> - Jonathan Morton