From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <chromatix99@gmail.com>
Received: from mail-lb0-x232.google.com (mail-lb0-x232.google.com
	[IPv6:2a00:1450:4010:c04::232])
	(using TLSv1 with cipher RC4-SHA (128/128 bits))
	(Client CN "smtp.gmail.com",
	Issuer "Google Internet Authority G2" (verified OK))
	by huchra.bufferbloat.net (Postfix) with ESMTPS id E8CDF21F2DE;
	Mon,  1 Sep 2014 14:43:33 -0700 (PDT)
Received: by mail-lb0-f178.google.com with SMTP id v6so6496496lbi.37
	for <multiple recipients>; Mon, 01 Sep 2014 14:43:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=subject:mime-version:content-type:from:in-reply-to:date:cc
	:content-transfer-encoding:message-id:references:to;
	bh=eODwwmQkgWoP0bx3PXz0eGJC+JUcbYfmHi4WVhemKdo=;
	b=E1CtsJhbHhxEImkjRwoP7MY797yD45AFUD+lhsKXPssu5A40RlN8tuTwphDMXF9Zws
	52OfnS1tbfuOor7fwj338AraKK//TEnZ5t81Fggh5yqecKAh0ocAh/4wm37bokO+Aqn5
	qUeS2bbZHZgstv8njFiYBpgSJFnbSnyiyspun+EG3IGnIu0davCh6RACee27pwZEdpsQ
	U3RLjOIyzS0ixBo1OuMfYcrno0DcGdshrjt6UwLkRltkMt0wKyYoTxDYEah4sdsicr5l
	eJPOfzKuO7FyRrWTaDkXD3fFXN71gWOOfxrMqfD+nuE4o9FJMRNU/K48GMCeNrEu3nXS
	IHTA==
X-Received: by 10.152.43.14 with SMTP id s14mr12958827lal.28.1409607811429;
	Mon, 01 Sep 2014 14:43:31 -0700 (PDT)
Received: from bass.home.chromatix.fi (87-93-123-167.bb.dnainternet.fi.
	[87.93.123.167])
	by mx.google.com with ESMTPSA id kq4sm1275702lac.13.2014.09.01.14.43.29
	for <multiple recipients>
	(version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
	Mon, 01 Sep 2014 14:43:30 -0700 (PDT)
Mime-Version: 1.0 (Apple Message framework v1085)
Content-Type: text/plain; charset=us-ascii
From: Jonathan Morton <chromatix99@gmail.com>
In-Reply-To: <CALQXh-Pjno3KWyQEbOEYtx8oagMPr03ojezxN6c27qLqXD12Vw@mail.gmail.com>
Date: Tue, 2 Sep 2014 00:43:28 +0300
Content-Transfer-Encoding: quoted-printable
Message-Id: <D473BC2C-B0AD-425D-B27B-5145BCF7FDE3@gmail.com>
References: <CALQXh-Pkqh6Wq9xc9ky0ruztu74ziUoiaiG0LPr+DVLtDqz4mQ@mail.gmail.com>
	<CAA93jw7q6gfS78NW8NoKH5v-azXeYW2oKWp_batvE2VM9fjrYQ@mail.gmail.com>
	<A68CC928-9C56-416C-937C-5F7E7D8DA3AD@gmail.com>
	<87ppfijfjc.fsf@toke.dk>
	<4FF4917C-1B6D-4D5F-81B6-5FC177F12BFC@gmail.com>
	<4DA71387-6720-4A2F-B462-2E1295604C21@gmail.com>
	<CAA93jw4-3cPpDUKvHBp0q_waLq6QAreRUCzu1mYBQ7Xg0rPYGA@mail.gmail.com>
	<0DB9E121-7073-4DE9-B7E2-73A41BCBA1D1@gmail.com>
	<CAA93jw6OUEXsOVQsUs+wWbSAmTQUhEukAXPd2=WrEo+_Fpin-g@mail.gmail.com>
	<DB33CAC6-C233-4CA5-ABD0-86216E440541@gmail.com>
	<CAA93jw4NuUKegPScS2fyqgZZas+gX0Z85n8ZD+_0K0no7MAxSA@mail.gmail.com>
	<CALQXh-Pjno3KWyQEbOEYtx8oagMPr03ojezxN6c27qLqXD12Vw@mail.gmail.com>
To: Aaron Wood <woody77@gmail.com>
X-Mailer: Apple Mail (2.1085)
Cc: "cerowrt-devel@lists.bufferbloat.net"
	<cerowrt-devel@lists.bufferbloat.net>, bloat <bloat@lists.bufferbloat.net>
Subject: Re: [Cerowrt-devel] [Bloat] Comcast upped service levels ->
	WNDR3800 can't cope...
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: Development issues regarding the cerowrt test router project
	<cerowrt-devel.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/cerowrt-devel>,
	<mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/cerowrt-devel>
List-Post: <mailto:cerowrt-devel@lists.bufferbloat.net>
List-Help: <mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/cerowrt-devel>,
	<mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Mon, 01 Sep 2014 21:43:34 -0000


On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:

>>> But this doesn't really answer the question of why the WNDR has so =
much lower a ceiling with shaping than without.  The G4 is powerful =
enough that the overhead of shaping simply disappears next to the =
overhead of shoving data around.  Even when I turn up the shaping knob =
to a value quite close to the hardware's unshaped capabilities (eg. =
400Mbps one-way), most of the shapers stick to the requested limit like =
glue, and even the worst offender is within 10%.  I estimate that it's =
using only about 500 clocks per packet *unless* it saturates the PCI =
bus.
>>>=20
>>> It's possible, however, that we're not really looking at a CPU =
limitation, but a timer problem.  The PowerBook is a "proper" desktop =
computer with hardware to match (modulo its age).  If all the shapers =
now depend on the high-resolution timer, how high-resolution is the =
WNDR's timer?

>> Both good questions worth further exploration.

> Doing some napkin math and some spec reading, I think that the memory =
bus is a likely factor.  The G4 had a fairly impressive memory bus for =
the day (64-bit?).  The WNDR3800 appears to be used in an x16 =
configuration (based on the numbers on the memory parts).  It may have =
*just* enough bw to push concurrent 3x3 802.11n through the software =
bridge interface, which short-circuits a lot of processing (IIRC).  =20
>=20
> The typical way I've seen a home router being benchmarked for the =
"marketing numbers" is to flow tcp data to/from a wifi client to a wired =
client.  Single socket is used, for a uni-directional stream of data.  =
So long as they can hit peak rates (peak MCS), it will get marked as =
good for "up to 900Mbps!!" or whatever they want to say.
>=20
> The small cache of the AR7161 vs. the G4 is another issue (32KB vs. =
2MB): the various buffers for fq_codel and htb may stay in L2 on the G4, =
but there simply isn't room in the AR7161 for that, which puts further =
pressure on the bus.

I don't think that's it.

First a nitpick: the PowerBook version of the late-model G4 (7447A) =
doesn't have the external L3 cache interface, so it only has the 256KB =
or 512KB internal L2 cache (I forget which).  The desktop version =
(7457A) used external cache.  The G4 was considered to be *crippled* by =
its FSB by the end of its run, since it never adopted high-performance =
signalling techniques, nor moved the memory controller on-die; it was =
quoted that the G5 (970) could move data using *single-byte* operations =
faster than the *peak* throughput of the G4's FSB.  The only reason the =
G5 never made it into a PowerBook was that it wasn't battery-friendly =
in the slightest.

But that makes little difference to your argument - compared to a cheap =
CPE-class embedded SoC, the PowerBook is eminently desktop-class =
hardware, even if it is already a decade old.

More compelling is that even at 16-bit width, the WNDR's RAM should have =
more bandwidth than my PowerBook's PCI bus.  Standard PCI is 33MHz x =
32-bit, and I can push a steady 30MB/sec in both directions =
simultaneously, which corresponds in total to about half the PCI bus's =
theoretical capacity.  (The GEM reports 66MHz capability, but it shares =
the bus with an IDE controller which doesn't, so I assume it is stuck at =
33MHz.)  A 16-bit RAM should be able to match PCI if it runs at 66MHz, =
which is the lower limit of JEDEC standards for SDRAM.
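
As a rough check of the arithmetic above, here is a minimal sketch in
Python.  The helper name is made up for illustration, "MB" is assumed
to mean 10^6 bytes, and the results are theoretical peaks that ignore
arbitration, refresh and protocol overhead.

```python
def peak_bytes_per_sec(clock_hz, width_bits, transfers_per_clock=1):
    """Theoretical peak of a parallel bus, ignoring all overhead."""
    return clock_hz * transfers_per_clock * width_bits // 8

# Standard PCI as described above: 33MHz x 32-bit.
pci_peak = peak_bytes_per_sec(33_000_000, 32)

# Measured load: a steady 30MB/sec in each direction at once.
measured_total = 2 * 30_000_000
pci_utilisation = measured_total / pci_peak

# A 16-bit single-data-rate SDRAM clocked at 66MHz.
sdram16_peak = peak_bytes_per_sec(66_000_000, 16)

print(pci_peak, sdram16_peak, round(pci_utilisation, 2))
```

Under these assumptions the two peaks come out equal, and the measured
traffic is a little under half of the PCI figure.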

The AR7161 datasheet says it has a DDR-capable SDRAM interface, which =
implies at least 200MHz unless the integrator was colossally stingy.  =
Further, a little digging suggests that the memory bus should be 32-bit =
wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, =
half the CPU core speed.  For an embedded SoC, that's really not too bad =
- it should be able to sustain 1GB/sec, in one direction at a time.
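
The 340MHz figure can be read two ways, so this small Python sketch
evaluates both.  The 32-bit width comes from the paragraph above;
whether 340MHz is the bus clock (two transfers per clock for DDR) or
the effective data rate is an assumption either way, and the helper
name is purely illustrative.

```python
def peak_bytes_per_sec(clock_hz, width_bits, transfers_per_clock=1):
    """Theoretical peak of a parallel bus, ignoring all overhead."""
    return clock_hz * transfers_per_clock * width_bits // 8

WIDTH_BITS = 32                     # two 16-bit RAM chips side by side
MEM_HZ = 340_000_000                # half the CPU core speed
SUSTAINED_ESTIMATE = 1_000_000_000  # the 1GB/sec estimate above

# Reading 1: 340MHz is the bus clock, with DDR moving data twice per clock.
peak_if_bus_clock = peak_bytes_per_sec(MEM_HZ, WIDTH_BITS, 2)
# Reading 2: 340MHz is already the effective transfer rate.
peak_if_data_rate = peak_bytes_per_sec(MEM_HZ, WIDTH_BITS, 1)

for label, peak in (("bus clock", peak_if_bus_clock),
                    ("data rate", peak_if_data_rate)):
    # Fraction of the theoretical peak that 1GB/sec would represent.
    print(label, peak, round(SUSTAINED_ESTIMATE / peak, 2))
```

On either reading the theoretical peak sits above the 1GB/sec
estimate, which leaves some margin for real-world inefficiency.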

So that takes care of the argument for simply moving the payload around. =
 In any case, the WNDR demonstrably *can* cope with the available =
bandwidth if the shaping is turned off.

For the purposes of shaping, the CPU shouldn't need to touch the =
majority of the payload - only the headers, which are relatively small.  =
The bulk of the payload should DMA from one NIC to RAM, then DMA back =
out of RAM to the other NIC.  It has to do that anyway to route packets, =
and without shaping there'd be more of them to handle.  The difference =
might be in the data structures used by the shaper itself, but I think =
those are also reasonably compact.  The traffic doesn't even have to =
touch userspace, since the router isn't acting as the endpoint as my =
PowerBook was during my tests.

And while the MIPS 24K core is old, it's also been die-shrunk over the =
intervening years, so it runs a lot faster than it originally did.  I =
very much doubt that it's as refined as my G4, but it could probably =
hold its own relative to a comparable ARM SoC such as the Raspberry Pi.  =
(Unfortunately, the latter doesn't have the I/O capacity to do =
high-speed networking - USB only.)  Atheros publicity materials indicate =
that they increased the I-cache to 64KB for performance reasons, but saw =
no need to increase the D-cache at the same time.

Which brings me back to the timers, and other items of black magic.
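
To give a feel for why timer resolution could matter, here is a
hedged Python sketch.  The 100Mbps rate, the 1500-byte packet size and
the candidate tick lengths are all hypothetical values picked for
illustration, not measurements of the WNDR, and the function names are
invented here.

```python
def packet_interval_us(rate_bps, packet_bytes=1500):
    """Microseconds between packet departures at a given shaped rate."""
    return packet_bytes * 8 * 1_000_000 / rate_bps

def packets_per_tick(rate_bps, tick_us, packet_bytes=1500):
    """Packets that fall due during one timer tick of the given length."""
    return tick_us / packet_interval_us(rate_bps, packet_bytes)

RATE_BPS = 100_000_000  # hypothetical shaped rate

# Hypothetical timer granularities, in microseconds.
for tick_us in (1, 100, 1000, 10000):
    print(tick_us, round(packets_per_tick(RATE_BPS, tick_us), 2))
```

When the result is well above 1, a shaper that is woken only on timer
ticks would have to release several packets per wakeup to hold the
requested rate; when it is below 1, the timer is fine-grained enough
to pace individual packets.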

Incidentally, transfer speed benchmarks involving wireless will =
certainly be limited by the wireless link.  I assume that's not a factor =
here.

 - Jonathan Morton