From: Sebastian Moeller
Date: Tue, 2 Sep 2014 10:55:19 +0200
To: Jonathan Morton
Cc: "cerowrt-devel@lists.bufferbloat.net", bloat
Subject: Re: [Cerowrt-devel] [Bloat] Comcast upped service levels -> WNDR3800 can't cope...
Message-Id: <88DDD48D-8854-4E6E-BE02-F761C13C9201@gmx.de>

Hi Jonathan, hi List,

On Sep 1, 2014, at 23:43, Jonathan Morton wrote:

> On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:
>
>>>> But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without.
>>>> The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn the shaping knob up to a value quite close to the hardware's unshaped capability (e.g. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.
>>>>
>>>> It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a "proper" desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?
>
>>> Both good questions worth further exploration.
>
>> Doing some napkin math and some spec reading, I think the memory bus is a likely factor. The G4 had a fairly impressive memory bus for its day (64-bit?). The WNDR3800 appears to use its RAM in an x16 configuration (based on the numbers on the memory parts). It may have *just* enough bandwidth to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC).
>>
>> The typical way I've seen a home router benchmarked for the "marketing numbers" is to flow TCP data to/from a wifi client to a wired client. A single socket is used, for a uni-directional stream of data. So long as they can hit peak rates (peak MCS), it gets marked as good for "up to 900Mbps!!" or whatever they want to say.
>>
>> The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB): the various buffers for fq_codel and HTB may stay in L2 on the G4, but there simply isn't room for that in the AR7161, which puts further pressure on the bus.
>
> I don't think that's it.
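[Editor's note: the 500-clocks-per-packet estimate above is easy to sanity-check with quick arithmetic. The ~1.5 GHz core clock assumed for the late PowerBook G4 is my assumption, not stated in the thread.]

```python
# Rough sanity check of the shaping-cost estimate above.
# Assumption (not from the thread): a late PowerBook G4 runs at ~1.5 GHz.

CPU_HZ   = 1.5e9          # assumed G4 core clock
RATE_BPS = 400e6          # 400 Mbit/s one-way, as in Jonathan's test
PKT_BITS = 1500 * 8       # full-size Ethernet payload

pkts_per_sec = RATE_BPS / PKT_BITS        # ~33,333 packets/s at MTU size
clocks_avail = CPU_HZ / pkts_per_sec      # clock budget per packet
print(f"{pkts_per_sec:.0f} pkt/s, {clocks_avail:.0f} clocks available per packet")
print(f"500 clocks is ~{500 / clocks_avail:.1%} of that budget")
```

At the assumed clock rate, 500 clocks is on the order of 1% of the per-packet budget, which is consistent with the shaping overhead "disappearing" on the G4.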
>
> First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which). The desktop version (7457A) used external cache. The G4 was considered to be *crippled* by its FSB by the end of its run, since it never adopted high-performance signalling techniques, nor moved the memory controller on-die; it was quoted that the G5 (970) could move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB. The only reason the G5 never made it into a PowerBook was that it wasn't battery-friendly in the slightest.
>
> But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old.
>
> More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x 32-bit, and I can push a steady 30MB/sec in both directions simultaneously, which corresponds in total to about half the PCI bus's theoretical capacity. (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of the JEDEC standards for SDRAM.
>
> The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy. Further, a little digging suggests that the memory bus should be 32 bits wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed. For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time.
>
> So that takes care of the argument for simply moving the payload around.
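[Editor's note: the bus figures quoted above check out arithmetically; a quick sketch using only numbers from the paragraph itself:]

```python
# Verifying the bus-bandwidth claims above.
MB = 1e6

pci_bw   = 33e6 * 4      # 33 MHz x 32-bit PCI: ~132 MB/s theoretical
measured = 30 * MB * 2   # 30 MB/s each way, both directions at once
print(f"PCI theoretical: {pci_bw / MB:.0f} MB/s, measured total: "
      f"{measured / MB:.0f} MB/s ({measured / pci_bw:.0%} of theoretical)")

sdram_16 = 66e6 * 2      # 16-bit SDRAM at 66 MHz: ~132 MB/s
print(f"16-bit 66 MHz SDRAM: {sdram_16 / MB:.0f} MB/s -- matches PCI")
```

60 MB/s total is roughly 45% of the 132 MB/s theoretical PCI ceiling ("about half", as stated), and a 16-bit RAM at 66 MHz does indeed match 32-bit 33 MHz PCI byte for byte.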
> In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off.

	That makes me wonder: couldn't some reasonable batching help here? We know we can shape roughly 50Mbps combined with no batching (which corresponds to batching at packet size, I guess), so get out the envelope, turn it around and go: 50*1000*1000 / (1500*8) = 4166.67, so that would be around 4000 packets per second (well, with small packets it might be more). So what about batching up enough packets to take at least 250µs to transfer, pushing the whole bunch to the tx queue, and then sleeping for 250µs? (I have not looked at the code, and something like this might already be happening; the idea is to figure out the highest sustainable shaping frequency and make sure we do not attempt to make the shaping decisions more often than that. Sure, batching will introduce some latency, but if my silly numbers are roughly right, I would be quite happy to accept 1/4 ms of added latency if the wndr3[7|8]00 would still last a bit into the future ;) ) Heck, what about running HTB simply from a 1ms timer instead of from a data-driven timer?

	Now, since HTB was quite costly for the machines of the time when it appeared on the scene, I wonder whether there are any computation-demand mitigation mechanisms somewhere in the code base…

Best Regards
	Sebastian

> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint, as my PowerBook was during my tests.
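[Editor's note: Sebastian's envelope math on batching can be sketched as follows. The 50 Mbit/s rate and 250 µs tick are the figures from the mail; the fixed-tick batching scheme itself is his hypothetical, not anything in the HTB code.]

```python
# Sketch of the proposed idea: run the shaper from a fixed timer and
# release whole batches, instead of re-arming a timer per packet.

RATE_BPS = 50_000_000   # 50 Mbit/s combined shaped rate, from the mail
TICK_S   = 250e-6       # proposed shaping interval (250 µs)
MTU      = 1500         # bytes

def batch_budget(rate_bps: float, tick_s: float) -> float:
    """Bytes the shaper may release per timer tick at the given rate."""
    return rate_bps * tick_s / 8

budget = batch_budget(RATE_BPS, TICK_S)
print(f"budget per tick: {budget:.1f} bytes (~{budget / MTU:.1f} MTU packets)")
print(f"worst-case added latency: one tick = {TICK_S * 1e6:.0f} µs")
```

Note that at 50 Mbit/s a 250 µs tick's budget is barely one full-size packet, so the savings from batching would come mainly from small packets or from higher target rates, and the worst-case added latency is bounded by the tick interval, matching the 1/4 ms figure in the mail.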
>
> And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did. I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi. (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.) Atheros publicity materials indicate that they increased the I-cache to 64KB for performance reasons, but saw no need to increase the D-cache at the same time.
>
> Which brings me back to the timers, and other items of black magic.
>
> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.
>
> - Jonathan Morton