From: Jonathan Morton
Date: Tue, 2 Sep 2014 12:27:06 +0300
To: Aaron Wood
Cc: cerowrt-devel@lists.bufferbloat.net, bloat
Subject: Re: [Bloat] Comcast upped service levels -> WNDR3800 can't cope...

On 2 Sep, 2014, at 1:14 am, Aaron Wood wrote:

>> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint, as my PowerBook was during my tests.
> 
> In an ideal case, yes. But is that how this gets managed? (I have no idea; I'm certainly not a kernel developer.)

It would be monumentally stupid to integrate two GigE MACs onto an SoC, and then to call it a "network processor", without adequate DMA support. I don't think Atheros are that stupid. Here's a more detailed datasheet:

http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SW6/DSASW00118777.pdf

"Another memory factor is the ability to support multiple I/O operations in parallel via the WNPU's various ports. The on-chip SRAM in AR7100 WNPUs has 5 ports that enable simultaneous access to and from five sources: the two gigabit Ethernet ports, the PCI port, the USB 2.0 port and the MIPS processor."

It's a reasonable question, however, whether the driver uses that support properly. Mainline Linux kernel code seems to support the SoC but not the Ethernet MACs; if it were just a minor variant of some other Atheros hardware, I'd have expected to see it integrated into one of the existing drivers. Or maybe it is, and my greps just aren't showing it.

At minimum, however, there are MMIO ranges reported for each MAC during OpenWrt's boot sequence. That's where the ring buffers are. The most the CPU has to do is read each packet from RAM and write it into those buffers, or vice versa for receive - I think that's what my PowerBook has to do. Ideally, a bog-standard DMA engine would take over that simple duty. Either way, that's something that has to happen whether traffic is shaped or not, so it's unlikely to be our problem.

The same goes for the wireless MACs, incidentally. These are standard ath9k mini-PCI cards, and the drivers *are* in mainline. There shouldn't be any surprises with them.

> If the packet data is getting moved about from buffer to buffer (for instance, to do the HTB calculations?), could that substantially change the processing load?

The qdiscs only deal with packet and socket headers, not the full packet data. Even then, they largely pass pointers around, inserting the headers into linked lists rather than copying them into arrays. I believe a lot of attention has been directed at cache-friendliness in this area, and the MIPS caches are of conventional type.
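To make that concrete, here's a toy sketch - plain userspace C, not the kernel's actual sk_buff code, and all the names are mine - of why parking a packet in a queue costs a couple of pointer writes rather than a payload copy:

/* Toy illustration, NOT the kernel's sk_buff implementation: a FIFO
 * that enqueues packet *descriptors* by pointer, so the payload is
 * never copied while the packet sits in a qdisc-like queue. */
#include <stdio.h>
#include <stddef.h>

struct pkt {
    struct pkt *next;       /* intrusive list link, like skb->next */
    size_t len;             /* header fields live in the descriptor... */
    unsigned char *data;    /* ...the payload stays where DMA left it */
};

struct queue {
    struct pkt *head, *tail;
};

static void enqueue(struct queue *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;            /* O(1): two pointer writes, no payload copy */
}

static struct pkt *dequeue(struct queue *q)
{
    struct pkt *p = q->head;

    if (p) {
        q->head = p->next;
        if (!q->head)
            q->tail = NULL;
    }
    return p;
}

int main(void)
{
    unsigned char payload[1500];    /* stand-in for a DMA buffer */
    struct pkt p = { NULL, sizeof payload, payload };
    struct queue q = { NULL, NULL };

    enqueue(&q, &p);
    struct pkt *out = dequeue(&q);
    printf("dequeued %zu bytes, payload untouched at %p\n",
           out->len, (void *)out->data);
    return 0;
}

The cache traffic from the shaper itself is dominated by those small descriptors and list links, which is why I think its data structures stay reasonably compact.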
>> Which brings me back to the timers, and other items of black magic.
> 
> Which would point to under-utilizing the processor core, while still having high load? (I'm not seeing that; I'm curious whether that would be the case.)

It probably wouldn't manifest as high system load. Rather, poor timer resolution or latency would show up as excessive delays between packets, during which the CPU is idle. The packet egress times may turn out to be quantised - that would be a smoking gun, if detectable.

>> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.
> 
> That's the usual suspicion. But these are RF-chamber, short-range lab setups where the radios are running at full speed in perfect environments...

Sure. But even turbocharged 'n' gear tops out at 450Mbps signalling, and much less than that is available even theoretically for TCP/IP throughput. My point is that you're probably not running *your* tests over wireless.

> What this makes me realize is that I should go instrument the CPU stats with each of the various operating modes:
> 
> * no shaping, anywhere
> * egress shaping only
> * egress and ingress shaping, at various limits:
>   * 10Mbps
>   * 20Mbps
>   * 50Mbps
>   * 100Mbps

Smaller increments at the high end of the range may prove to be useful. I would expect the CPU usage to climb nonlinearly (busy-waiting) if there's a bottleneck in a peripheral device, such as the PCI bus. The way the kernel classifies that usage may also be revealing - see the P.S. below for a quick way to read the raw counters.

> Heck, what about running HTB simply from a 1ms timer, instead of from a data-driven timer?

That might be what's already happening. We have to figure that out before we can work out a solution.

 - Jonathan Morton
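P.S. On instrumenting the CPU stats: here's a minimal sketch that reads the aggregate "cpu" line from /proc/stat (field layout as documented in proc(5); counts are cumulative ticks since boot, so sample before and after each run and diff). qdisc work is normally booked as softirq time, so a disproportionate rise there - or idle time staying high while throughput collapses - would each tell us something.

/* Print where the kernel books its CPU time, from /proc/stat. */
#include <stdio.h>

int main(void)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq;
    FILE *f = fopen("/proc/stat", "r");

    if (!f) {
        perror("/proc/stat");
        return 1;
    }
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq) != 7) {
        fprintf(stderr, "unexpected /proc/stat format\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("user=%llu nice=%llu system=%llu idle=%llu iowait=%llu "
           "irq=%llu softirq=%llu\n",
           user, nice, sys, idle, iowait, irq, softirq);
    return 0;
}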