From: Dave Taht
To: Mikael Abrahamsson
Cc: cerowrt-devel
Date: Thu, 25 Jun 2015 13:13:14 -0700
Subject: Re: [Cerowrt-devel] performance numbers from WRT1200AC (Re: Latest build test - new sqm-scripts seem to work; "cake overhead 40" didn't)
List-Id: Development issues regarding the cerowrt test router project

On Thu, Jun 25, 2015 at 2:12 AM, Mikael Abrahamsson wrote:
> On Wed, 24 Jun 2015, Dave Taht wrote:
>
>> From what I see here you are rarely, if ever, engaging fq_codel
>> properly. Latencies are pretty high. In particular, I would suspect
>> you are hitting offloads hard, and the current (fixed in linux 4.1)
>> codel drop algorithm stops dropping below "maxpacket", which was meant
>> in the ns2 code to be a MTU, but in the linux code ended up being a
>> TSO sized (64k!) packet.
>>
>> tc -s qdisc show # will show the maxpacket.
>
> http://swm.pp.se/aqm/qdisc-show.txt

Ugh. 64k maxpacket. I did not even know that was possible until I saw
it (GRO is usually for tcp acks, and peaks at 24k or so).

I will check to see if openwrt backported that crucial codel patch. Or
we can switch sqm "simplest" to use tbf, or we can do cake's
peeling....

> I discovered also that I wasn't running ECN, I had set tcp_ecn = 2 on the
> linux box. All the tests done in
> http://swm.pp.se/aqm/flent-mikabr-150625-1.tar are now done with ECN
> working, without iperf3 running, and with and without SQM, but with default
> offload setting.
>
> Also, now iperf3 says "0 packet lost" when I use that.

cool. :)

Your router has it off, too, and if you plan to use it for other
things than routing, you might want to turn it on (gleaned from your
metadata, thx!)

>> 2) Please run your flent tests with -x --disable-log
>
> Done.
>
>> Use -t "title" to differentiate between variables under test.
>
> Done.
>
>> 3) I also tend to use flent's --remote-metadata=root@your_openwrtbox
>> to get the stats on that box into the metadata. You have to add your
>> local .ssh/id_rsa.pub key to
>> your_openwrtbox:/etc/dropbear/authorized_keys file to do this.
>
> Done.

Unfortunately the core piece of metadata I wanted from the router was
the qdisc statistics. Didn't parse. Will file bug.
>
>> 4) With all that in hand, sticking up a tarball of the results makes
>> for easy plotting of various other graphs, and using the flent-gui,
>> you can combine results from each run easily, also.
>
> See above.
>
>> 5) try disabling offloads on all interfaces on the router (or running
>> cake)
>
> This is my next thing to test, I have some other things I need to try first.
>
>> My usual suite of tests is rrul, rrul_be, tcp_1up, tcp_1down, and
>> tcp_2up_delay.
>
> Done.
>
>> and rtt_fair (if you have more than one target server available)...
>> all without the iperf stuff....
>
> I do not have another machine readily available here at home.
>
>> 6) I am pretty interested as to what happens *without* sqm at the max
>> forwarding rate with fq_codel engaged on all these tests.
>
> This is included. Why did you want to do this test? Just to see if the
> WRT1200AC can do wirespeed forwarding with fq_codel on? Because with the
> setup I have, I don't see how there can be any buffering going on because
> it's a single gig port both in and out.

Well, it was good to see fq_codel actually engaging on a few of your
hardware queues.

There are plenty of reasons why your egress might not match your
ingress periodically and vice versa. To quote from the BQL commit:
( https://lwn.net/Articles/454378/ )

"Hardware queuing limits are typically specified in terms of a number
hardware descriptors, each of which has a variable size. The variability
of the size of individual queued items can have a very wide range. For
instance with the e1000 NIC the size could range from 64 bytes to 4K
(with TSO enabled). This variability makes it next to impossible to
choose a single queue limit that prevents starvation and provides lowest
possible latency.

The objective of byte queue limits is to set the limit to be the
minimum needed to prevent starvation between successive transmissions to
the hardware. The latency between two transmissions can be variable in a
system.
It is dependent on interrupt frequency, NAPI polling latencies,
scheduling of the queuing discipline, lock contention, etc. Therefore we
propose that byte queue limits should be dynamic and change in
accordance with networking stack latencies a system encounters."

In particular, aggressive NAPI settings bug me, in routers. And I am
unfond of hardware multiqueue as presently implemented. Sometimes I
think it would be better to use the hardware queues for QoS rather than
fq with birthday problems.

Another test idea for you would be to enable fq_codel on just the main
ethernet devices on the router, and not use hw mq on it at all.

(tc qdisc add dev each_ethernet_device_on_router root fq_codel)

2) You still have 15ms of delay at various rates. That is quite a lot
more than what I see on a rangeley (where we get below 5ms on sparse
traffic (partially due to local cpu scheduling delays), and 200*us* or
so when measured at the router with cake)

I also see packet loss of the measurement flows in most tests, which
is possibly due to gro forcing out smaller packets, or some other
factor.

It is possible, however, that the observed buffering here is actually
on your host, server, or switch. (you have pfifo_fast on your host)

Try fq_codel on host and server (and/or sch_fq) and see what happens.
Disable tso/gro/gso on your server/host also.

That leaves the switch, which I have no insight into. What switch chip
is it? (see /etc/config/network) - on the cerowrt project we got less
buffering out of the switch by enabling jumbo frames.

3) As for latency damage:

http://snapon.lab.bufferbloat.net/~d/gro_damage_to_latency.png
# this can also be somewhat due to ecn.

4) The diffserv marking behavior here was puzzling (losing BE
entirely) - my assumption is this was also with the simultaneous iperf
flows (280mbit on the uplink)?

http://snapon.lab.bufferbloat.net/~d/puzzling_diffserv_marking_behavior.png

5) Also on your hosts and servers, two sysctls seem to be helping
reduce the local bloat.

net.ipv4.tcp_limit_output_bytes = 8192 # I have seen 4k
net.ipv4.tcp_notsent_lowat = 8192 # or 16k - I do not know the right settings really for either of these

The default for the first was originally proposed to be 4k or so, but
that interacted badly with aggregating wifi drivers on single threaded
benchmarks. (sigh) More recently it was bumped to 256k to make the xen
people happier. (vms suck) On raw hardware, it seems like the lower
settings are pretty optimal in my range of tests....

The second is Cheshire's big change to osx and available in linux for
ages - It increases the number of context switches but keeps way more
traffic in the app, not the kernel. I think the latter can also be set
via a setsockopt.

Have fun exploring more variables! :)

6) I do sometimes find it hard to care about the last 15ms, given the
100s or 1000s of ms we are saving elsewhere.... and how rotten wifi
and wireless are... but I suppose in a world that is moving towards
2-4ms e2e latency along the ISP link on fiber, that 15ms is a lot.

(I am pretty sure everyone here is aware that my personal end-goal for
this work is to get to where I could play music with a drummer down
the street, and that that is about 2.7ms... and I have been waiting
for 20 years to get to be able to do that. I will probably be too old
and deaf and crippled by the time that happens to be able to play
anything but chopsticks, and too poor to live near anyplace with
fiber... but...)

> --
> Mikael Abrahamsson    email: swmike@swm.pp.se

-- 
Dave Täht
worldwide bufferbloat report:
http://www.dslreports.com/speedtest/results/bufferbloat
And:
What will it take to vastly improve wifi for everyone?
https://plus.google.com/u/0/explore/makewififast
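
On the setsockopt route mentioned in 5): a minimal Python sketch, assuming a Linux host, of setting the not-sent low-water mark on one socket instead of system-wide. The helper name set_notsent_lowat is hypothetical, the 8192 value is just the one floated above, and the fallback option number 25 is the Linux value (an assumption that will not hold on other systems).

```python
import socket

# Python exports socket.TCP_NOTSENT_LOWAT on recent builds; fall back to
# the Linux option number (25) if the name is missing. Assumption: Linux.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def set_notsent_lowat(sock, nbytes):
    """Set the per-socket not-sent low-water mark and read it back.

    Hypothetical helper: per-socket counterpart of the
    net.ipv4.tcp_notsent_lowat sysctl.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, nbytes)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # 8192 mirrors the sysctl value tried above; not a recommendation.
        print("notsent_lowat now:", set_notsent_lowat(s, 8192))
    finally:
        s.close()
```

Setting it per socket leaves other applications on the host untouched, which may be easier to experiment with than the sysctl.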