[Cerowrt-devel] performance numbers from WRT1200AC (Re: Latest build test - new sqm-scripts seem to work; "cake overhead 40" didn't)

Thu Jun 25 16:13:14 EDT 2015

On Thu, Jun 25, 2015 at 2:12 AM, Mikael Abrahamsson <swmike at swm.pp.se> wrote:
> On Wed, 24 Jun 2015, Dave Taht wrote:
>
>> From what I see here you are rarely, if ever, engaging fq_codel
>> properly. Latencies are pretty high. In particular, I would suspect
>> you are hitting offloads hard, and the current (fixed in linux 4.1)
>> codel drop algorithm stops dropping below "maxpacket", which was meant
>> in the ns2 code to be an MTU, but in the linux code ended up being a
>> TSO sized (64k!) packet.
>>
>> tc -s qdisc show # will show the maxpacket.
>
>
> http://swm.pp.se/aqm/qdisc-show.txt

Ugh. 64k maxpacket. I did not even know that was possible until I saw
it (GRO is usually for tcp acks, and peaks at 24k or so).
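
To put a rough number on why that matters: codel stops dropping once
the backlog is at or below maxpacket bytes, so that much data can sit
in the queue unmanaged. A back-of-the-envelope sketch of how long 64k
takes to drain (the link rates are illustrative assumptions, not
measurements from this thread, and framing overhead is ignored):

```sh
#!/bin/sh
# drain_ms BYTES RATE_MBIT: milliseconds needed to serialize BYTES
# at RATE_MBIT megabits per second (pure arithmetic, no overhead).
drain_ms() {
    awk -v b="$1" -v r="$2" \
        'BEGIN { printf "%.2f\n", b * 8 / (r * 1000000) * 1000 }'
}

# 65536 is the 64k maxpacket discussed above; rates are examples only.
for rate in 10 100 1000; do
    echo "${rate} Mbit/s: $(drain_ms 65536 "$rate") ms"
done
```

At a shaped rate well below a gigabit, that is several milliseconds of
standing queue before codel even considers dropping.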

I will check to see if openwrt backported that crucial codel patch. Or
we can switch sqm "simplest" to use tbf, or we can do cake's
peeling....
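
For anyone wanting to spot this on their own box, here is a minimal
sketch that filters "tc -s qdisc show" output for qdiscs whose
maxpacket exceeds a threshold. It assumes the usual iproute2 layout
(a "qdisc ... dev NAME ..." header line followed by a stats line
starting with "maxpacket"); the function name and the 1514 default
threshold are my own choices for illustration:

```sh
#!/bin/sh
# big_maxpacket [LIMIT]: read `tc -s qdisc show` output on stdin and
# print "device maxpacket" for every qdisc whose maxpacket is > LIMIT.
big_maxpacket() {
    awk -v limit="${1:-1514}" '
        # Remember which device the current qdisc block belongs to.
        $1 == "qdisc" {
            for (i = 1; i < NF; i++) if ($i == "dev") dev = $(i + 1)
        }
        # The fq_codel/codel stats line starts with "maxpacket N".
        $1 == "maxpacket" && $2 + 0 > limit + 0 { print dev, $2 }
    '
}

# Usage on the router:
#   tc -s qdisc show | big_maxpacket
```

Anything it prints is a queue that has seen a super-packet from
offloads at some point.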

> I also discovered that I wasn't running ECN; I had set tcp_ecn = 2 on the
> linux box. All the tests done in
> http://swm.pp.se/aqm/flent-mikabr-150625-1.tar are now done with ECN
> working, without iperf3 running, and with and without SQM, but with default
> offload setting.
>
> Also, now iperf3 says "0 packet lost" when I use that.

cool. :) Your router has it off, too, and if you plan to use it for
other things than routing, you might want to turn it on (gleaned from
your metadata, thx!)

>> 2) Please run your flent tests with -x --disable-log
>
>
> Done.
>
>> Use -t "title" to differentiate between variables under test.
>
>
> Done.
>
>> 3) I also tend to use flent's --remote-metadata=root@your_openwrtbox
>> to get the stats on that box into the metadata. You have to add your
>> local .ssh/id_rsa.pub key to
>> your_openwrtbox:/etc/dropbear/authorized_keys file to do this.
>
>
> Done.

Unfortunately the core piece of metadata I wanted from the router was
the qdisc statistics. Didn't parse. Will file a bug.

>
>> 4) With all that in hand, sticking up a tarball of the results makes
>> for easy plotting of various other graphs, and using the flent-gui,
>> you can combine results from each run easily, also.
>
>
> See above.
>
>> 5) try disabling offloads on all interfaces on the router (or running
>> cake)
>
>
> This is my next thing to test, I have some other things I need to try first.
>
>> My usual suite of tests is rrul, rrul_be, tcp_1up, tcp_1down, and
>> tcp_2up_delay.
>
>
> Done.
>
>> and rtt_fair (if you have more than one target server available)...
>> all without the iperf stuff....
>
>
> I do not have another machine readily available here at home.
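
Pulling the flent options and test names from the points above into
one place, a sketch of the suite might look like this. The server
name and router login are placeholders (assumptions), and the script
only prints the command lines rather than running them:

```sh
#!/bin/sh
# Placeholders (assumptions): substitute your own netperf server and
# the ssh login of your router.
SERVER="${SERVER:-netperf.example.org}"
ROUTER="${ROUTER:-root@your_openwrtbox}"
TESTS="rrul rrul_be tcp_1up tcp_1down tcp_2up_delay"

# flent_cmds TITLE: print one flent command line per test in TESTS,
# using the options mentioned in this thread. Nothing is executed.
flent_cmds() {
    for t in $TESTS; do
        echo "flent $t -H $SERVER -t \"$1\" -x --disable-log --remote-metadata=$ROUTER"
    done
}

flent_cmds "sqm-off-default-offloads"
```

Change the title for each variable under test (sqm on/off, offloads
on/off) so the runs can be told apart when plotted together.
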
>
>> 6) I am pretty interested as to what happens *without* sqm at the max
>> forwarding rate with fq_codel engaged on all these tests.
>
>
> This is included. Why did you want to do this test? Just to see if the
> WRT1200AC can do wirespeed forwarding with fq_codel on? Because with the
> setup I have, I don't see how there can be any buffering going on because
> it's a single gig port both in and out.

Well, it was good to see fq_codel actually engaging on a few of your
hardware queues. There are plenty of reasons why your egress might not
match your ingress periodically and vice versa.

To quote from the BQL commit: ( https://lwn.net/Articles/454378/ )

"Hardware queuing limits are typically specified in terms of a number
hardware descriptors, each of which has a variable size. The variability
of the size of individual queued items can have a very wide range. For
instance with the e1000 NIC the size could range from 64 bytes to 4K
(with TSO enabled). This variability makes it next to impossible to
choose a single queue limit that prevents starvation and provides lowest
possible latency.

The objective of byte queue limits is to set the limit to be the
minimum needed to prevent starvation between successive transmissions to
the hardware. The latency between two transmissions can be variable in a
system. It is dependent on interrupt frequency, NAPI polling latencies,
scheduling of the queuing discipline, lock contention, etc. Therefore we
propose that byte queue limits should be dynamic and change in
accordance with networking stack latencies a system encounters."

In particular, aggressive NAPI settings in routers bug me. And I am
unfond of hardware multiqueue as presently implemented. Sometimes I
think it would be better to use the hardware queues for QoS rather
than for fq with birthday problems (flows colliding in the same queue).

Another test idea for you would be to enable fq_codel on just the main
ethernet devices on the router, and not use hw mq on it at all. (tc
qdisc add dev each_ethernet_device_on_router root fq_codel)
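
As a loop, that could be sketched roughly as below. The device names
are placeholders (assumptions), and I use "replace" rather than "add"
here only so the script can be re-run safely. By default it just
prints the commands; set RUN=1 (as root) to apply them:

```sh
#!/bin/sh
# Placeholder device names (assumption); list the router's real
# ethernet devices here, e.g. as shown by `ip link`.
DEVICES="${DEVICES:-eth0 eth1}"

# run CMD...: echo the command unless RUN=1, in which case execute it.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$@"; fi
}

# Put a single fq_codel at the root of each device.
root_fq_codel() {
    for dev in $DEVICES; do
        run tc qdisc replace dev "$dev" root fq_codel
    done
}

root_fq_codel
```

Check the result afterwards with "tc -s qdisc show".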

2) You still have 15ms of delay at various rates. That is quite a lot
more than what I see on a rangeley, where we get below 5ms on sparse
traffic (partially due to local cpu scheduling delays), and 200*us* or
so when measured at the router with cake. I also see packet loss on
the measurement flows in most tests, which is possibly due to gro
forcing out smaller packets, or some other factor.

It is possible, however, that the observed buffering here is actually
on your host, server, or switch (you have pfifo_fast on your host).
Try fq_codel on host and server (and/or sch_fq) and see what happens.
Disable tso/gro/gso on your server/host also. That leaves the switch,
which I have no insight into. What switch chip is it? (see
/etc/config/network) - on the cerowrt project we got less buffering
out of the switch by enabling jumbo frames.
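
A sketch of those two host-side changes, assuming a single interface
(the name eth0 is a placeholder) and that ethtool and tc are
installed. As written it only prints the commands; set RUN=1 (as
root) to apply them:

```sh
#!/bin/sh
# Placeholder interface name (assumption).
IFACE="${IFACE:-eth0}"

# run CMD...: echo the command unless RUN=1, in which case execute it.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$@"; fi
}

# debloat_host IFACE: disable offloads and swap pfifo_fast for fq_codel.
debloat_host() {
    # Turn off tso/gso/gro so the qdisc sees MTU-sized packets.
    run ethtool -K "$1" tso off gso off gro off
    # Replace the default root qdisc (pfifo_fast) with fq_codel.
    run tc qdisc replace dev "$1" root fq_codel
}

debloat_host "$IFACE"
```

To try sch_fq instead, swap "fq_codel" for "fq" in the tc line.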

3) As for latency damage:

http://snapon.lab.bufferbloat.net/~d/gro_damage_to_latency.png # this
can also be somewhat due to ecn.

4) The diffserv marking behavior here was puzzling (losing BE
entirely) - my assumption is that this was also with the simultaneous
iperf flows (280mbit on the uplink)?

http://snapon.lab.bufferbloat.net/~d/puzzling_diffserv_marking_behavior.png

5) Also on your hosts and servers, two sysctls seem to be helping
reduce the local bloat.

net.ipv4.tcp_limit_output_bytes = 8192 # I have seen 4k
net.ipv4.tcp_notsent_lowat = 8192 # or 16k

I do not really know the right settings for either of these.

The default for the first was originally proposed to be 4k or so, but
that interacted badly with aggregating wifi drivers on single threaded
benchmarks. (sigh) More recently it was bumped to 256k to make the xen
people happier. (vms suck) On raw hardware, it seems like the lower
settings are pretty optimal in my range of tests....

The second is Cheshire's big change to osx, and has been available in
linux for ages. It increases the number of context switches but keeps
way more traffic in the app, not the kernel. I think the latter can
also be set per socket via setsockopt (TCP_NOTSENT_LOWAT).
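
For experimenting with the two sysctls at runtime, a small sketch.
The 8192 defaults are just the values floated above, not tuned
recommendations, and the script prints the commands unless RUN=1 is
set (as root):

```sh
#!/bin/sh
# Starting values taken from the discussion above; adjust and re-test.
LIMIT_OUTPUT_BYTES="${LIMIT_OUTPUT_BYTES:-8192}"
NOTSENT_LOWAT="${NOTSENT_LOWAT:-8192}"

# run CMD...: echo the command unless RUN=1, in which case execute it.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$@"; fi
}

# Apply both settings to the running kernel (not persistent).
apply_tcp_sysctls() {
    run sysctl -w "net.ipv4.tcp_limit_output_bytes=$LIMIT_OUTPUT_BYTES"
    run sysctl -w "net.ipv4.tcp_notsent_lowat=$NOTSENT_LOWAT"
}

apply_tcp_sysctls
```

These do not survive a reboot, which is handy while comparing
settings between flent runs.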

Have fun exploring more variables! :)

6) I do sometimes find it hard to care about the last 15ms, given the
100s or 1000s of ms we are saving elsewhere.... and how rotten wifi
and wireless are... but I suppose in a world that is moving towards
2-4ms e2e latency along the ISP link on fiber, that 15ms is a lot. (I
am pretty sure everyone here is aware that my personal end-goal for
this work is to get to where I could play music with a drummer down
the street, and that that is about 2.7ms... and I have been waiting
for 20 years to get to be able to do that. I will probably be too old
and deaf and crippled by the time that happens to be able to play
anything but chopsticks, and too poor to live near anyplace with
fiber... but...)

> --
> Mikael Abrahamsson    email: swmike at swm.pp.se

-- 
Dave Täht
worldwide bufferbloat report:
http://www.dslreports.com/speedtest/results/bufferbloat
And:
What will it take to vastly improve wifi for everyone?
https://plus.google.com/u/0/explore/makewififast