some AQM work thus far (was: Re: cerowrt-bql-3 available)

Mon Dec 19 07:04:06 EST 2011

I've kind of switched my focus to ethernet for the nonce...

On Mon, Dec 19, 2011 at 10:50 AM, Dave Taht <dave.taht at gmail.com> wrote:

> (PLEASE NOTE: you can prototype schedulers/AQMs/shapers just fine on an x86 box)

Actually, um, er, you can't. If you want to operate at a line rate
(100Mbit or 10Mbit, for example) and do anything with these schedulers,
it helps to have a bql-enabled kernel. The one I'm using is up at:

http://huchra.bufferbloat.net/~d/bql/

If you want to be running via htb or hfsc to create a 'soft' line rate
(like 4Mbit), bql is not strictly required, but it does seem to help.
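
A minimal sketch of the htb variant of such a 'soft' line rate, using
standard tc syntax. The helper name, the handle/class numbers and the
4mbit value are assumptions for illustration, not the exact setup used
in these tests. It needs root and is meant to be run once on an
interface with no existing htb setup.

```shell
# soft_line_rate IFACE RATE  (hypothetical helper name; needs root)
soft_line_rate() {
    iface=$1
    rate=$2
    # Replace the root qdisc with HTB; unclassified traffic falls into class 1:10.
    tc qdisc replace dev "$iface" root handle 1: htb default 10
    # A single class capped at the soft line rate (ceil = rate, so no borrowing).
    tc class add dev "$iface" parent 1: classid 1:10 htb rate "$rate" ceil "$rate"
}

# Example: soft_line_rate eth0 4mbit
```

The scheduler or AQM under test would then be attached beneath class
1:10.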

Absolutely required is to disable tso, gso, and ufo on the ethernet driver.

I don't know if this is a bug or not, but although the e1000e 'says'
it's turning tso off when running at a 100Mbit line rate, gso remains
on - and you still have to explicitly turn off tso and gso in sequence.

So most of my FQ/AQM scripts have the following in them in the pre-up stage:

IFACE=eth0 # pick an interface...

ethtool -s $IFACE advertise 0x008 # switch to 100Mbit line rate for testing
ethtool -K $IFACE tso off  # Turn off TSO (I'm developing an intense hatred for TSO)
ethtool -K $IFACE gso off # Turn off GSO (on by default since feb, sadly)
ethtool -K $IFACE ufo off  # Turn off UFO (very few drivers have this on)
ethtool -G $IFACE tx 64 # Can't go any lower than this on
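
Since a driver can report one thing and do another, it may be worth
checking afterwards what is actually still enabled. A small sketch,
assuming the usual `ethtool -k` output format (the function name is
made up here; the `.` in the pattern is there because older ethtool
versions print spaces where newer ones print hyphens):

```shell
# Reads `ethtool -k` output on stdin and prints any of TSO/GSO/UFO
# that are still reported as "on". Exit status is non-zero if none are.
offloads_still_on() {
    grep -E '^[[:space:]]*(tcp.segmentation|generic.segmentation|udp.fragmentation).offload: on'
}

# Example:
#   if ethtool -k eth0 | offloads_still_on; then
#       echo "some offloads are still enabled"
#   fi
```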

If you don't turn off tso/gso/ufo, you end up sending colossal
superpacket bursts
through the scheduler - up to 64k in size - which really messes with byte
oriented AQMs by default, and packet oriented ones nearly as much.
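
To put a rough number on that burst size: assuming serialization time
is simply size * 8 / rate, and ignoring Ethernet framing overhead, a
64k superpacket occupies a 100Mbit wire for about 5.2 ms, against about
0.12 ms for a single 1514 byte frame. A throwaway helper for the
arithmetic (the function name is hypothetical):

```shell
# serialization_ms BYTES MBIT
# Prints how long BYTES take to serialize at MBIT megabits/sec, in ms.
# bytes * 8 bits / (mbit * 1e6 bits/s) * 1000 ms/s = bytes * 8 / (mbit * 1000)
serialization_ms() {
    awk -v bytes="$1" -v mbit="$2" \
        'BEGIN { printf "%.2f\n", bytes * 8 / (mbit * 1000) }'
}

# Example:
#   serialization_ms 65536 100   # one 64k superpacket at 100Mbit
#   serialization_ms 1514 100    # one full-size ethernet frame at 100Mbit
```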

I think this explains the issues I was having with SFB, as one example,
and RED as another. They both kind of expect to have more and different
information in the queue about the available streams.

At 100Mbit, I still get the best results by disabling the automatic MIAD
algorithm in BQL and using a hard limit of 6000 bytes instead.
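
A minimal sketch of one way to pin such a limit, assuming the sysfs
layout BQL has in mainline kernels (a byte_queue_limits directory per
tx queue, with a limit_max file). The patched kernel linked above may
expose this differently, and the function name is made up for
illustration. Capping limit_max does not switch the algorithm off; it
only bounds how far it can grow the limit.

```shell
# set_bql_limit_max IFACE BYTES [SYSFS_NET_DIR]
# Writes BYTES into limit_max for every tx queue of IFACE.
# The third argument exists only so the function can be tried on a scratch dir.
set_bql_limit_max() {
    iface=$1
    bytes=$2
    net_dir=${3:-/sys/class/net}
    for f in "$net_dir/$iface"/queues/tx-*/byte_queue_limits/limit_max; do
        [ -e "$f" ] || continue   # kernel or driver without BQL: nothing to do
        echo "$bytes" > "$f"
    done
}

# Example (needs root): set_bql_limit_max eth0 6000
```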

I've experimented with the /proc/sys/net/ipv4/tcp_low_latency
parameter to no real effect.

And you have to remember there are always TWO (just like the sith)
systems involved at minimum. More than once I would get a weird result
and realize that I'd screwed up on the system not under test or the
system in between. I should probably return to SFB and see what a FQing
server does to it, and then look at what TSO/GSO does to it.

So, all the above is seemingly required to get to where an AQM has
some effect on streams,
from my x86 laptop at present.

I note that in an experiment I did recently on cerowrt, a tx ring of
16 was actually slower than a tx ring of 4 at gigE speeds when talking
to its netperf daemon (280 vs 264 Mbit/sec). I still find that result
puzzling, but it seems repeatable. However, what really matters is
forwarding performance, and I tend to think that a larger tx ring will
help in that case, but all the same....

A last note - in doing this testing, and observing the results at
gigabit, on cerowrt (which can barely host the test daemon and run at
280Mbit (they will forward at 500+)), it appears that the connection
to its local switch (gigabit) is so fast and buffered that a decent
portion of the buffering is happening there. FQ in software on cerowrt
rarely actually happens at those speeds.

Similarly, running at GigE line rate, FQ does not happen very often on
the x86 box - packets get sent out in bunches and the packet scheduler
doesn't have enough time before dumping them to the hardware with any
given bunch to do anything intelligent with them. As near as I can
figure, in order to get decent Fair Queuing out of a desktop or server
at these speeds, how we dequeue packets from the TCP portion of the
stack needs to be rethunk, not just the txqueue portion I'm fiddling
with now.

Another thought would be to rate-limit/FQ/manage, on the home
desktops/laptops themselves, multiple outgoing streams to destinations
outside of the home network. I'm not sure to what extent this would
help, but getting the rate down from 1GigE to (say) 4Mbit - which is a
factor of 250 - might allow for more opportunities for more valuable
packets to get moved to the head of line on the originating machine
and compensate for the bursty nature of the incoming streams these
days....

Even if the rest of your family is all banging on the network, your
outgoing bandwidth estimate is then only off by a factor of 4, rather
than 1000.

I wish I had better tools to analyze the 'fairness' of multiple
streams than the mark #1 eyeball. The closest thing I've found of late
was Jain's fairness index, and it doesn't do this.
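
For reference, Jain's index over n per-stream throughputs x_i is
(sum x_i)^2 / (n * sum x_i^2): 1.0 when all streams get the same share,
1/n when one stream gets everything. It collapses a whole run into one
number, so it says nothing about how packets from different streams
interleave over time. A minimal sketch (the function name is made up,
and the inputs are assumed to be throughputs in any one consistent
unit):

```shell
# jain_index X1 X2 ... Xn
# Prints Jain's fairness index for the given per-stream throughputs.
jain_index() {
    printf '%s\n' "$@" | awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            # No usable data (no streams, or all zero): index is undefined.
            if (n == 0 || sumsq == 0) { print "nan"; exit }
            printf "%.3f\n", (sum * sum) / (n * sumsq)
        }'
}

# Example:
#   jain_index 10 10 10 10   # four equal streams
#   jain_index 40 0 0 0      # one stream hogging the link
```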

I have some screenshots and packet captures of QFQ,-TSO,-GSO,-UFO vs
PFIFO_FAST+GSO+TSO if anybody wants them. The first is pr0n that
Nagle would love; the second is more like a horror movie in
comparison. Actually, I'll stick them up on bug #305 after I get on
a less (ironically) bufferbloated network...

http://www.bufferbloat.net/issues/305

I hope to play with Eric's new adaptive RED over the holiday.

-- 
Dave Täht
http://www.bufferbloat.net