[Codel] fq_codel : interval servo

Dave Taht dave.taht at gmail.com
Sun Sep 2 14:08:19 EDT 2012

On Sat, Sep 1, 2012 at 5:53 AM, Eric Dumazet <eric.dumazet at gmail.com> wrote:
> On Fri, 2012-08-31 at 09:59 -0700, Dave Taht wrote:
>> I realize that 10GigE and datacenter host based work is sexy and fun,
>> but getting stuff that runs well in today's 1-20Mbit environments is
>> my own priority, going up to 100Mbit, with something that can be
>> embedded in a SoC. The latest generation of SoCs all do QoS in
>> hardware... badly.
> Maybe 'datacenter' word was badly chosen and you obviously jumped on it,
> because it meant different things for you.

I am hypersensitive about optimizing for sub-ms problems when there are
huge multi-second problems like in cable, wifi, and cellular. Recent paper:

If the srtt idea can scale UP as well as down sanely, cool. I'm
concerned about how different TCPs might react to this and have a
long comment about the placement of this at this layer at the bottom
of this email.

> Point was that when your machine has flows with quite different RTT, 1
> ms on your local LAN, and 100 ms on different continent, current control
> law might clamp long distance communications, or have slow response time
> for the LAN traffic.

With fq_codel that is far less likely, and if long-distance and local
streams do collide in a single queue there, what happens if you fiddle
with srtt?

> The shortest path you have, the sooner you should drop packets because
> losses have much less impact on latencies.


> Yuchung idea sounds very good and my intuition is it will give
> tremendous results for standard linux qdisc setups ( a single qdisc per
> device)

I tend to agree.

> To get similar effects, you could use two (or more) fq codels per
> ethernet device.


> One fq_codel with interval = 1 or 5 ms for LAN communications
> One fq_codel with interval = 100 ms for other communications

And one mfq_codel with a calculated maxpacket, weird interval, etc.,
for wifi.

> tc filters to select the right qdisc by destination addresses

Meh. A simple default might be "Am I going out the default route for this?"

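For illustration, Eric's two-qdisc suggestion might look something like this in tc, with a LAN-prefix match standing in for the "not the default route" test. The interface name, handles, and the 192.168.1.0/24 prefix are assumptions here, not from the original mail:

```shell
# Two fq_codel instances under a 2-band prio qdisc: a short interval
# for LAN traffic, the 100 ms default for everything else.
# (eth0, the handles, and 192.168.1.0/24 are illustrative only.)
tc qdisc add dev eth0 root handle 1: prio bands 2 \
    priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel interval 5ms
tc qdisc add dev eth0 parent 1:2 handle 20: fq_codel interval 100ms
# Steer traffic destined for the local /24 into the short-interval band.
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
    match ip dst 192.168.1.0/24 flowid 1:1
```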
> Then we are a bit far from codel spirit (no knob qdisc)
> I am pretty sure you noticed that if your ethernet adapter is only used
> for LAN communications, you have to setup codel interval to a much
> smaller value than the 100 ms default to get reasonably fast answer to
> congestion.

At 100Mbit (as I've noted elsewhere), BQL chooses defaults about double
optimum (6-7k), and gso is currently left on. With those disabled, I tend to run
a pretty congested network, and rarely notice.  That does not mean that
reaction time isn't an issue, it is merely masked so well that I don't care.

> Just make this automatic, because people dont want to think about it.

Like you, I want one qdisc to rule them all, with sane defaults.

I do feel it is very necessary to add one pfifo_fast-like behavior to
fq_codel: deprioritizing background traffic into its own
set of fq'd flows. A simple way to do that is to have a bkweight of,
say, 20, and only check "q->slow_flows" at that interval of packets.

This is the only way I can think of to survive bittorrent-like flows, and to
capture the intent of traffic marked background.
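A minimal sketch of that bkweight idea, simulated in shell (the 60-dequeue loop and the flow labels are made up for illustration): the background set is only visited on every bkweight-th dequeue, so it gets roughly 1/20th of the scheduling opportunities.

```shell
# Simulate a dequeue schedule with bkweight=20: background ("slow")
# flows are only checked on every 20th dequeue; all other dequeues
# serve the normal flow lists.
bkweight=20
for i in $(seq 1 60); do
  if [ $(( i % bkweight )) -eq 0 ]; then
    echo "dequeue $i: check q->slow_flows"
  else
    echo "dequeue $i: normal flows"
  fi
done | grep -c 'slow_flows'
```

Over 60 dequeues that counts 3 visits to the background set, i.e. the 1-in-20 ratio the bkweight encodes.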

However, I did want to address the using-codel-to-solve-everything issue
as applied to fixing host bufferbloat...

Fixing host bufferbloat by adding local tcp awareness is a neat idea,
don't let me stop you! But...

Codel will push stuff down to, but not below, 5ms of latency (or
target). In fq_codel you will typically end up with 1 packet outstanding in
each active queue under heavy load. At 10Mbit it's pretty easy to
have it strain mightily and fail to get to 5ms, particularly on torrent-like
workloads.

The "right" amount of host latency to aim for is ... 0, or as close to it as
you can get.  Fiddling with codel target and interval on the host to
get less host latency is well and good, but you can't get to 0 that way...

The best queue on a host is no extra queue.

I spent some time evaluating linux fq_codel vs the ns2 nfq_codel version I
just got working. In 150 bidirectional competing streams, at 100Mbit,
it retained about 30% fewer packets in queue (110 vs 140). Next up
on my list is longer RTTs and wifi, but all else was pretty equivalent.

The effects of fiddling with /proc/sys/net/ipv4/tcp_limit_output_bytes
were even more remarkable. At 6000, I would get down to
a nice steady 71-81 packets in queue on that 150 stream workload.

So, I started thinking through and playing with how TSQ works:

At one hop 100Mbit, with a BQL of 3000 and a tcp_limit_output_bytes of 6000,
all offloads off, nfq_codel on both ends, I get single-stream throughput
of 92.85Mbit. Backlog in qdisc is 0.

2 netperf streams, bidirectional: 91.47 each, darn close to theoretical, less
than one packet in the backlog.

4 streams: backlog a little over 3 (summing to 91.94Mbit in each direction).

8 streams: backlog of 8 (optimal throughput).

Repeating the 8-stream test with tcp_limit_output_bytes of 1500, I get
around 3 packets outstanding, and optimal throughput. (1-stream test:
42Mbit throughput (obviously starved); 150 streams: 82...)

8 streams, limit set to 127k, I get 50 packets outstanding in the queue,
and the same throughput. (150 streams, ~100)

So I might argue that a more "right" number for tcp_limit_output_bytes is
not 128k per TCP socket, but (BQL_limit*2/active_sockets), in conjunction
with fq_codel. I realize that that raises interesting questions as to when
to use TSO/GSO, and how to schedule tcp packet releases, and pushes
the window reduction issue all the way up into the tcp stack rather
than responding to indications from the qdisc... but it does
get you closer to a 0 backlog in qdisc.
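That sizing rule is easy to tabulate. Using the BQL limit of 3000 bytes from the tests above (the socket counts below are just sample values): one active socket gets the 6000-byte limit that measured well.

```shell
# Hypothetical per-socket TSQ limit: BQL_limit * 2 / active_sockets.
# bql_limit of 3000 bytes matches the test setup in this mail;
# the socket counts are illustrative.
bql_limit=3000
for sockets in 1 2 8; do
  echo "active_sockets=$sockets tcp_limit_output_bytes=$(( bql_limit * 2 / sockets ))"
done
```

Note that at 8 sockets the limit drops below one full-size packet, which is one of the scheduling questions this proposal raises.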

And *usually* the bottleneck link is not on the host but on something
in between, and that's where your signalling comes from, anyway.

Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"
