[Codel] hardware multiqueue in fq_codel?

Dave Taht dave.taht at gmail.com
Thu Jul 11 17:48:05 PDT 2013


On Thu, Jul 11, 2013 at 5:06 PM, Eric Dumazet <eric.dumazet at gmail.com> wrote:
> On Thu, 2013-07-11 at 14:18 -0700, Dave Taht wrote:
>
>> I have incidentally long thought that you are also tweaking target and
>> interval for your environment?
>
> Google data centers are spread all over the world, and the speed of
> light was not increased by wonderful Google engineers;
> that's a real pity, I can tell you.

Heh. Interesting. I once circulated a proposal among some VCs for
developing neutrino-based networking. After extolling the advantages
for 3 pages (shrinking RTTs halfway around the world by a FACTOR OF
3!), I then moved, on page 4, to the problem of shrinking the emitter
and detector from their present sizes down to handset size,
with a big picture of the sun, and a big picture of the mountain full
of water...

It was a nice April 1 prank, that... still, it would be nice to do one
day. So many more crazy ideas were being circulated in the late 90s.

>
>>
>> > Whole point of codel is that number of packets in the queue is
>> > irrelevant. Only sojourn time is.
>>
>> Which is a lovely thing in worlds with infinite amounts of memory.
>
> The limit is a tunable, not hard-coded in qdisc, like other qdiscs.
>
> I chose a packet count limit because pfifo was the target, and pfifo
> limit is given in packets, not in bytes.
>
> There is a bfifo variant for a byte-limited fifo, so feel free to add
> bcodel and fq_bcodel.

I've often thought that a byte limit was far saner than a strict
packet limit, yes.
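
To make that concrete, something like this is what I have in mind - a
minimal userspace sketch, not kernel code, and "bcodel" here is purely
hypothetical: the hard limit gets accounted in bytes (as bfifo does),
while the congestion signal stays the codel sojourn time:

/*
 * Hypothetical bcodel-style accounting. The admission limit is in
 * bytes, like bfifo; the drop decision remains sojourn-time based.
 * All names and constants are illustrative.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct byte_queue {
	size_t backlog_bytes;	/* bytes currently queued */
	size_t limit_bytes;	/* tunable hard cap, bfifo-style */
};

/* Hard admission check at enqueue: tail-drop past the byte limit. */
static bool enqueue_ok(struct byte_queue *q, size_t pkt_len)
{
	if (q->backlog_bytes + pkt_len > q->limit_bytes)
		return false;
	q->backlog_bytes += pkt_len;
	return true;
}

/* Codel's real signal at dequeue: how long the packet sat in the
 * queue, not how many packets are queued. */
static bool over_target(uint64_t enqueue_ns, uint64_t now_ns,
			uint64_t target_ns)
{
	return now_ns - enqueue_ns > target_ns;	/* e.g. target = 5ms */
}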

> Linux qdiscs have some existing semantics.
>
> If every new qdisc had strange and new behavior, what would happen?

http://xkcd.com/927/

:)

>>
>> > Now if your host has memory concerns, that's a separate issue, and you
>> > can adjust the qdisc limit, or add a callback from mm to be able to
>> > shrink the queue in case of memory pressure, if you deal with
>> > non-elastic flows.
>>
>> So my take on this is that the default limit should be 1k on devices
>> with less than 256MB of ram overall, and 10k (or more) elsewhere. This
>> matches current txqueuelen behavior and has the least surprise.
>
> What is current 'txqueuelen behavior'?

The default is largely 1000. I have seen a multitude of other settings,
ranging from 0 to tens of thousands, mostly done from userspace, the
vast majority of which are wrong for most workloads...
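
The two-tier default I proposed above would look something like this
(a sketch only; the thresholds are the ones from this thread, not
anything merged anywhere):

#include <stddef.h>

static unsigned int default_qdisc_limit(size_t total_ram_bytes)
{
	const size_t small_device = 256UL * 1024 * 1024;	/* 256MB */

	if (total_ram_bytes < small_device)
		return 1000;	/* small router: matches the usual txqueuelen */
	return 10000;		/* bigger hosts: 10k packets, as suggested */
}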

> Current 'txqueuelen behavior' is a limit of 16,000 packets on typical
> 10G hardware with 16 tx queues, not 1000 as you seem to believe.

Huh? No, I'm aware that multiqueue works like that.

It's one of my kvetches: I've been able to easily crash small
wireless routers with multiple SSIDs at the default queue lengths,
by using classification (e.g. rrul).

This was sort of my motivation for wanting a single qdisc for multiple
hardware queues, or at least a single limit covering multiple queues,
since by and large the 3 extra queues in wifi are unused. But as best
I can tell from our dialog today, that's not doable. I don't care that
much about the single queue lock either (still single core here)...
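
What I was picturing was a single budget shared by all the tx queues,
rather than per-queue limits that multiply with queue count. A rough
userspace sketch (C11 atomics standing in for whatever locking the
kernel would actually need; nothing here exists in the tree):

#include <stdatomic.h>
#include <stdbool.h>

struct shared_limit {
	atomic_uint backlog;	/* packets queued across all tx queues */
	unsigned int limit;	/* one global cap, e.g. 1000 */
};

static bool shared_enqueue(struct shared_limit *s)
{
	if (atomic_fetch_add(&s->backlog, 1) >= s->limit) {
		atomic_fetch_sub(&s->backlog, 1);
		return false;	/* over the shared cap: drop */
	}
	return true;
}

static void shared_dequeue(struct shared_limit *s)
{
	atomic_fetch_sub(&s->backlog, 1);
}

With that, 16 tx queues share one 1000-packet budget instead of
holding 16,000 packets between them.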

> If txqueuelen were a percentage of available memory, we would be in a
> bad situation: with txqueuelen being 100 15 years ago, it would be
> 100,000 today, since RAM has increased by 3 orders of magnitude.
>
> 0.1% would be too much for my laptop with 4GB of RAM, and not enough
> for your 256MB host.

I concur that percentage-based sizing is nuts. An outer limit derived
from the device's current link speed would not be horrible, though it
should be byte-based rather than packet-based.
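
For example (an illustrative sketch; the 100ms worst-case figure below
is my assumption, not a recommendation):

#include <stdint.h>

/* Size the byte limit to hold at most max_delay_ms worth of traffic
 * at the device's current link rate. */
static uint64_t byte_limit_for_link(uint64_t link_bps,
				    unsigned int max_delay_ms)
{
	return link_bps / 8 * max_delay_ms / 1000;
}

/* byte_limit_for_link(100000000ULL, 100) == 1250000: a 100Mbit link
 * allowed to hold 100ms of queue gets a ~1.2MB byte limit. */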

> I do not want to bloat codel/fq_codel with some strange concerns about
> skb truesize resizing. No other qdisc does this; why would
> codel/fq_codel be different?

Actually, that patch has been in OpenWrt for months and months, applied
against the common qdiscs, not just fq_codel.

It could be a compile-time option for small devices, as I said. I
thought we'd discussed this back then?

> This adds a huge performance penalty and extra latency. A router should
> never even touch (read or write) a _single_ byte of the payload.

I tend to agree, but the core problem solved here was memory
starvation from 2k+ byte allocations for (predominantly) 64-byte
packets on the receive path. The performance hit is only incurred
under overload or attack.
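
For anyone who hasn't seen it, the shape of the problem: the driver
hands up a ~2KB buffer even for a 64-byte packet, so queued memory is
dominated by slack. A userspace illustration of the idea (this is not
the OpenWrt patch itself; all names are made up):

#include <stdlib.h>
#include <string.h>

struct pkt {
	unsigned char *data;
	size_t len;		/* wire length, e.g. 64 */
	size_t truesize;	/* allocation actually held, e.g. 2048 */
};

/* Trade one memcpy for reclaiming the oversized rx buffer. */
static void shrink_if_wasteful(struct pkt *p)
{
	unsigned char *tight;

	if (p->truesize < 4 * p->len)
		return;			/* not enough slack to bother */
	tight = malloc(p->len);
	if (!tight)
		return;			/* keep the fat buffer */
	memcpy(tight, p->data, p->len);
	free(p->data);			/* release the 2KB rx buffer */
	p->data = tight;
	p->truesize = p->len;
}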

In my tests I observed a mild increase in forwarding performance, but
there were many other variables at the time. Certainly forwarding
performance on the box I work on is at an all-time high, even when
"attacked". And reliability is up.

> The whole point of having a queue is to absorb bursts: you don't want
> to spend cpu cycles when bursts are coming, or else there won't be
> bursts anymore, but loss episodes.

The default (and somewhat arbitrary) limit of 128 above was chosen so
that if a queue built up during a burst, we still wouldn't bother
touching the packets until things got out of hand - and *it was there
to fix the allocation problem*: without it we ran out of memory in
which to put the bursts in the first place.
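
In other words, under the limit the enqueue path stays cheap, and only
on overflow do we do extra work. A simplified stand-in for the
overlimit policy (fq_codel drops from the fattest flow when over
limit; this just shows the idea, not the kernel code):

#include <stddef.h>

struct flow {
	size_t backlog;		/* bytes queued in this flow */
};

/* Pick the flow with the largest backlog; the caller drops from its
 * head to get back under the limit. */
static size_t fattest_flow(const struct flow *flows, size_t nflows)
{
	size_t fat = 0, i;

	for (i = 1; i < nflows; i++)
		if (flows[i].backlog > flows[fat].backlog)
			fat = i;
	return fat;
}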

I would certainly not mind smarter hardware that had rx rings for
multiple sizes of packets.


-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html

