[Codel] hardware multiqueue in fq_codel?

Dave Taht dave.taht at gmail.com
Thu Jul 11 17:18:31 EDT 2013


On Thu, Jul 11, 2013 at 11:54 AM, Eric Dumazet <eric.dumazet at gmail.com> wrote:
> On Thu, 2013-07-11 at 11:06 -0700, Dave Taht wrote:
>
>> Gotcha. So what I actually did (Felix did, in OpenWrt, actually) was
>> just make fq_codel the default qdisc, to avoid having to inspect each
>> device to set the number of queues in mq and mqprio. I see, for example,
>> that mq is the default for tg3...
>>
>> http://snapon.lab.bufferbloat.net/~cero2/deb/patches/0003-Use-FQ_codel-by-default.patch
>>
>> I just added it to htb and hfsc too:
>>
>> http://snapon.lab.bufferbloat.net/~cero2/deb/patches/0008-Make-fq_codel-the-default-qdisc-for-htb-and-hfsc.patch
>>
>> There's a patch to obsolete pfifo_fast entirely in openwrt, which is a
>> tad premature.
>>
>> A remaining concern is what this affects:
>>
>> A) people that expect ifconfig X txqueuelen Y to do anything will be
>> misled. Perhaps this could be fixed by having the fq_codel default
>> limit be txqueuelen rather than the default (and overlarge) limit of
>> 10k, but as tons of people are supplying oddball txqueuelens, I tend
>> to think just ignoring txqueuelen going forward is more the right
>> thing.
>>
>> Do you actually get close to 10k packets outstanding in 10GigE under
>> any sane circumstances?
>
>
> 10GigE can send 10,000,000 packets per second.
>
> 10k is only 1ms of buffering, which is pretty low considering that the cpu
> able to restart a queue might be blocked for ~10 ms in a softirq handler.
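Eric's numbers as plain arithmetic: at a constant rate, a 10k-packet queue drains in 10,000 / 10,000,000 s = 1 ms, while a 10 ms stall at that same rate lets 100,000 packets arrive. A minimal sketch, assuming a constant packet rate and ignoring frame size; the helper names are hypothetical:

```c
#include <stdint.h>

/* Microseconds needed to drain `packets` at a constant `pps` packets/s. */
static uint64_t queue_drain_us(uint64_t packets, uint64_t pps)
{
	if (pps == 0)
		return UINT64_MAX; /* link stalled: the queue never drains */
	return packets * 1000000ULL / pps;
}

/* Packets arriving during a stall of `us` microseconds at `pps` packets/s. */
static uint64_t packets_in_interval(uint64_t pps, uint64_t us)
{
	return pps * us / 1000000ULL;
}
```

With Eric's figures, `queue_drain_us(10000, 10000000)` is 1000 us and `packets_in_interval(10000000, 10000)` is 100,000, i.e. ten times the 10k limit.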

Incidentally, I have long suspected that you also tune target and
interval for your environment?

> The whole point of codel is that the number of packets in the queue is
> irrelevant. Only sojourn time is.

Which is a lovely thing in worlds with infinite amounts of memory.
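To make the sojourn-time point concrete, here is a simplified, single-queue reading of the control law published in RFC 8289: drop once sojourn time has stayed above `target` for a full `interval`, then space further drops at `interval / sqrt(count)`. It is a sketch under those assumptions, not the Linux `sch_codel` code; the names are hypothetical and details such as reusing `count` when re-entering the dropping state are left out. Note that nothing in it bounds memory, which is exactly the objection above, and that tuning target and interval only changes two fields:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified CoDel drop decision (after RFC 8289). Times in microseconds.
 * Queue length never appears: only sojourn time does. */
struct codel_state {
	uint64_t target_us;   /* acceptable standing delay, e.g. 5000 */
	uint64_t interval_us; /* how long delay may stay above target, e.g. 100000 */
	uint64_t first_above; /* deadline set when sojourn first exceeds target; 0 = unset */
	uint64_t drop_next;   /* time of the next drop while in dropping state */
	uint32_t count;       /* drops since entering the dropping state */
	bool dropping;
};

/* Integer square root by linear search; count stays small in practice. */
static uint64_t isqrt_u64(uint64_t n)
{
	uint64_t r = 0;
	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Control law: next drop at t + interval / sqrt(count), in fixed point. */
static uint64_t control_law(const struct codel_state *s, uint64_t t)
{
	return t + s->interval_us * 1000 / isqrt_u64((uint64_t)s->count * 1000000);
}

/* Should the packet enqueued at `enq` and dequeued at `now` be dropped? */
static bool codel_should_drop(struct codel_state *s, uint64_t enq, uint64_t now)
{
	uint64_t sojourn = now - enq;

	if (sojourn < s->target_us) {       /* delay is fine: reset state */
		s->first_above = 0;
		s->dropping = false;
		return false;
	}
	if (s->first_above == 0) {          /* first time above target: arm timer */
		s->first_above = now + s->interval_us;
		return false;
	}
	if (!s->dropping) {
		if (now < s->first_above)       /* above target, not yet for an interval */
			return false;
		s->dropping = true;             /* standing queue: start dropping */
		s->count = 1;
		s->drop_next = control_law(s, now);
		return true;
	}
	if (now >= s->drop_next) {          /* still above target: drop faster */
		s->count++;
		s->drop_next = control_law(s, s->drop_next);
		return true;
	}
	return false;
}
```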

> Now if your host has memory concerns, that's a separate issue: you
> can adjust the qdisc limit, or, if you deal with non-elastic flows,
> add a callback from mm that shrinks the queue in case of memory
> pressure.

So my take on this is that the default limit should be 1k on devices
with less than 256MB of RAM overall, and 10k (or more) elsewhere. This
matches current txqueuelen behavior and is the least surprising.
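The proposed default could be as small as a threshold on total RAM. A minimal sketch, assuming the caller already knows total RAM in bytes (how a kernel would obtain it is left out); the helper and macro names are hypothetical. 1000 is used for "1k" because it is the usual Ethernet txqueuelen default, and 10240 is fq_codel's own default limit:

```c
#include <stdint.h>

#define SMALL_RAM_BYTES (256ULL * 1024 * 1024) /* 256MB threshold from the proposal */
#define LIMIT_SMALL     1000u   /* matches the common Ethernet txqueuelen */
#define LIMIT_LARGE     10240u  /* fq_codel's default packet limit */

/* Hypothetical: pick a default qdisc packet limit from total RAM. */
static uint32_t default_qdisc_limit(uint64_t total_ram_bytes)
{
	return total_ram_bytes < SMALL_RAM_BYTES ? LIMIT_SMALL : LIMIT_LARGE;
}
```

A 64MB router would get 1000, while a 1GB box (or one with exactly 256MB) would get 10240.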

Resizing the queue via a callback from the mm subsystem when it gets
too large strikes me as possibly useful but probably harmful. Wouldn't
it be better to just drop packets?
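fq_codel already takes the "just drop" route at its limit: when the total backlog exceeds `limit`, it drops a packet from the head of the flow with the largest backlog. A toy model of that policy, with flows reduced to packet counters; all names are hypothetical and the drop is a decrement rather than freeing a real packet:

```c
#include <stddef.h>
#include <stdint.h>

#define NFLOWS 8

struct flows {
	uint32_t backlog[NFLOWS]; /* packets queued per flow */
	uint32_t total;           /* sum of backlog[] */
	uint32_t limit;           /* hard packet limit for the whole qdisc */
	uint32_t drops;
};

/* Index of the flow holding the most packets. */
static size_t fattest_flow(const struct flows *f)
{
	size_t fat = 0;
	for (size_t i = 1; i < NFLOWS; i++)
		if (f->backlog[i] > f->backlog[fat])
			fat = i;
	return fat;
}

/* Enqueue one packet on `flow` (< NFLOWS); over the limit, drop from the fat flow. */
static void flows_enqueue(struct flows *f, size_t flow)
{
	f->backlog[flow]++;
	f->total++;
	if (f->total > f->limit) {
		size_t fat = fattest_flow(f);
		f->backlog[fat]--; /* fq_codel would drop that flow's head packet */
		f->total--;
		f->drops++;
	}
}
```

The design point is that the bulk flow pays for the overflow, so a sparse flow arriving at a full queue still gets in.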

There are other patches out there (also used in openwrt) that reduce
memory pressure under load by reducing skb size, and those have also
worked out well. Typically they look like:

 static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
-       if (likely(skb_queue_len(&sch->q) < sch->limit))
+       if (likely(skb_queue_len(&sch->q) < sch->limit)) {
+               if (skb_queue_len(&sch->q) > 128)
+                       skb = skb_reduce_truesize(skb);
                return qdisc_enqueue_tail(skb, sch);
-
+       }

If these were wrapped in a define
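The idea in the diff, shrinking buffers only once the queue is deeper than 128 packets, can be modeled in userspace. This is a hypothetical sketch, not OpenWrt's `skb_reduce_truesize()`: packets arrive in oversized receive buffers, and past the threshold each new packet is copied into a tight allocation so the queue holds less slack memory:

```c
#include <stdlib.h>
#include <string.h>

#define REDUCE_THRESHOLD 128 /* queue depth from the diff */

struct pkt {
	unsigned char *data;
	size_t len;      /* payload bytes */
	size_t truesize; /* bytes actually allocated */
};

/* Copy the payload into a right-sized buffer. Returns 0 on success;
 * on allocation failure the packet is left untouched and -1 returned. */
static int pkt_reduce_truesize(struct pkt *p)
{
	if (p->truesize <= p->len)
		return 0; /* already tight */
	size_t want = p->len ? p->len : 1;
	unsigned char *tight = malloc(want);
	if (!tight)
		return -1;
	memcpy(tight, p->data, p->len);
	free(p->data);
	p->data = tight;
	p->truesize = want;
	return 0;
}

/* Only pay the copy cost when the queue is already long. */
static void maybe_reduce(struct pkt *p, size_t queue_len)
{
	if (queue_len > REDUCE_THRESHOLD)
		(void)pkt_reduce_truesize(p);
}
```

The threshold trades a memcpy per packet for memory, and only when the queue is long enough for the slack to matter.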


>
>>
>> B) people that expect pfifo_fast semantics, for which substituting
>> fq_codel behaves oddly in two ways -
>>
>> 1) if you are explicitly setting skb->priority for the default
>> pfifo_fast 3 bands and expecting a result, nothing happens - but in
>> the general case, people setting skb->priority are trying to get
>> better latency in the first place, and I really don't think many
>> people will notice.

I can also sit down and go through all the various overloaded uses
that skb->priority has which make life really confusing and difficult.
I really don't mind ignoring it entirely by default. :)

>> 2) if you are using a filter on pfifo_fast that expects 3 bands, and
>> end up using fq_codel by default anyway, we get DRR-like behavior over
>> codel rather than strict prioritization and lose fq_codel's full
>> benefits... which is still a win IMHO. I am not fond of being able to
>> starve the other two bands....

>> 3) trying to explicitly set pfifo_fast via tc doesn't work with this patch.
>>
>> 4) ECN processing is enabled by default in fq_codel (but ECN is off by
>> default in the sysctl)
>
> There is no 'one solution fits every needs'.
>
> codel is _not_ a replacement for pfifo_fast, it's a replacement for pfifo.

Semantically, I'm trying to "replace the default qdisc" that 99.98% of
people use, not "replace pfifo_fast" (which 99.99% of people use), or
rather, to come up with a strategy for doing so, one day, in a more
easily deployable fashion.

> If you want to replace pfifo_fast, you want PRIO + 3 codel, because
> pfifo_fast is really PRIO + 3 pfifo.

This is where this dialog died last time. This time however I'm trying
to assemble consensus as to the steps required to build a viable
*default* qdisc that is better than pfifo_fast, for desktops, servers,
android boxes, routers, etc - which fq_codel seems to win at (nearly)
across the board.

Certainly those users that override pfifo_fast should be allowed to
continue to do so.

I agree that a three-tier system on top of fq_codel would be a pure
superset of pfifo_fast, and probably better in a few respects than
pure fq_codel, but I strongly disagree that aping the existing
pfifo_fast PRIO-like behavior is desirable in a replacement for the
default qdisc. Given the actual frequency of prio 1 traffic in your
data, it seems reasonable, but given the actual frequency of prio 3
(background) traffic in mine, completely starving the background
queue in the presence of a prio 1 or 2 flow is highly undesirable.
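The starvation concern shows up even in a toy comparison. Below, band 0 carries a hypothetical flow that never empties, and bands 1 and 2 start with a backlog; the function counts how many background (band 2) packets get sent. Packet-count round robin stands in for DRR with equal-size packets, which is a simplifying assumption:

```c
#include <stdint.h>

/* Background packets sent in `slots` transmit opportunities, under strict
 * priority (round_robin == 0) or round robin across bands (round_robin != 0). */
static uint32_t background_served(int round_robin, uint32_t slots)
{
	uint32_t backlog[3] = { UINT32_MAX, 1000, 1000 }; /* band 0 never empties */
	uint32_t served_bg = 0;
	int next = 0;

	for (uint32_t t = 0; t < slots; t++) {
		int band = -1;
		if (round_robin) {
			for (int k = 0; k < 3; k++) {   /* next non-empty band in turn */
				int b = (next + k) % 3;
				if (backlog[b]) { band = b; break; }
			}
			if (band >= 0)
				next = (band + 1) % 3;
		} else {
			for (int b = 0; b < 3; b++)     /* lowest non-empty band */
				if (backlog[b]) { band = b; break; }
		}
		if (band < 0)
			break;
		if (band != 0)                      /* band 0 is treated as infinite */
			backlog[band]--;
		if (band == 2)
			served_bg++;
	}
	return served_bg;
}
```

Over 300 slots, strict priority sends no background packets at all, while round robin sends 100.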

I'm just repeating my position from last time...

My problems with writing a prio_fq_codel qdisc that is a little more
like pfifo_fast are twofold.

1) the cpu cost of adding a level of drr/sfq on top of fq_codel is
kind of unknown (I imagine sch_prio's cost is trivial)

2) there is a memory hit in adding 3 fq_codel-like queues to every
qdisc. In the case of mq on wifi, you end up with 12 queues per SSID,
and that's a rather huge hit on memory...
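The 12-queues figure is 4 hardware queues (one per WMM access category, as mq typically exposes on wifi) times 3 tiers. A back-of-envelope sketch of the resulting memory; 1024 flows per instance is fq_codel's default flow count, but the bytes-per-flow value is an assumed parameter, not a measured kernel figure:

```c
#include <stddef.h>

/* Memory for per-flow state across every tiered queue instance. */
static size_t tiered_mq_bytes(size_t hw_queues, size_t tiers,
                              size_t flows_per_tier, size_t bytes_per_flow,
                              size_t ssids)
{
	size_t instances = hw_queues * tiers * ssids; /* e.g. 4 x 3 = 12 per SSID */
	return instances * flows_per_tier * bytes_per_flow;
}
```

With an assumed 64 bytes per flow, `tiered_mq_bytes(4, 3, 1024, 64, 1)` is 786,432 bytes (768 KiB) per SSID before a single packet is queued, and it doubles with a second SSID.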

I kind of hope that we are in agreement, at least, that it would be
nice if pfifo_fast went the way of the dodo?

So based on this dialog here, and over on lwn.net
(http://lwn.net/Articles/558603/)

I think the beginnings of a way forward would be for me to

A) change my existing patch converting all instances of pfifo_fast to
fq_codel to be a configurable define instead (CONFIG_QDISC_DEFAULT)
that can be set more easily in a distro than dealing with tc directly,
which makes it vastly easier to apply to mq devices by default, in
particular.

B) Work on some sort of fq_codel derivative that has the desired 3
tier behavior. The simplest would be something that has a single queue
for prio 1 and a single queue for background, and services the fast
queue first; as for the background queue, well, darn it, I dunno.

C) come up with more ways of meeting the memory needs of both teeny
and ginormous devices, as per above
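Proposal A amounts to choosing the default at build time rather than hard-coding pfifo_fast. A hypothetical sketch: the CONFIG_QDISC_DEFAULT name comes from the proposal, while the fallback and the helpers are illustrative only. A distro build might pass something like `-DCONFIG_QDISC_DEFAULT='"fq_codel"'`:

```c
#include <stdbool.h>
#include <string.h>

/* Build-time default qdisc; falls back to today's behavior if unset. */
#ifndef CONFIG_QDISC_DEFAULT
#define CONFIG_QDISC_DEFAULT "pfifo_fast"
#endif

/* Qdisc to attach to hardware queue `txq` of an mq device. */
static const char *default_qdisc_for_txq(unsigned int txq)
{
	(void)txq; /* same default on every queue in this sketch */
	return CONFIG_QDISC_DEFAULT;
}

static bool default_is(const char *name)
{
	return strcmp(CONFIG_QDISC_DEFAULT, name) == 0;
}
```

Because every hardware queue picks up the same compile-time default, mq devices get the new qdisc without anyone running tc per queue.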




-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html


