From: Adrian Popescu
To: Dave Taht
Cc: Cake List
Date: Mon, 18 Feb 2019 22:42:56 +0200
Subject: Re: [Cake] Dropping dropped
List-Id: Cake - FQ_codel the next generation
Hello,

This answers some of my own questions.

It seems the mirred and ifb combination is indeed what reduces performance
in my case. None of the optimizations made to fq_codel helped with ingress.

A simple fq_police would be a better solution for ingress than cake or
fq_codel.
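For context, the mirred + ifb ingress path being blamed here is typically set up like this (a sketch only; the interface names and the shaped rate are illustrative):

```shell
# Redirect ingress traffic from eth0 into an ifb device so a shaping
# qdisc can be attached to it; this extra redirect hop is the CPU cost
# under discussion.
ip link add ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root cake bandwidth 100Mbit
```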
On Sat, Feb 16, 2019 at 11:35 AM Adrian Popescu <adriannnpopescu@gmail.com> wrote:
> Hello,
>
> On Fri, Feb 15, 2019 at 10:45 PM Dave Taht <dave.taht@gmail.com> wrote:
>> I still regard inbound shaping as our biggest deployment problem,
>> especially on cheap hardware.
>>
>> Some days I want to go back to revisiting the ideas in the "bobbie"
>> shaper, other days...
>>
>> In terms of speeding up cake:
>>
>> * At higher speeds (e.g. > 200mbit) cake tends to bottleneck on a
>> single cpu, in softirq. A lwn article just went by about a proposed
>> set of improvements for that:
>> https://lwn.net/SubscriberLink/779738/771e8f7050c26ade/
>
> Will this help devices with a single core CPU?

>> * Hardware multiqueue is more and more common (APU2 has 4). FQ_codel
>> is inherently parallel and could take advantage of hardware
>> multiqueue, if there was a better way to express it. What happens
>> nowadays is you get the "mq" scheduler with 4 fq_codel instances when
>> running at line rate, but I tend to think with 64 hardware queues,
>> increasingly common in the >10GigE space, having 64k fq_codel queues
>> is excessive. I'd love it if there was a way to have a divisor in the
>> mq -> subqdisc code so that we would have, oh, 32 queues per hw queue
>> in this case.
>>
>> Worse, there's no way to attach a global shaped instance to that
>> hardware, e.g. in cake, which forces all those hardware queues (even
>> across cpus) into one. The ingress mirred code, here, is also a
>> problem. A "cake-mq" seemed feasible (basically you just turn the
>> shaper tracking into an atomic operation in three places), but the
>> overlying qdisc architecture for sch_mq -> subqdiscs has to be
>> extended or bypassed, somehow. (There's no way for sch_mq to
>> automagically pass sub-qdisc options to the next qdisc, and there's no
>> reason to have sch_mq

The problem I dea= l with is performance on even lower end hardware with a single queue. My ex= perience with mq has been limited.
=C2=A0
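For reference, the mq + fq_codel arrangement described above can be reproduced by hand roughly like this (a sketch; the queue count depends on the NIC, 4 is illustrative):

```shell
# Root mq exposes one class per hardware tx queue; attach a separate
# fq_codel instance to each of them.
tc qdisc replace dev eth0 root handle 1: mq
for i in 1 2 3 4; do
    tc qdisc replace dev eth0 parent 1:$i fq_codel
done
# Inspect the resulting per-queue qdisc tree and statistics.
tc -s qdisc show dev eth0
```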

>> * I really liked the ingress "skb list" rework, but I'm not sure how
>> to get that from A to B.
>
> What was this skb list rework? Is there a patch somewhere?

>> * and I have a long standing dream of being able to kill off mirred
>> entirely and just be able to write
>>
>> tc qdisc add dev eth0 ingress cake bandwidth X
>
> Ingress on its own seems to be a performance hit. Do you think this
> would reduce the performance hit?

>> * native codel is 32 bit, cake is 64 bit. I
>
> Was there something else you forgot to write here?

>> * hashing three times as cake does is expensive. Getting a partial
>> hash and combining it into a final would be faster.
>
> Could you elaborate on how this would look, please? I read the code a
> while ago. It might be that I didn't figure out all the places where
> hashing is done.

>> * 8 way set associative is slower than 4 way and almost
>> indistinguishable from 8. Even direct mapping
>
> This should be easy to address by changing the 8 ways to 4. Was there
> something else you wanted to write here?
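The set-associative lookup under discussion, sketched at 4 ways (the table sizes and the "tag 0 means empty" convention are illustrative, not cake's actual layout):

```c
#include <stdint.h>

#define SETS 256   /* buckets; a power of two so masking works */
#define WAYS 4     /* tags probed per lookup */

/* Tag 0 is reserved to mean "empty slot" in this sketch. */
static uint32_t tags[SETS][WAYS];

/* Return the way holding `tag`, claiming an empty way on a miss;
 * -1 when all ways in the set are taken (hash collision). */
static int flow_lookup(uint32_t tag)
{
    uint32_t set = tag & (SETS - 1);
    for (int way = 0; way < WAYS; way++) {
        if (tags[set][way] == tag)
            return way;              /* hit */
        if (tags[set][way] == 0) {
            tags[set][way] = tag;    /* miss: take the empty slot */
            return way;
        }
    }
    return -1;                       /* set exhausted */
}
```

Cutting WAYS from 8 to 4 halves the worst-case tag probes per packet, which is the change being suggested above.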

>> * The cake blue code is rarely triggered and inline
>>
>> I really did want cake to be faster than htb+fq_codel. I started a
>> project to basically resurrect "early cake" - which WAS 40% faster
>> than htb+fq_codel - and add in the idea *only* of an atomic builtin
>> hw-mq shaper a while back, but haven't got back to it.
>>
>> https://github.com/dtaht/fq_codel_fast
>>
>> With everything I ripped out, it was about 5% less cpu to start with.
>
> Perhaps further improvements made to the codel_vars struct will also
> help fq_codel_fast. Do you think this could be improved further?
>
> A cake_fast might be worth a shot.

>> I can't tell you how many times I've looked over
>>
>> https://elixir.bootlin.com/linux/latest/source/net/sched/sch_mqprio.c
>>
>> hoping that enlightenment would strike and there was a clean way to
>> get rid of that layer of abstraction.
>>
>> But coming up with how to run more stuff in parallel was beyond my
>> rcu-foo.