* [Cake] Dropping dropped
From: Adrian Popescu @ 2019-02-14 14:01 UTC
To: Cake List
Hello,
I've taken a look at cake's source code to see what simple changes could be
made to attempt to speed it up. There seemed to be a per flow variable
called dropped which might not be that useful for regular users. The
attached patch removes it.
Perhaps cake could be optimized further for slow devices. What's the
recommended solution for profiling the kernel on mips with openwrt?
[-- Attachment #2: cake-remove-dropped.patch --]
diff --git a/sch_cake.c b/sch_cake.c
index 3a26db0..d1ea1d6 100644
--- a/sch_cake.c
+++ b/sch_cake.c
@@ -136,7 +136,6 @@ struct cake_flow {
struct sk_buff *tail;
struct list_head flowchain;
s32 deficit;
- u32 dropped;
struct cobalt_vars cvars;
u16 srchost; /* index into cake_host table */
u16 dsthost;
@@ -1594,7 +1593,6 @@ static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
sch->qstats.backlog -= len;
qdisc_tree_reduce_backlog(sch, 1, len);
- flow->dropped++;
b->tin_dropped++;
sch->qstats.drops++;
@@ -2191,7 +2189,6 @@ retry:
flow->deficit -= len;
b->tin_deficit -= len;
}
- flow->dropped++;
b->tin_dropped++;
qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
#if LINUX_VERSION_CODE < KERNEL_VERSION(4, 8, 0)
@@ -3088,7 +3085,6 @@ static int cake_dump_class_stats(struct Qdisc *sch, unsigned long cl,
cake_maybe_unlock(sch);
}
qs.backlog = b->backlogs[idx % CAKE_QUEUES];
- qs.drops = flow->dropped;
}
if (gnet_stats_copy_queue(d, NULL, &qs, qs.qlen) < 0)
return -1;
* Re: [Cake] Dropping dropped
From: Toke Høiland-Jørgensen @ 2019-02-14 14:35 UTC
To: Adrian Popescu, Cake List
Adrian Popescu <adriannnpopescu@gmail.com> writes:
> Hello,
>
> I've taken a look at cake's source code to see what simple changes could be
> made to attempt to speed it up. There seemed to be a per flow variable
> called dropped which might not be that useful for regular users. The
> attached patch removes it.
I appreciate the sentiment; however, we keep a lot of statistics for a
reason, and we don't want to just drop those. Besides, a single variable
increment is probably not going to make much difference for performance.
> Perhaps cake could be optimized further for slow devices. What's the
> recommended solution for profiling the kernel on mips with openwrt?
No doubt; however, I'd suggest going about it by actually measuring any
bottle-necks and working from there. The 'perf' tool works on openwrt
these days, I think; that would probably be a good place to start.
-Toke
* Re: [Cake] Dropping dropped
From: Adrian Popescu @ 2019-02-15 8:23 UTC
To: Toke Høiland-Jørgensen; +Cc: Cake List
On Thu, Feb 14, 2019 at 4:35 PM Toke Høiland-Jørgensen <toke@redhat.com>
wrote:
> Adrian Popescu <adriannnpopescu@gmail.com> writes:
>
> > Hello,
> >
> > I've taken a look at cake's source code to see what simple changes could
> be
> > made to attempt to speed it up. There seemed to be a per flow variable
> > called dropped which might not be that useful for regular users. The
> > attached patch removes it.
>
> I appreciate the sentiment; however, we keep a lot of statistics for a
> reason, and we don't want to just drop those. Besides, a single variable
> increment is probably not going to make much difference for performance.
>
Perhaps it seemed that I implied this was more than an experiment or
something meant to be merged. Is sharing a patch to start a conversation
frowned upon?
The point is that low-end routers can't run cake at high speeds. It doesn't
make sense to buy more expensive hardware for home networks.
>
> > Perhaps cake could be optimized further for slow devices. What's the
> > recommended solution for profiling the kernel on mips with openwrt?
>
> No doubt; however, I'd suggest going about it by actually measuring any
> bottle-necks and working from there. The 'perf' tool works on openwrt
> these days, I think; that would probably be a good place to start.
>
Thanks.
>
> -Toke
>
* Re: [Cake] Dropping dropped
From: Toke Høiland-Jørgensen @ 2019-02-15 10:55 UTC
To: Adrian Popescu; +Cc: Cake List
Adrian Popescu <adriannnpopescu@gmail.com> writes:
> On Thu, Feb 14, 2019 at 4:35 PM Toke Høiland-Jørgensen <toke@redhat.com>
> wrote:
>
>> Adrian Popescu <adriannnpopescu@gmail.com> writes:
>>
>> > Hello,
>> >
>> > I've taken a look at cake's source code to see what simple changes could
>> be
>> > made to attempt to speed it up. There seemed to be a per flow variable
>> > called dropped which might not be that useful for regular users. The
>> > attached patch removes it.
>>
>> I appreciate the sentiment; however, we keep a lot of statistics for a
>> reason, and we don't want to just drop those. Besides, a single variable
>> increment is probably not going to make much difference for performance.
>>
>
> Perhaps it seemed that I implied this was more than an experiment or
> something meant to be merged. Is sharing a patch to start a conversation
> frowned upon?
Not at all. Just pointing out why this particular approach is probably
not going to get you that far :)
> The point is that low-end routers can't run cake at high speeds. It
> doesn't make sense to buy more expensive hardware for home networks.
Yeah, I'm aware of the issue. However, I'm not too hopeful that it will
be possible to squeeze significantly more performance out of the old
hardware, sadly. I'd love to be proven wrong, though!
-Toke
* Re: [Cake] Dropping dropped
From: Dave Taht @ 2019-02-15 20:45 UTC
To: Adrian Popescu; +Cc: Cake List
I still regard inbound shaping as our biggest deployment problem,
especially on cheap hardware.
Some days I want to go back to revisiting the ideas in the "bobbie"
shaper, other days...
In terms of speeding up cake:
* At higher speeds (e.g. > 200mbit) cake tends to bottleneck on a
single cpu, in softirq. An LWN article just went by about a proposed
set of improvements for that:
https://lwn.net/SubscriberLink/779738/771e8f7050c26ade/
* Hardware multiqueue is more and more common (APU2 has 4). FQ_codel
is inherently parallel and could take advantage of hardware
multiqueue, if there was a better way to express it. What happens
nowadays is you get the "mq" scheduler with 4 fq_codel instances, when
running at line rate, but I tend to think with 64 hardware queues,
increasingly common in the >10GigE, having 64k fq_codel queues is
excessive. I'd love it if there was a way to have there be a divisor
in the mq -> subqdisc code so that we would have, oh, 32 queues per hw
queue in this case.
Worse, there's no way to attach a global shaped instance to that
hardware, e.g. in cake, which forces all those hardware queues (even
across cpus) into one. The ingress mirred code, here, is also a
problem. A "cake-mq" seemed feasible (basically you just turn the
shaper tracking into an atomic operation in three places; see the
sketch after this list), but the overlying qdisc architecture for
sch_mq -> subqdiscs has to be extended or bypassed, somehow. (There's
no way for sch_mq to automagically pass sub-qdisc options to the next
qdisc, and there's no reason to have sch_mq...)
* I really liked the ingress "skb list" rework, but I'm not sure how
to get that from A to B.
* and I have a long standing dream of being able to kill off mirred
entirely and just be able to write
tc qdisc add dev eth0 ingress cake bandwidth X
* native codel is 32 bit, cake is 64 bit. I
* hashing three times as cake does is expensive. Getting a partial
hash and combining it into a final would be faster.
* 8 way set associative is slower than 4 way and almost
indistinguishable from 8. Even direct mapping
* The cake blue code is rarely triggered and inline
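For illustration only, here is a minimal standalone sketch of what "turn
the shaper tracking into an atomic operation" could look like, so that
several hardware queues can charge one shared virtual clock without a
common qdisc lock. This is not cake code and not a patch; the struct,
field names and units below are assumptions.

/*
 * Sketch of a shared shaper "virtual clock" advanced by compare-and-swap,
 * so multiple hardware queues can honour one rate limit without a lock.
 * Illustrative only; names, units and API are assumptions.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct shared_shaper {
    _Atomic uint64_t time_next_packet;   /* ns timestamp of the next send slot */
    uint64_t rate_ns_per_byte;           /* serialization delay per byte */
};

/* Returns 0 if the packet may go now, else nanoseconds to wait. */
static uint64_t shaper_charge(struct shared_shaper *s, uint64_t now_ns,
                              uint32_t pkt_len)
{
    uint64_t next = atomic_load_explicit(&s->time_next_packet,
                                         memory_order_relaxed);
    for (;;) {
        if (next > now_ns)
            return next - now_ns;        /* shaper says: not yet */

        /* Claim this slot and push the clock forward by this packet's
         * serialization time. */
        uint64_t new_next = now_ns + (uint64_t)pkt_len * s->rate_ns_per_byte;

        if (atomic_compare_exchange_weak_explicit(&s->time_next_packet,
                                                  &next, new_next,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return 0;                    /* we won the slot: send */
        /* Another queue advanced the clock first; the failed CAS reloaded
         * 'next', so simply retry. */
    }
}

int main(void)
{
    /* 80 ns per byte is roughly 100 Mbit/s. */
    struct shared_shaper s = { .time_next_packet = 0, .rate_ns_per_byte = 80 };
    uint64_t now = 1000000;

    printf("wait %llu ns\n", (unsigned long long)shaper_charge(&s, now, 1500));
    printf("wait %llu ns\n", (unsigned long long)shaper_charge(&s, now, 1500));
    return 0;
}

In a real cake-mq the same compare-and-swap would have to wrap the actual
shaper arithmetic, and contention between hardware queues is precisely the
part that would need measuring.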
I really did want cake to be faster than htb+fq_codel. I started a
project to basically resurrect "early cake" - which WAS 40% faster
than htb+fq_codel - and add in the idea *only* of an atomic builtin
hw-mq shaper a while back, but haven't got back to it.
https://github.com/dtaht/fq_codel_fast
with everything I ripped out in that it was about 5% less cpu to start with.
I can't tell you how many times I've looked over
https://elixir.bootlin.com/linux/latest/source/net/sched/sch_mqprio.c
hoping that enlightenment would strike and there was a clean way to get
rid of that layer of abstraction.
But coming up with how to run more stuff in parallel was beyond my rcu-foo.
* Re: [Cake] Dropping dropped
From: Adrian Popescu @ 2019-02-16 9:35 UTC
To: Dave Taht; +Cc: Cake List
Hello,
On Fri, Feb 15, 2019 at 10:45 PM Dave Taht <dave.taht@gmail.com> wrote:
> I still regard inbound shaping as our biggest deployment problem,
> especially on cheap hardware.
>
> Some days I want to go back to revisiting the ideas in the "bobbie"
> shaper, other days...
>
> In terms of speeding up cake:
>
> * At higher speeds (e.g. > 200mbit) cake tends to bottleneck on a
> single cpu, in softirq. An LWN article just went by about a proposed
> set of improvements for that:
> https://lwn.net/SubscriberLink/779738/771e8f7050c26ade/
Will this help devices with a single core CPU?
>
>
> * Hardware multiqueue is more and more common (APU2 has 4). FQ_codel
> is inherently parallel and could take advantage of hardware
> multiqueue, if there was a better way to express it. What happens
> nowadays is you get the "mq" scheduler with 4 fq_codel instances, when
> running at line rate, but I tend to think with 64 hardware queues,
> increasingly common in the >10GigE, having 64k fq_codel queues is
> excessive. I'd love it if there was a way to have there be a divisor
> in the mq -> subqdisc code so that we would have, oh, 32 queues per hw
> queue in this case.
>
> Worse, there's no way to attach a global shaped instance to that
> hardware, e.g. in cake, which forces all those hardware queues (even
> across cpus) into one. The ingress mirred code, here, is also a
> problem. a "cake-mq" seemed feasible (basically you just turn the
> shaper tracking into an atomic operation in three places), but the
> overlying qdisc architecture for sch_mq -> subqdiscs has to be
> extended or bypassed, somehow. (there's no way for sch_mq to
> automagically pass sub-qdisc options to the next qdisc, and there's no
> reason to have sch_mq
>
The problem I deal with is performance on even lower end hardware with a
single queue. My experience with mq has been limited.
>
> * I really liked the ingress "skb list" rework, but I'm not sure how
> to get that from A to B.
>
What was this skb list rework? Is there a patch somewhere?
>
> * and I have a long standing dream of being able to kill off mirred
> entirely and just be able to write
>
> tc qdisc add dev eth0 ingress cake bandwidth X
>
Ingress on its own seems to be a performance hit. Do you think this would
reduce that hit?
>
> * native codel is 32 bit, cake is 64 bit. I
>
Was there something else you forgot to write here?
>
> * hashing three times as cake does is expensive. Getting a partial
> hash and combining it into a final would be faster.
>
Could you elaborate on how this would look, please? I read the code a while
ago. It might be that I didn't figure out all the places where hashing is
done.
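Purely as an illustration of the general shape (a standalone sketch with a
made-up mixing helper - not cake's code, and not necessarily what was meant
above): hash each piece of the tuple once, then build the host hashes and
the flow hash by combining the partial results.

/*
 * Standalone illustration of "hash the pieces once and combine", instead
 * of running three full hashes over overlapping data. The mixer and the
 * struct below are made-up stand-ins.
 */
#include <stdint.h>
#include <stdio.h>

struct flow_key {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
    uint8_t  proto;
};

/* Cheap two-word mixer in the spirit of the kernel's jhash_2words(). */
static uint32_t mix2(uint32_t a, uint32_t b)
{
    a *= 0x9e3779b1u;            /* golden-ratio style multipliers */
    b *= 0x85ebca77u;
    a ^= (b >> 15) ^ (b << 17);
    return a ^ (a >> 13);
}

static void hash_flow(const struct flow_key *k,
                      uint32_t *flow_hash,   /* full 5-tuple */
                      uint32_t *src_hash,    /* source host only */
                      uint32_t *dst_hash)    /* destination host only */
{
    /* Each input word is hashed once... */
    uint32_t hs = mix2(k->saddr, 0);
    uint32_t hd = mix2(k->daddr, 0);
    uint32_t hp = mix2(((uint32_t)k->sport << 16) | k->dport, k->proto);

    /* ...and the per-purpose hashes are just combinations of the parts. */
    *src_hash  = hs;
    *dst_hash  = hd;
    *flow_hash = mix2(mix2(hs, hd), hp);
}

int main(void)
{
    struct flow_key k = { 0x0a000001, 0xc0a80101, 443, 51000, 6 };
    uint32_t f, s, d;

    hash_flow(&k, &f, &s, &d);
    printf("flow %08x src %08x dst %08x\n", f, s, d);
    return 0;
}

Whether folding the hashes this way actually saves measurable cycles on a
small MIPS core is something perf would have to confirm.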
>
> * 8 way set associative is slower than 4 way and almost
> indistinguishable from 8. Even direct mapping
>
This should be easy to address by changing the 8 ways to 4. Was there
something else you wanted to write here?
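For reference, the knob being discussed is roughly the WAYS constant in a
lookup of the following shape (a generic standalone sketch, not cake's
actual collision handling - a real table wouldn't use 0 as the "empty"
marker, either); the associativity mainly changes how many slots are probed
per packet.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SETS 256                 /* number of sets, a power of two */
#define WAYS 4                   /* 4-way; change to 8 for 8-way: more probes */

struct slot {
    uint32_t tag;                /* stored flow hash, 0 means "empty" here */
    uint32_t backlog;            /* stand-in for whatever per-flow state exists */
};

static struct slot table[SETS][WAYS];

/* Find the slot for a flow hash, allocating (or crudely evicting) on a miss. */
static struct slot *flow_lookup(uint32_t hash)
{
    struct slot *set = table[hash & (SETS - 1)];
    int i;

    for (i = 0; i < WAYS; i++)   /* lookup cost grows with WAYS */
        if (set[i].tag == hash)
            return &set[i];      /* hit */

    for (i = 0; i < WAYS; i++)
        if (set[i].tag == 0) {   /* empty slot: take it */
            set[i].tag = hash;
            return &set[i];
        }

    memset(&set[0], 0, sizeof(set[0]));  /* set full: crude eviction */
    set[0].tag = hash;
    return &set[0];
}

int main(void)
{
    flow_lookup(0x12345678)->backlog += 1500;
    printf("backlog %u\n", flow_lookup(0x12345678)->backlog);
    return 0;
}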
>
> * The cake blue code is rarely triggered and inline
>
> I really did want cake to be faster than htb+fq_codel. I started a
> project to basically resurrect "early cake" - which WAS 40% faster
> than htb+fq_codel - and add in the idea *only* of an atomic builtin
> hw-mq shaper a while back, but haven't got back to it.
>
> https://github.com/dtaht/fq_codel_fast
>
> with everything I ripped out in that it was about 5% less cpu to start
> with.
>
Perhaps further improvements to the codel_vars struct will also help
fq_codel_fast. Do you think it could be improved further?
A cake_fast might be worth a shot.
>
> I can't tell you how many times I've looked over
>
> https://elixir.bootlin.com/linux/latest/source/net/sched/sch_mqprio.c
>
> hoping that enlightenment would strike and there was a clean way to get
> rid of that layer of abstraction.
>
> But coming up with how to run more stuff in parallel was beyond my rcu-foo.
>
* Re: [Cake] Dropping dropped
From: Adrian Popescu @ 2019-02-18 20:42 UTC
To: Dave Taht; +Cc: Cake List
Hello,
This answers some of my own questions.
It seems the mirred and ifb combination is indeed what reduces performance
in my case. None of the optimizations made to fq_codel helped with ingress.
A simple fq_police would be a better solution for ingress than cake or
fq_codel.
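No such qdisc exists today, so purely as a thought experiment, one naive
reading of an "fq_police" is per-flow token buckets that drop instead of
queueing, which would need no ifb/mirred redirect at all. A real design
would presumably share one aggregate rate fairly across flows rather than
giving each flow its own bucket the way this standalone sketch does.

#include <stdint.h>
#include <stdio.h>

#define FLOWS 1024

struct flow_bucket {
    uint64_t tokens;             /* bytes this flow may still pass */
    uint64_t last_ns;            /* timestamp of the previous refill */
};

static struct flow_bucket flows[FLOWS];

/* Returns 1 if the packet passes, 0 if it should be dropped (never queued). */
static int fq_police(uint32_t flow_hash, uint32_t pkt_len, uint64_t now_ns,
                     uint64_t rate_bytes_per_sec, uint64_t burst_bytes)
{
    struct flow_bucket *f = &flows[flow_hash % FLOWS];
    uint64_t elapsed = now_ns - f->last_ns;

    /* Cap the idle time so the refill math below can't overflow. */
    if (elapsed > 1000000000ull)
        elapsed = 1000000000ull;

    f->tokens += elapsed * rate_bytes_per_sec / 1000000000ull;
    if (f->tokens > burst_bytes)
        f->tokens = burst_bytes;
    f->last_ns = now_ns;

    if (f->tokens < pkt_len)
        return 0;                /* over rate for this flow: drop */

    f->tokens -= pkt_len;
    return 1;
}

int main(void)
{
    /* 12500000 bytes/s is about 100 Mbit/s per flow, with a 64 KB burst. */
    printf("%d\n", fq_police(0xabcd, 1500, 1000000, 12500000, 65536));
    printf("%d\n", fq_police(0xabcd, 1500, 1000100, 12500000, 65536));
    return 0;
}

Whether dropping without any queueing at all behaves acceptably at these
rates is, of course, exactly what would need testing.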