* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Dave Täht @ 2017-03-02 21:10 UTC (permalink / raw)
To: lede-dev, cake

On 3/2/17 11:51 AM, Stijn Segers wrote:
> Thanks Sebastian, turned out to be a silly syntax error, I have it all
> disabled now. Ethtool -k and ethtool -K printing/requiring different
> stuff doesn't help of course :-)
>
> I re-enabled SQM, will see how that works out with the offloading disabled.

Would be good to know. I lost a bit of sleep lately (given how badly we
got bit by RCU on the ATF front, I worry about cake... but I can't see
how that would break, there.)

In terms of general "why does shaping use so much cpu"... I am keen to
stress that the core fq_codel algorithm is very lightweight and barely
shows up on traces when used without software rate limiting and with
BQL.

You CAN see a difference in forwarding performance at really high native
rates if you use pfifo and compare it to fq_codel on some platforms -
pfifo-fast is simpler overall. To experiment, you can re-enable
pfifo-fast in scenarios if you want (tc qdisc add dev whatever pfifo
limit somethingsane, or bfifo somethingsane)... however, things like nat
and firewall rules tend to dominate the forwarding costs, fq_codel
reduces latency greatly over pfifo, and the principal use of fq_codel is
for sqm (and now wifi).

As for software rate shaping - this is very cpu intensive no matter how
you do it. I wish we didn't have to do it - and with certain (mostly old
DSL) modems that do flow control, you don't. The only one I know of that
gets this right is the transverse geode that david woodhouse has. One of
my disappointments across the industry is not seeing BQL roll out
universally on any dsl firmwares, starting, oh, 5 years ago.
If we had ethernet devices with a programmable timer (only interrupt me
at a 40mbit rate) we could also completely eliminate software rate
shaping.... anyway, my benchmarks are showing that:

cake in its "besteffort" mode smokes HTB + fq_codel, affording over 40%
more headroom in terms of cpu with bandwidth. (Independent confirmation
across more cpu types is needed.)

In the default mode, with the new 3-tier classification, wash, nat and
triple-isolate/dual-host/dual-src features - which we hope are going to
help folk deal with torrent better in particular - it's a wash. cake is
a LOT more cpu intense than fq_codel is, especially in its default
modes, which it makes up for by being more unified. Mostly.

If you are running low on cpu and are trying to shape inbound on most of
these low-end mips devices to speeds > 60Mbits, I'd highly recommend
switching to "besteffort" rather than the 3-queue QoS default. Most ISPs
are not classifying traffic well anyway, and FQ solves nearly
everything, especially per-host fq....

But none of what I just said applies if there's a bug somewhere else!
GRO has given me fits for years now, and I'm scarred by that.

In terms of cpu costs in cake/fq_codel - dequeue, hashing, and
timestamping show up most on a trace. The rate-limiting effort where all
that is happening shows up as softirq dominating the platform. I have
*always* worried that there exist devices (particularly multi-cores)
without a first-class high-speed internal clock facility, but thus far
haven't had an issue with it (unlike on BSD, which has internal timings
good to only a ms).

As for speeding up hashing, I've been looking over various algorithms to
do that for years now; I'm open to suggestions. The fastest new ones
tend to depend on co-processor support. The fastest I've seen relies on
the CRC32 instruction, which is only in some Intel platforms.
Cake could certainly use a big round of profiling, but it is generally
my hope that we won big with it in its present form.

I welcome (especially flent) benchmarks of sqm on various architectures
we've not explored fully - notably arm ones.

My hat is off to all that have worked so hard to make this subsystem -
and also all of lede - work so well, in this release.

> Cheers
>
> Stijn

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: John Yates @ 2017-03-02 23:16 UTC (permalink / raw)
To: Dave Täht; +Cc: lede-dev, cake

On Thu, Mar 2, 2017 at 4:10 PM, Dave Täht <dave@taht.net> wrote:
> As for speeding up hashing, I've been looking over various algorithms to
> do that for years now, I'm open to suggestions. The fastest new ones
> tend to depend on co-processor support. The fastest I've seen relies on
> the CRC32 instruction which is only in some intel platforms.

This is an area where I have a fair amount of experience... What are the
requirements for this hashing function?

- How much data is being hashed? I am guessing a limited number of bytes
  rather than an entire packet payload.
- What is the typical number of hash table buckets? Is it prime or a
  power of 2?

/john
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 0:00 UTC (permalink / raw)
To: John Yates; +Cc: Dave Täht, cake, lede-dev

> On 3 Mar, 2017, at 01:16, John Yates <john@yates-sheets.org> wrote:
>
> What are the requirements for this hashing function?
> - How much data is being hashed? I am guessing a limited number of
>   bytes rather than an entire packet payload.

Generally it’s what we call the “5-tuple”: two addresses (which could be
IPv4, IPv6, or potentially Ethernet MAC), two port numbers (16 bits
each), and a transport protocol ID (1 byte).

In Cake, the hash function is by default run three times, in order to
get separate hashes over just the source address and just the
destination address, as well as the full 5-tuple. These are necessary to
operate the triple-isolate algorithm. There may be an opportunity for
optimisation by producing all three hashes in parallel.

> - What is the typical number of hash table buckets? Is it prime or a
>   power of 2?

It’s a power of two, yes. The actual number of buckets is 1024, but Cake
uses the full 32-bit hash as a “tag” for hash collision detection
without having to store and compare the entire 5-tuple.

 - Jonathan Morton
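[Editor's note: the scheme Jonathan describes - three hashes (source
host, destination host, full 5-tuple) reduced to 1024 buckets, with the
full 32-bit value kept as a collision-detection tag - can be sketched
roughly as follows. This is a hypothetical Python illustration of the
idea, not sch_cake's kernel code (which hashes dissected flow keys with
the kernel's jhash); CRC32 here is only a stand-in 32-bit hash, and the
field layout is illustrative.]

```python
import zlib

BUCKETS = 1024  # power of two, so reduction to a bucket index is one mask

def hash32(data: bytes) -> int:
    # Stand-in 32-bit hash; the real qdisc uses the kernel's jhash.
    return zlib.crc32(data) & 0xFFFFFFFF

def flow_hashes(src, dst, sport, dport, proto):
    """Return (src_host, dst_host, flow) hashes, as triple-isolate needs."""
    five_tuple = (src + dst
                  + sport.to_bytes(2, "big")
                  + dport.to_bytes(2, "big")
                  + bytes([proto]))
    return hash32(src), hash32(dst), hash32(five_tuple)

def bucket_and_tag(h):
    # Low bits pick the bucket; the full 32-bit hash is stored as a "tag"
    # so a collision can be detected without re-comparing the 5-tuple.
    return h & (BUCKETS - 1), h
```

Running the hash three times, as noted above, is pure overhead unless
the three results can be produced in one pass or in parallel.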
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: John Yates @ 2017-03-02 23:55 UTC (permalink / raw)
To: Dave Täht; +Cc: lede-dev, cake

On Thu, Mar 2, 2017 at 4:10 PM, Dave Täht <dave@taht.net> wrote:
> As for speeding up hashing, I've been looking over various algorithms to
> do that for years now, I'm open to suggestions. The fastest new ones
> tend to depend on co-processor support. The fastest I've seen relies on
> the CRC32 instruction which is only in some intel platforms.

This is an area where I have a fair amount of experience. It is a
misconception that CRC is a good hash function. It is good at detecting
errors but has poor avalanche performance.

What are the requirements for this hashing function?

- How much data is being hashed? (I would guess a limited number of
  bytes rather than an entire packet payload.)
- What is the typical number of hash table buckets? Must it be a power
  of 2? Or are you willing to make it a prime number?

Assuming you can afford a 1KB lookup table I would suggest the SBox hash
in figure four of this article:

http://papa.bretmulvey.com/post/124028832958/hash-functions

The virtue of a prime number of buckets is that when you mod your 32-bit
hash value to get a bucket index you harvest _all_ of the entropy in the
hash, not just the entropy in the bits you preserve.

/john
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 0:02 UTC (permalink / raw)
To: John Yates; +Cc: Dave Täht, cake, lede-dev

> On 3 Mar, 2017, at 01:55, John Yates <john@yates-sheets.org> wrote:
>
> The virtue of a prime number of buckets is that when you mod
> your 32-bit hash value to get a bucket index you harvest _all_
> of the entropy in the hash, not just the entropy in the bits you
> preserve.

True, but you incur the cost of a division, which is very much
non-trivial on ARM CPUs, which are increasingly common in CPE.

 - Jonathan Morton
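[Editor's note: the trade-off in this exchange is easy to demonstrate -
reducing a 32-bit hash with a prime modulus uses all of its bits but
costs a hardware divide, while a power-of-two table masks off the high
bits in one cycle. A small hypothetical Python sketch:]

```python
PRIME_BUCKETS = 1021        # largest prime below 1024
POW2_BUCKETS = 1024

def bucket_prime(h):
    return h % PRIME_BUCKETS        # harvests all 32 bits, needs a divide

def bucket_pow2(h):
    return h & (POW2_BUCKETS - 1)   # one-cycle mask, ignores the high bits
```

Two hashes differing only in their high bits (e.g. 0x00000123 and
0xABCD0123) collide under the mask but not under the prime modulus -
which is exactly the entropy John is describing, and exactly the divide
Jonathan wants to avoid on ARM.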
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Eric Luehrsen @ 2017-03-03 4:31 UTC (permalink / raw)
To: Jonathan Morton, John Yates; +Cc: cake, lede-dev, Dave Täht

On 03/02/2017 07:02 PM, Jonathan Morton wrote:
>> On 3 Mar, 2017, at 01:55, John Yates <john@yates-sheets.org> wrote:
>>
>> The virtue of a prime number of buckets is that when you mod
>> your 32-bit hash value to get a bucket index you harvest _all_
>> of the entropy in the hash, not just the entropy in the bits you
>> preserve.
>
> True, but you incur the cost of a division, which is very much
> non-trivial on ARM CPUs, which are increasingly common in CPE.
>
> - Jonathan Morton

Also, with SQM you may not want idealized entropy in your queue
distribution. It is desired by some to have host-connection fairness,
and not so much interest in stream-type fairness. So overlap in a few
hash "tags" may not always be such a bad thing, depending on how it
works itself out.
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 4:35 UTC (permalink / raw)
To: Eric Luehrsen; +Cc: John Yates, cake, lede-dev, Dave Täht

> On 3 Mar, 2017, at 06:31, Eric Luehrsen <ericluehrsen@hotmail.com> wrote:
>
> Also with SQM you may not want idealized entropy in your queue
> distribution. It is desired by some to have host-connection fairness,
> and not so much interest in stream-type fairness. So overlap in a few
> hash "tags" may not be always such a bad thing depending on how it works
> itself out.

That sort of thing is explicitly catered for by the triple-isolate
algorithm. I don’t want to rely on particular hash behaviour to achieve
an inferior result. I’d much rather have a good hash with maximal
entropy.

 - Jonathan Morton
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Eric Luehrsen @ 2017-03-03 5:00 UTC (permalink / raw)
To: Jonathan Morton; +Cc: John Yates, cake, lede-dev, Dave Täht

On 03/02/2017 11:35 PM, Jonathan Morton wrote:
> That sort of thing is explicitly catered for by the triple-isolate
> algorithm. I don’t want to rely on particular hash behaviour to achieve
> an inferior result. I’d much rather have a good hash with maximal
> entropy.
>
> - Jonathan Morton

That's not what I was going for. Agree, it would not be good to depend
on an inferior hash. You mentioned divide as a "cost." So I was
proposing a thought around a "benefit" estimate. If hash collisions are
not as important (or are they), then what is "benefit / cost?"
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 5:49 UTC (permalink / raw)
To: Eric Luehrsen; +Cc: John Yates, cake, lede-dev, Dave Täht

> On 3 Mar, 2017, at 07:00, Eric Luehrsen <ericluehrsen@hotmail.com> wrote:
>
> That's not what I was going for. Agree, it would not be good to depend
> on an inferior hash. You mentioned divide as a "cost." So I was
> proposing a thought around a "benefit" estimate. If hash collisions are
> not as important (or are they), then what is "benefit / cost?"

The computational cost of one divide is not the only consideration I
have in mind.

Cake’s set-associative hash is fundamentally predicated on the number of
hash buckets *not* being prime, as it requires further decomposing the
hash into a major and minor part when a collision is detected. The minor
part is then iterated to try to locate a matching or free bucket.

This is considerably easier to do and reason about when everything is a
power of two. Then, modulus is a masking operation, and divide is a
shift, either of which can be done in one cycle flat.

AFAIK, however, the main CPU cost of the hash function in Cake is not
the hash itself, but the packet dissection required to obtain the data
it operates on. This is something a profile would shed more light on.

 - Jonathan Morton
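[Editor's note: the major/minor decomposition Jonathan describes can be
sketched like this - a toy Python model of a set-associative table, not
the actual cake_hash() code; the way count and probing order here are
assumptions for illustration:]

```python
WAYS = 8            # buckets probed per set (illustrative, not Cake's value)
BUCKETS = 1024      # everything a power of two: "%" is a mask, "//" a shift

class SetAssocTable:
    def __init__(self):
        self.tags = [None] * BUCKETS  # full 32-bit hash stored as the tag

    def lookup(self, h):
        """Find the bucket already tagged with h, or claim a free one in
        the same set; return None if the whole set is occupied."""
        major = (h // WAYS) % (BUCKETS // WAYS)  # selects the set
        minor = h % WAYS                         # starting way within it
        for i in range(WAYS):
            idx = major * WAYS + (minor + i) % WAYS
            if self.tags[idx] == h:              # existing flow
                return idx
            if self.tags[idx] is None:           # free bucket: claim it
                self.tags[idx] = h
                return idx
        return None                              # genuine collision
```

With powers of two, the modulus and divide above compile to a mask and a
shift; with a prime bucket count the decomposition has no such cheap
form - which is the point of the message above.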
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Dave Taht @ 2017-03-03 6:21 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Eric Luehrsen, cake

As this is devolving into a cake-specific discussion, removing the lede
mailing list.

On Thu, Mar 2, 2017 at 9:49 PM, Jonathan Morton <chromatix99@gmail.com> wrote:
> Cake’s set-associative hash is fundamentally predicated on the number
> of hash buckets *not* being prime, as it requires further decomposing
> the hash into a major and minor part when a collision is detected. The
> minor part is then iterated to try to locate a matching or free bucket.
>
> This is considerably easier to do and reason about when everything is
> a power of two. Then, modulus is a masking operation, and divide is a
> shift, either of which can be done in one cycle flat.
>
> AFAIK, however, the main CPU cost of the hash function in Cake is not
> the hash itself, but the packet dissection required to obtain the data
> it operates on. This is something a profile would shed more light on.

Tried. Mips wasn't a good target.

The jhash3 setup cost is bad, but I agree flow dissection can be deeply
expensive. As well as the other 42+ functions a packet needs to traverse
to get from ingress to egress.

But staying on hashing:

One thing that landed in 4.10? 4.11? was fq_codel relying on a
skb->hash if one already existed (injected already by tcp, or by
hardware, or the tunneling tool). We only need to compute a partial hash
on the smaller subset of keys in that case (if we can rely on the
skb->hash, which we cannot do in the nat case).

Another thing I did, long ago, was read the (60s-era!) literature about
set-associative cpu cache architectures... and...

In all of these cases I really, really wanted to just punt all this
extra work to hardware on ingress - computing 3 hashes can be easily
done in parallel there and appended to the packet as it completes.

I have been working quite a bit more with the arm architecture of late,
and the "perf" profiler over there is vastly better than the mips one
we've had. (and aarch64 is *nice*. So is NEON) - but I hadn't got around
to dinking with cake there until yesterday.

One thing I'm noticing is that even the gigE-capable arms have weak or
non-existent L2 caches, and generally struggle to get past 700Mbits
bidirectionally on the network.

Some quick tests of pfifo vs cake on the "lime-2" (armv7 dual core) are
here:

http://www.taht.net/~d/lime-2/

The rrul tests were not particularly pleasing. [1]

...

A second thing on my mind is to be able to take advantage of A) more
cores... and B) hardware that increasingly has 4 or more lanes in it.

1) Presently, fq_codel's (and cake's) behavior there when set as a
default qdisc is sub-optimal - if you have 64 hardware queues you end up
with 64 instances, each with 1024 queues. While this might be awesome
from a FQ perspective, I really don't think the aqm will be as good. Or
maybe it might be - what happens with 64000 queues at 100Mbit?

2) It's currently impossible to shape network traffic across cores. I'd
like to imagine that with a single atomic exchange or sloppily shared
values, shaping would be feasible. (also, softirq is a single thread, I
believe)

3) mq and mqprio are commonly deployed on the high end for this.

So I've thought about doing up another version - call it - I dunno -
smq - "smart multi-queue" - and seeing how far we could get.

[1] If you are on this list and are not using flent, tough. I'm not
going through the trouble of generating graphs myself anymore.

--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Benjamin Cronce @ 2017-03-06 13:30 UTC (permalink / raw)
To: Dave Taht; +Cc: Jonathan Morton, cake, Eric Luehrsen

On Fri, Mar 3, 2017 at 12:21 AM, Dave Taht <dave.taht@gmail.com> wrote:
> 2) It's currently impossible to shape network traffic across cores. I'd
> like to imagine that with a single atomic exchange or sloppily shared
> values, shaping would be feasible.

When you need to worry about multithreading, many times perfect is very
much the enemy of good. Depending on how quickly you need to make the
network react, you could do something along the lines of a "shared pool"
of bandwidth. Each core gets a split of the bandwidth, any unused
bandwidth can be added to the pool, and cores that want more bandwidth
can take bandwidth from the pool.

You could treat it like task stealing, except each core can generate
tokens that represent a quantum of bandwidth that is only valid for some
interval. If a core suddenly needs bandwidth, it can attempt to "take
back" from its publicly shared pool. If other cores have already
borrowed, it can attempt to borrow from another core. If it can't find
any spare bandwidth, it just waits for some interval related to how long
a quantum is valid, and assumes it's safe.

Or something.. I don't know, it's 7am and I just woke up.
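[Editor's note: the accounting Benjamin sketches in words - per-core
shares plus a shared spare pool - might look something like this in
skeleton form. This is a hypothetical single-threaded Python sketch; a
real implementation would need per-CPU data and atomics, and the
quantum-expiry/interval logic he mentions is omitted:]

```python
class BandwidthPool:
    """Per-core bandwidth shares with a shared spare pool (toy model)."""

    def __init__(self, total_bps, cores):
        per_core = total_bps // cores
        self.share = [per_core] * cores           # each core shapes to its share
        self.pool = total_bps - per_core * cores  # spare/donated bandwidth

    def donate(self, core, bps):
        """Core has idle bandwidth: move it into the shared pool."""
        give = min(bps, self.share[core])
        self.share[core] -= give
        self.pool += give

    def borrow(self, core, bps):
        """Core wants more: take what the pool has, up to bps."""
        got = min(bps, self.pool)
        self.pool -= got
        self.share[core] += got
        return got
```

The invariant is that the shares plus the pool always sum to the
configured total, so the aggregate shaped rate never exceeds the link
even though no core ever sees global state on its fast path.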
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-06 14:44 UTC (permalink / raw)
To: Benjamin Cronce; +Cc: Dave Taht, cake, Eric Luehrsen

> On 6 Mar, 2017, at 15:30, Benjamin Cronce <bcronce@gmail.com> wrote:
>
> You could treat it like task stealing, except each core can generate
> tokens that represent a quantum of bandwidth that is only valid for
> some interval.

You’re obviously thinking of a token-bucket based shaper here. CAKE uses
a deficit-mode shaper which deliberately works a different way - it’s
more accurate on short timescales, and this actually makes a positive
difference in several important cases.

The good news is that there probably is a way to explicitly and
efficiently share bandwidth in any desired ratio across different CAKE
instances, assuming a shared-memory location can be established. I don’t
presently have the mental bandwidth to actually try doing that, though.

 - Jonathan Morton
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Benjamin Cronce @ 2017-03-06 18:08 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Dave Taht, cake, Eric Luehrsen

On Mon, Mar 6, 2017 at 8:44 AM, Jonathan Morton <chromatix99@gmail.com> wrote:
> You’re obviously thinking of a token-bucket based shaper here. CAKE
> uses a deficit-mode shaper which deliberately works a different way -
> it’s more accurate on short timescales, and this actually makes a
> positive difference in several important cases.
>
> The good news is that there probably is a way to explicitly and
> efficiently share bandwidth in any desired ratio across different CAKE
> instances, assuming a shared-memory location can be established. I
> don’t presently have the mental bandwidth to actually try doing that,
> though.

Depends on how short of a timescale you're talking about. Shared global
state that is being read and written to very quickly by multiple threads
is bad enough for a single-package system, but when you start getting to
something like an AMD Ryzen or NUMA, shared global state becomes really
expensive. Accuracy is expensive. Loosen the accuracy and gain
scalability. I would be interested in the pseudo-code or a high-level
description of what state needs to be shared and how that state is used.

I was also thinking more of some hybrid. Instead of a "token"
representing a bucketed amount of bandwidth that can be immediately
used, I was thinking more of like a "future" of bandwidth that could be
used. So instead of saying "here's a token of bandwidth", you have each
core doing its own deficit bandwidth shaping, but when a token is
received, a core can temporarily increase its assigned shaping
bandwidth. If I remember correctly, cake already supports having its
bandwidth changed on the fly.

Of course, it may be simpler to say cake is meant to be used on no more
than 8 cores on a non-NUMA CPU with all cores connected by a shared
low-latency cache.
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-06 18:46 UTC (permalink / raw)
To: Benjamin Cronce; +Cc: Dave Taht, cake, Eric Luehrsen

> On 6 Mar, 2017, at 20:08, Benjamin Cronce <bcronce@gmail.com> wrote:
>
> Depends on how short of a timescale you're talking about. Shared
> global state that is being read and written to very quickly by
> multiple threads is bad enough for a single package system, but when
> you start getting to something like an AMD Ryzen or NUMA, shared
> global state becomes really expensive. Accuracy is expensive. Loosen
> the accuracy and gain scalability.

I’m talking about timer event latency timescales, so approx 1ms on
Linux. The deficit-mode shaper automatically and naturally adapts to
whatever timer latency is actually experienced. A token-bucket shaper
has to be configured in advance with a burst size, which it uses whether
or not it is warranted to do so.

The effects are measurable on single TCP flows at 20Mbps (so slightly
more than 1Kpps peak), as they modify Codel’s behaviour. Cake achieves
higher average throughput than HTB+fq_codel with its more accurate
shaping, because Codel isn’t forced into overcorrecting after accepting
several sub-bucket bursts in sequence.

Anyway, these are concerns I would want to go away and think about for a
while before committing to a design. That’s precisely why I don’t have
mental bandwidth for it right now.

 - Jonathan Morton
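[Editor's note: the distinction Jonathan draws can be illustrated with
two minimal shapers - hedged Python sketches of the two ideas, not the
actual sch_cake or HTB code. The token bucket must be pre-configured
with a burst allowance it always honours; the deficit-style shaper just
schedules the next eligible departure time, so timer latency never turns
into stored-up credit:]

```python
class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8e9        # bytes per nanosecond
        self.burst = burst_bytes
        self.tokens = burst_bytes         # starts full: burst is implicit
        self.last = 0

    def try_send(self, now_ns, length):
        # accrue tokens for elapsed time, capped at the configured burst
        self.tokens = min(self.burst,
                          self.tokens + (now_ns - self.last) * self.rate)
        self.last = now_ns
        if self.tokens >= length:
            self.tokens -= length         # credit allows back-to-back sends
            return True
        return False

class DeficitShaper:
    def __init__(self, rate_bps):
        self.ns_per_byte = 8e9 / rate_bps
        self.time_next = 0                # next eligible departure time

    def try_send(self, now_ns, length):
        if now_ns < self.time_next:
            return False                  # not eligible yet; no stored credit
        # advance eligibility by exactly this packet's serialisation time
        self.time_next = max(self.time_next, now_ns) + length * self.ns_per_byte
        return True
```

At 20 Mbit/s a 1500-byte packet serialises in 600 µs: the deficit shaper
refuses a second packet before then, while the token bucket's burst lets
several through at once after any idle period or timer stall - the
sub-bucket bursts that push Codel into overcorrecting.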