* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Dave Täht @ 2017-03-02 21:10 UTC (permalink / raw)
To: lede-dev, cake

On 3/2/17 11:51 AM, Stijn Segers wrote:
> Thanks Sebastian, turned out to be a silly syntax error, I have it all
> disabled now. Ethtool -k and ethtool -K printing/requiring different
> stuff doesn't help of course :-)
>
> I re-enabled SQM, will see how that works out with the offloading disabled.

Would be good to know. I lost a bit of sleep lately (given how badly we
got bit by RCU on the ATF front, I worry about cake... but I can't see
how that would break, there.)

In terms of general "why does shaping use so much cpu"... I am keen to
stress that the core fq_codel algorithm is very lightweight and barely
shows up on traces when used without software rate limiting and with
BQL.

You CAN see a difference in forwarding performance at really high native
rates if you use pfifo and compare it to fq_codel on some platforms -
pfifo-fast is simpler overall. To experiment, you can re-enable
pfifo-fast in scenarios if you want (tc qdisc add dev whatever pfifo
limit somethingsane, or bfifo somethingsane)... however, things like nat
and firewall rules tend to dominate the forwarding costs, fq_codel
reduces latency greatly over pfifo, and the principal use of fq_codel is
for sqm (and now wifi).

As for software rate shaping - this is very cpu intensive no matter how
you do it. I wish we didn't have to do it - and with certain (mostly old
DSL) modems that do flow control, you don't. The only one I know of that
gets this right is the transverse geode that david woodhouse has. One of
my disappointments across the industry is not seeing BQL roll out
universally on any dsl firmwares, starting, oh, 5 years ago.
If we had ethernet devices with a programmable timer (only interrupt me
at a 40mbit rate) we could also completely eliminate software rate
shaping.... anyway, my benchmarks are showing that:

cake in its "besteffort" mode smokes HTB + fq_codel, affording over 40%
more headroom in terms of cpu with bandwidth. (Independent confirmation
across more cpu types is needed.)

In the default mode, with the new 3-tier classification, wash, nat and
triple-isolate/dual-host/dual-src features - which we hope are going to
help folk deal with torrent better in particular - it's a wash. cake is
a LOT more cpu intense than fq_codel is, especially in its default
modes, which it makes up for by being more unified. Mostly.

If you are running low on cpu and are trying to shape inbound on most of
these low-end mips devices to speeds > 60Mbits, I'd highly recommend
switching to "besteffort" rather than the 3-queue QoS default. Most ISPs
are not classifying traffic well anyway, and FQ solves nearly
everything, especially per-host fq....

But none of what I just said applies if there's a bug somewhere else!
GRO has given me fits for years now, and I'm scarred by that.

In terms of cpu costs in cake/fq_codel - dequeue, hashing, and
timestamping show up most on a trace. The rate-limiting effort where all
that is happening shows up as softirq dominating the platform. I have
*always* worried that there exist devices (particularly multi-cores)
without a first-class high-speed internal clock facility, but thus far
haven't had an issue with it (unlike on BSD, which has internal timings
good to only a ms).

As for speeding up hashing, I've been looking over various algorithms to
do that for years now; I'm open to suggestions. The fastest new ones
tend to depend on co-processor support. The fastest I've seen relies on
the CRC32 instruction, which is only in some Intel platforms.
Cake could certainly use a big round of profiling, but it is generally
my hope that we won big with it in its present form.

I welcome (especially flent) benchmarks of sqm on various architectures
we've not explored fully - notably arm ones.

My hat is off to all that have worked so hard to make this subsystem -
and also all of lede - work so well, in this release.

> Cheers
>
> Stijn

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: John Yates @ 2017-03-02 23:16 UTC (permalink / raw)
To: Dave Täht; +Cc: lede-dev, cake

On Thu, Mar 2, 2017 at 4:10 PM, Dave Täht <dave@taht.net> wrote:
> As for speeding up hashing, I've been looking over various algorithms to
> do that for years now, I'm open to suggestions. The fastest new ones
> tend to depend on co-processor support. The fastest I've seen relies on
> the CRC32 instruction which is only in some intel platforms.

This is an area where I have a fair amount of experience... What are the
requirements for this hashing function?

- How much data is being hashed? I am guessing a limited number of bytes
  rather than an entire packet payload.
- What is the typical number of hash table buckets? Is it prime or a
  power of 2?

/john
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 0:00 UTC (permalink / raw)
To: John Yates; +Cc: Dave Täht, cake, lede-dev

> On 3 Mar, 2017, at 01:16, John Yates <john@yates-sheets.org> wrote:
>
> What are the requirements for this hashing function?
> - How much data is being hashed? I am guessing a limited number of
>   bytes rather than an entire packet payload.

Generally it’s what we call the “5-tuple”: two addresses (which could be
IPv4, IPv6, or potentially Ethernet MAC), two port numbers (16 bits
each), and a transport protocol ID (1 byte).

In Cake, the hash function is by default run three times, in order to
get separate hashes over just the source address and just the
destination address, as well as the full 5-tuple. These are necessary to
operate the triple-isolate algorithm. There may be an opportunity for
optimisation by producing all three hashes in parallel.

> - What is the typical number of hash table buckets? Is it prime or a
>   power of 2?

It’s a power of two, yes. The actual number of buckets is 1024, but Cake
uses the full 32-bit hash as a “tag” for hash collision detection
without having to store and compare the entire 5-tuple.

 - Jonathan Morton
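[Editor's note: the scheme Jonathan describes - three hashes (source
host, destination host, full 5-tuple) reduced to 1024 buckets, with the
full 32-bit value kept as a collision-detection tag - can be sketched
roughly as follows. This is a hypothetical Python illustration of the
idea, not sch_cake's kernel code (which hashes dissected flow keys with
the kernel's jhash); CRC32 here is only a stand-in 32-bit hash, and the
field layout is illustrative.]

```python
import zlib

BUCKETS = 1024  # power of two, so reduction to a bucket index is one mask

def hash32(data: bytes) -> int:
    # Stand-in 32-bit hash; the real qdisc uses the kernel's jhash.
    return zlib.crc32(data) & 0xFFFFFFFF

def flow_hashes(src, dst, sport, dport, proto):
    """Return (src_host, dst_host, flow) hashes, as triple-isolate needs."""
    five_tuple = (src + dst
                  + sport.to_bytes(2, "big")
                  + dport.to_bytes(2, "big")
                  + bytes([proto]))
    return hash32(src), hash32(dst), hash32(five_tuple)

def bucket_and_tag(h):
    # Low bits pick the bucket; the full 32-bit hash is stored as a "tag"
    # so a collision can be detected without re-comparing the 5-tuple.
    return h & (BUCKETS - 1), h
```

Running the hash three times, as noted above, is pure overhead unless
the three results can be produced in one pass or in parallel.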
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: John Yates @ 2017-03-02 23:55 UTC (permalink / raw)
To: Dave Täht; +Cc: lede-dev, cake

On Thu, Mar 2, 2017 at 4:10 PM, Dave Täht <dave@taht.net> wrote:
> As for speeding up hashing, I've been looking over various algorithms to
> do that for years now, I'm open to suggestions. The fastest new ones
> tend to depend on co-processor support. The fastest I've seen relies on
> the CRC32 instruction which is only in some intel platforms.

This is an area where I have a fair amount of experience. It is a
misconception that CRC is a good hash function. It is good at detecting
errors but has poor avalanche performance.

What are the requirements for this hashing function?

- How much data is being hashed? (I would guess a limited number of
  bytes rather than an entire packet payload.)
- What is the typical number of hash table buckets? Must it be a power
  of 2? Or are you willing to make it a prime number?

Assuming you can afford a 1KB lookup table I would suggest the SBox hash
in figure four of this article:

http://papa.bretmulvey.com/post/124028832958/hash-functions

The virtue of a prime number of buckets is that when you mod your 32-bit
hash value to get a bucket index you harvest _all_ of the entropy in the
hash, not just the entropy in the bits you preserve.

/john
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 0:02 UTC (permalink / raw)
To: John Yates; +Cc: Dave Täht, cake, lede-dev

> On 3 Mar, 2017, at 01:55, John Yates <john@yates-sheets.org> wrote:
>
> The virtue of a prime number of buckets is that when you mod
> your 32-bit hash value to get a bucket index you harvest _all_
> of the entropy in the hash, not just the entropy in the bits you
> preserve.

True, but you incur the cost of a division, which is very much
non-trivial on ARM CPUs, which are increasingly common in CPE.

 - Jonathan Morton
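[Editor's note: the trade-off in this exchange is easy to demonstrate -
reducing a 32-bit hash with a prime modulus uses all of its bits but
costs a hardware divide, while a power-of-two table masks off the high
bits in one cycle. A small hypothetical Python sketch:]

```python
PRIME_BUCKETS = 1021        # largest prime below 1024
POW2_BUCKETS = 1024

def bucket_prime(h):
    return h % PRIME_BUCKETS        # harvests all 32 bits, needs a divide

def bucket_pow2(h):
    return h & (POW2_BUCKETS - 1)   # one-cycle mask, ignores the high bits
```

Two hashes differing only in their high bits (e.g. 0x00000123 and
0xABCD0123) collide under the mask but not under the prime modulus -
which is exactly the entropy John is describing, and exactly the divide
Jonathan wants to avoid on ARM.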
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Eric Luehrsen @ 2017-03-03 4:31 UTC (permalink / raw)
To: Jonathan Morton, John Yates; +Cc: cake, lede-dev, Dave Täht

On 03/02/2017 07:02 PM, Jonathan Morton wrote:
>> On 3 Mar, 2017, at 01:55, John Yates <john@yates-sheets.org> wrote:
>>
>> The virtue of a prime number of buckets is that when you mod
>> your 32-bit hash value to get a bucket index you harvest _all_
>> of the entropy in the hash, not just the entropy in the bits you
>> preserve.
>
> True, but you incur the cost of a division, which is very much
> non-trivial on ARM CPUs, which are increasingly common in CPE.
>
> - Jonathan Morton

Also, with SQM you may not want idealized entropy in your queue
distribution. It is desired by some to have host-connection fairness,
and not so much interest in stream-type fairness. So overlap in a few
hash "tags" may not always be such a bad thing, depending on how it
works itself out.
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 4:35 UTC (permalink / raw)
To: Eric Luehrsen; +Cc: John Yates, cake, lede-dev, Dave Täht

> On 3 Mar, 2017, at 06:31, Eric Luehrsen <ericluehrsen@hotmail.com> wrote:
>
> Also with SQM you may not want idealized entropy in your queue
> distribution. It is desired by some to have host-connection fairness,
> and not so much interest in stream-type fairness. So overlap in a few
> hash "tags" may not be always such a bad thing depending on how it works
> itself out.

That sort of thing is explicitly catered for by the triple-isolate
algorithm. I don’t want to rely on particular hash behaviour to achieve
an inferior result. I’d much rather have a good hash with maximal
entropy.

 - Jonathan Morton
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Eric Luehrsen @ 2017-03-03 5:00 UTC (permalink / raw)
To: Jonathan Morton; +Cc: John Yates, cake, lede-dev, Dave Täht

On 03/02/2017 11:35 PM, Jonathan Morton wrote:
> That sort of thing is explicitly catered for by the triple-isolate
> algorithm. I don’t want to rely on particular hash behaviour to achieve
> an inferior result. I’d much rather have a good hash with maximal
> entropy.
>
> - Jonathan Morton

That's not what I was going for. Agree, it would not be good to depend
on an inferior hash. You mentioned divide as a "cost." So I was
proposing a thought around a "benefit" estimate. If hash collisions are
not as important (or are they), then what is "benefit / cost?"
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-03 5:49 UTC (permalink / raw)
To: Eric Luehrsen; +Cc: John Yates, cake, lede-dev, Dave Täht

> On 3 Mar, 2017, at 07:00, Eric Luehrsen <ericluehrsen@hotmail.com> wrote:
>
> That's not what I was going for. Agree, it would not be good to depend
> on an inferior hash. You mentioned divide as a "cost." So I was
> proposing a thought around a "benefit" estimate. If hash collisions are
> not as important (or are they), then what is "benefit / cost?"

The computational cost of one divide is not the only consideration I
have in mind.

Cake’s set-associative hash is fundamentally predicated on the number of
hash buckets *not* being prime, as it requires further decomposing the
hash into a major and minor part when a collision is detected. The minor
part is then iterated to try to locate a matching or free bucket.

This is considerably easier to do and reason about when everything is a
power of two. Then, modulus is a masking operation, and divide is a
shift, either of which can be done in one cycle flat.

AFAIK, however, the main CPU cost of the hash function in Cake is not
the hash itself, but the packet dissection required to obtain the data
it operates on. This is something a profile would shed more light on.

 - Jonathan Morton
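[Editor's note: the major/minor decomposition Jonathan describes can be
sketched like this - a toy Python model of a set-associative table, not
the actual cake_hash() code; the way count and probing order here are
assumptions for illustration:]

```python
WAYS = 8            # buckets probed per set (illustrative, not Cake's value)
BUCKETS = 1024      # everything a power of two: "%" is a mask, "//" a shift

class SetAssocTable:
    def __init__(self):
        self.tags = [None] * BUCKETS  # full 32-bit hash stored as the tag

    def lookup(self, h):
        """Find the bucket already tagged with h, or claim a free one in
        the same set; return None if the whole set is occupied."""
        major = (h // WAYS) % (BUCKETS // WAYS)  # selects the set
        minor = h % WAYS                         # starting way within it
        for i in range(WAYS):
            idx = major * WAYS + (minor + i) % WAYS
            if self.tags[idx] == h:              # existing flow
                return idx
            if self.tags[idx] is None:           # free bucket: claim it
                self.tags[idx] = h
                return idx
        return None                              # genuine collision
```

With powers of two, the modulus and divide above compile to a mask and a
shift; with a prime bucket count the decomposition has no such cheap
form - which is the point of the message above.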
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Dave Taht @ 2017-03-03 6:21 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Eric Luehrsen, cake

As this is devolving into a cake-specific discussion, removing the lede
mailing list.

On Thu, Mar 2, 2017 at 9:49 PM, Jonathan Morton <chromatix99@gmail.com> wrote:
> Cake’s set-associative hash is fundamentally predicated on the number
> of hash buckets *not* being prime, as it requires further decomposing
> the hash into a major and minor part when a collision is detected. The
> minor part is then iterated to try to locate a matching or free bucket.
>
> This is considerably easier to do and reason about when everything is
> a power of two. Then, modulus is a masking operation, and divide is a
> shift, either of which can be done in one cycle flat.
>
> AFAIK, however, the main CPU cost of the hash function in Cake is not
> the hash itself, but the packet dissection required to obtain the data
> it operates on. This is something a profile would shed more light on.

Tried. Mips wasn't a good target.

The jhash3 setup cost is bad, but I agree flow dissection can be deeply
expensive. As well as the other 42+ functions a packet needs to traverse
to get from ingress to egress.

But staying on hashing:

One thing that landed in 4.10? 4.11? was fq_codel relying on a
skb->hash if one already existed (injected already by tcp, or by
hardware, or the tunneling tool). We only need to compute a partial hash
on the smaller subset of keys in that case (if we can rely on the
skb->hash, which we cannot do in the nat case).

Another thing I did, long ago, was read the (60s-era!) literature about
set-associative cpu cache architectures... and...

In all of these cases I really, really wanted to just punt all this
extra work to hardware on ingress - computing 3 hashes can be easily
done in parallel there and appended to the packet as it completes.

I have been working quite a bit more with the arm architecture of late,
and the "perf" profiler over there is vastly better than the mips one
we've had. (and aarch64 is *nice*. So is NEON) - but I hadn't got around
to dinking with cake there until yesterday.

One thing I'm noticing is that even the gigE-capable arms have weak or
non-existent L2 caches, and generally struggle to get past 700Mbits
bidirectionally on the network.

Some quick tests of pfifo vs cake on the "lime-2" (armv7 dual core) are
here:

http://www.taht.net/~d/lime-2/

The rrul tests were not particularly pleasing. [1]

...

A second thing on my mind is to be able to take advantage of A) more
cores... and B) hardware that increasingly has 4 or more lanes in it.

1) Presently, fq_codel's (and cake's) behavior there when set as a
default qdisc is sub-optimal - if you have 64 hardware queues you end up
with 64 instances, each with 1024 queues. While this might be awesome
from a FQ perspective, I really don't think the aqm will be as good. Or
maybe it might be - what happens with 64000 queues at 100Mbit?

2) It's currently impossible to shape network traffic across cores. I'd
like to imagine that with a single atomic exchange or sloppily shared
values, shaping would be feasible. (also, softirq is a single thread, I
believe)

3) mq and mqprio are commonly deployed on the high end for this.

So I've thought about doing up another version - call it - I dunno -
smq - "smart multi-queue" - and seeing how far we could get.

[1] If you are on this list and are not using flent, tough. I'm not
going through the trouble of generating graphs myself anymore.

--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Benjamin Cronce @ 2017-03-06 13:30 UTC (permalink / raw)
To: Dave Taht; +Cc: Jonathan Morton, cake, Eric Luehrsen

On Fri, Mar 3, 2017 at 12:21 AM, Dave Taht <dave.taht@gmail.com> wrote:
> 2) It's currently impossible to shape network traffic across cores. I'd
> like to imagine that with a single atomic exchange or sloppily shared
> values, shaping would be feasible.

When you need to worry about multithreading, many times perfect is very
much the enemy of good. Depending on how quickly you need to make the
network react, you could do something along the lines of a "shared pool"
of bandwidth. Each core gets a split of the bandwidth, any unused
bandwidth can be added to the pool, and cores that want more bandwidth
can take bandwidth from the pool.

You could treat it like task stealing, except each core can generate
tokens that represent a quantum of bandwidth that is only valid for some
interval. If a core suddenly needs bandwidth, it can attempt to "take
back" from its publicly shared pool. If other cores have already
borrowed, it can attempt to borrow from another core. If it can't find
any spare bandwidth, it just waits for some interval related to how long
a quantum is valid, and assumes it's safe.

Or something.. I don't know, it's 7am and I just woke up.
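[Editor's note: the accounting Benjamin sketches in words - per-core
shares plus a shared spare pool - might look something like this in
skeleton form. This is a hypothetical single-threaded Python sketch; a
real implementation would need per-CPU data and atomics, and the
quantum-expiry/interval logic he mentions is omitted:]

```python
class BandwidthPool:
    """Per-core bandwidth shares with a shared spare pool (toy model)."""

    def __init__(self, total_bps, cores):
        per_core = total_bps // cores
        self.share = [per_core] * cores           # each core shapes to its share
        self.pool = total_bps - per_core * cores  # spare/donated bandwidth

    def donate(self, core, bps):
        """Core has idle bandwidth: move it into the shared pool."""
        give = min(bps, self.share[core])
        self.share[core] -= give
        self.pool += give

    def borrow(self, core, bps):
        """Core wants more: take what the pool has, up to bps."""
        got = min(bps, self.pool)
        self.pool -= got
        self.share[core] += got
        return got
```

The invariant is that the shares plus the pool always sum to the
configured total, so the aggregate shaped rate never exceeds the link
even though no core ever sees global state on its fast path.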
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-06 14:44 UTC (permalink / raw)
To: Benjamin Cronce; +Cc: Dave Taht, cake, Eric Luehrsen

> On 6 Mar, 2017, at 15:30, Benjamin Cronce <bcronce@gmail.com> wrote:
>
> You could treat it like task stealing, except each core can generate
> tokens that represent a quantum of bandwidth that is only valid for
> some interval.

You’re obviously thinking of a token-bucket based shaper here. CAKE uses
a deficit-mode shaper which deliberately works a different way - it’s
more accurate on short timescales, and this actually makes a positive
difference in several important cases.

The good news is that there probably is a way to explicitly and
efficiently share bandwidth in any desired ratio across different CAKE
instances, assuming a shared-memory location can be established. I don’t
presently have the mental bandwidth to actually try doing that, though.

 - Jonathan Morton
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Benjamin Cronce @ 2017-03-06 18:08 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Dave Taht, cake, Eric Luehrsen

On Mon, Mar 6, 2017 at 8:44 AM, Jonathan Morton <chromatix99@gmail.com> wrote:
> You’re obviously thinking of a token-bucket based shaper here. CAKE
> uses a deficit-mode shaper which deliberately works a different way -
> it’s more accurate on short timescales, and this actually makes a
> positive difference in several important cases.
>
> The good news is that there probably is a way to explicitly and
> efficiently share bandwidth in any desired ratio across different CAKE
> instances, assuming a shared-memory location can be established. I
> don’t presently have the mental bandwidth to actually try doing that,
> though.

Depends on how short of a timescale you're talking about. Shared global
state that is being read and written to very quickly by multiple threads
is bad enough for a single-package system, but when you start getting to
something like an AMD Ryzen or NUMA, shared global state becomes really
expensive. Accuracy is expensive. Loosen the accuracy and gain
scalability. I would be interested in the pseudo-code or a high-level
description of what state needs to be shared and how that state is used.

I was also thinking more of some hybrid. Instead of a "token"
representing a bucketed amount of bandwidth that can be immediately
used, I was thinking more of like a "future" of bandwidth that could be
used. So instead of saying "here's a token of bandwidth", you have each
core doing its own deficit bandwidth shaping, but when a token is
received, a core can temporarily increase its assigned shaping
bandwidth. If I remember correctly, cake already supports having its
bandwidth changed on the fly.

Of course, it may be simpler to say cake is meant to be used on no more
than 8 cores on a non-NUMA CPU with all cores connected by a shared
low-latency cache.
* Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

From: Jonathan Morton @ 2017-03-06 18:46 UTC (permalink / raw)
To: Benjamin Cronce; +Cc: Dave Taht, cake, Eric Luehrsen

> On 6 Mar, 2017, at 20:08, Benjamin Cronce <bcronce@gmail.com> wrote:
>
> Depends on how short of a timescale you're talking about. Shared
> global state that is being read and written to very quickly by
> multiple threads is bad enough for a single package system, but when
> you start getting to something like an AMD Ryzen or NUMA, shared
> global state becomes really expensive. Accuracy is expensive. Loosen
> the accuracy and gain scalability.

I’m talking about timer event latency timescales, so approx 1ms on
Linux. The deficit-mode shaper automatically and naturally adapts to
whatever timer latency is actually experienced. A token-bucket shaper
has to be configured in advance with a burst size, which it uses whether
or not it is warranted to do so.

The effects are measurable on single TCP flows at 20Mbps (so slightly
more than 1Kpps peak), as they modify Codel’s behaviour. Cake achieves
higher average throughput than HTB+fq_codel with its more accurate
shaping, because Codel isn’t forced into overcorrecting after accepting
several sub-bucket bursts in sequence.

Anyway, these are concerns I would want to go away and think about for a
while before committing to a design. That’s precisely why I don’t have
mental bandwidth for it right now.

 - Jonathan Morton
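[Editor's note: the distinction Jonathan draws can be illustrated with
two minimal shapers - hedged Python sketches of the two ideas, not the
actual sch_cake or HTB code. The token bucket must be pre-configured
with a burst allowance it always honours; the deficit-style shaper just
schedules the next eligible departure time, so timer latency never turns
into stored-up credit:]

```python
class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8e9        # bytes per nanosecond
        self.burst = burst_bytes
        self.tokens = burst_bytes         # starts full: burst is implicit
        self.last = 0

    def try_send(self, now_ns, length):
        # accrue tokens for elapsed time, capped at the configured burst
        self.tokens = min(self.burst,
                          self.tokens + (now_ns - self.last) * self.rate)
        self.last = now_ns
        if self.tokens >= length:
            self.tokens -= length         # credit allows back-to-back sends
            return True
        return False

class DeficitShaper:
    def __init__(self, rate_bps):
        self.ns_per_byte = 8e9 / rate_bps
        self.time_next = 0                # next eligible departure time

    def try_send(self, now_ns, length):
        if now_ns < self.time_next:
            return False                  # not eligible yet; no stored credit
        # advance eligibility by exactly this packet's serialisation time
        self.time_next = max(self.time_next, now_ns) + length * self.ns_per_byte
        return True
```

At 20 Mbit/s a 1500-byte packet serialises in 600 µs: the deficit shaper
refuses a second packet before then, while the token bucket's burst lets
several through at once after any idle period or timer stall - the
sub-bucket bursts that push Codel into overcorrecting.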