* [Codel] The next slice of cake @ 2015-03-17 20:08 Jonathan Morton 2015-03-18 7:22 ` Sebastian Moeller 2015-03-18 15:10 ` Kathleen Nichols 0 siblings, 2 replies; 22+ messages in thread From: Jonathan Morton @ 2015-03-17 20:08 UTC (permalink / raw) To: codel, cerowrt-devel After far too long, it looks like I’ll have the opportunity to work on sch_cake a bit more. So here’s a little bit of a “state of the union” speech about what we’ve got and what I’m planning to add to it. So far we’ve got a deficit-mode, non-bursting shaper that works pretty well, and an integrated implementation of fq_codel that tunes itself (that is, the target delay) to the bandwidth set on the shaper. The configuration is “as easy as cake”; the intention is that you can just specify one parameter (the bandwidth to shape at) and leave everything else at the defaults; there simply aren’t very many visible knobs, because they aren’t needed. We’ve also got Diffserv classification, and that part hasn’t been so successful. Each class grabs all traffic with some subset of the codepoints, and stuffs it into a separate shaper+fq_codel instance, and the higher-priority shapers steal bandwidth from the lower ones to enforce priority. High-priority classes can only use a limited amount of bandwidth, exactly as specified in generic Diffserv PHBs. It works, perfectly as designed, but the resulting behaviour isn’t particularly desirable from an end-user perspective. In particular, people run tests using best-effort traffic to see how much bandwidth they’re getting, resulting in complaints that cake had to be given a bigger number to get the correct throughput - which of course also stops it from functioning correctly when background traffic is added to the mix. So that needed a rethink. Incidentally, the existing Diffserv implementation can be disabled by specifying the “besteffort” keyword. This lumps all traffic into a single class, handled by a single shaper at the configured rate. Cake already works pretty well in that mode; sometimes I turn the shaper down to analogue-modem speeds and note, with some satisfaction, that everything *still* works. Except YouTube, but that’s only because streaming video really does need more than analogue-modem bandwidth. As for performance, I’m able to make my ancient Pentium-MMX shape at over 50 Mbps, summing traffic in both directions between two bridged Fast Ethernet cards. This limitation is probably a combination of timer latency and context-switch overhead. I don’t expect it to improve much, unless we find a way to seriously reduce those overheads (which are already quite low for a modern desktop OS). A faster machine with better timers gets better performance, of course. So there are two big things I want to change in the next version: The easy part (at least in terms of how many unknowns there are) is adjusting the flow-queueing part so that it uses set-associative hashing instead of straight hashing when selecting a queue. This should reduce the incidence of hash collisions considerably for a given number of flow queues, or conversely provide equivalent collision performance with a smaller number of queues. The more interesting part is to rework the Diffserv prioritiser so that it behaves more usefully. I think I’ve hit upon the right idea which should make this work in practice - instead of individually hard-shaping each class, use the shaper logic as a threshold function between high and low priority, and implement a single shaper to handle all traffic. 
The priority function can then be handled by a weighted DRR system - which is already in place, but doesn’t do much - with just that small modification for changing the weights based on the shaper state. So high-priority traffic gets high priority - but only if it limits itself to a reasonable bandwidth. Above that bandwidth, it gets low priority, but is still able to use the full shaped bandwidth if nobody else contends for it. And (unlike say HFSC) we need precisely two parameters per class to do this, both specified as ratios rather than hard bandwidth numbers: a bandwidth share (which determines both the shaper setting and the low-priority-mode DRR weighting) and a priority factor (which determines the high-priority-mode DRR weighting). So if those knobs end up being exposed to userspace, they’ll be easier to understand and thus use correctly. All of this feeds my main goal with Diffserv, which is to start giving applications natural incentives to mark their traffic appropriately. Each class has both an advantage, and a tradeoff which must be accepted to realise that advantage. If you need absolutely minimal latency, you can choose a high-priority class, but you’ll have to be frugal about bandwidth. If you need maximum throughput, you’ll have to put up with reduced priority compared to latency-sensitive traffic. And if you want to be altruistic, you can choose to mark your stuff as bulk, background traffic, and it’ll be treated accordingly. All of this is in accordance with existing RFCs. A small caveat: cake is not designed for wifi. It’s designed for links that can at least be treated as full-duplex to a close approximation. Shared-medium links *can* behave like that, if they’re shaped to a miserly enough degree, but we really need something different for wifi - although several of cake’s components and ideas could be used in such a qdisc. Roll on cake3. - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
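For readers trying to picture the two-parameter scheme above, here is a minimal sketch of how a threshold-switched DRR weight could be computed. It is plain C with invented names (cake_class_sketch, update_class_weight and so on) and is not taken from sch_cake; it only illustrates the idea of using per-class shaper state to flip between a priority weighting and a bandwidth-share weighting.

/* Threshold-switched DRR weighting, as sketched in prose above.
 * All identifiers are invented for illustration; this is not sch_cake code. */
#include <stdint.h>
#include <stdbool.h>

struct cake_class_sketch {
    uint32_t bandwidth_share;  /* relative share of the shaped rate (low-priority weight) */
    uint32_t priority_factor;  /* weight used while the class stays under its share */
    uint64_t ns_per_byte;      /* serialisation cost implied by bandwidth_share */
    uint64_t time_next_ok;     /* threshold "shaper": when the class is back under its share */
    uint32_t quantum;          /* DRR quantum actually used this round */
};

/* Pick the DRR weight from the per-class threshold state.  A class that
 * exceeds its share is never excluded; it is merely demoted to its plain
 * bandwidth share, so it can still soak up otherwise idle capacity. */
static void update_class_weight(struct cake_class_sketch *cls,
                                uint64_t now_ns, uint32_t base_quantum)
{
    bool under_share = (now_ns >= cls->time_next_ok);

    cls->quantum = base_quantum *
        (under_share ? cls->priority_factor : cls->bandwidth_share);
}

/* Charge a dequeued packet of 'len' bytes against the class threshold. */
static void charge_class(struct cake_class_sketch *cls,
                         uint64_t now_ns, uint32_t len)
{
    uint64_t busy_until = cls->time_next_ok > now_ns ? cls->time_next_ok : now_ns;

    cls->time_next_ok = busy_until + (uint64_t)len * cls->ns_per_byte;
}

The key property this is meant to show is that only one real shaper runs on the aggregate; the per-class "shaper" exists solely to decide which of the two weights applies.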
* Re: [Codel] The next slice of cake 2015-03-17 20:08 [Codel] The next slice of cake Jonathan Morton @ 2015-03-18 7:22 ` Sebastian Moeller 2015-03-18 8:41 ` Jonathan Morton 2015-03-18 15:10 ` Kathleen Nichols 1 sibling, 1 reply; 22+ messages in thread From: Sebastian Moeller @ 2015-03-18 7:22 UTC (permalink / raw) To: Jonathan Morton, codel, cerowrt-devel Hi Jonathan, Great work. @Dave is there a cerowrt or openwrt build around that includes cake? On March 17, 2015 9:08:39 PM GMT+01:00, Jonathan Morton <chromatix99@gmail.com> wrote: >After far too long, it looks like I’ll have the opportunity to work on >sch_cake a bit more. So here’s a little bit of a “state of the union” >speech about what we’ve got and what I’m planing to add to it. > >So far we’ve got a deficit-mode, non-bursting shaper that works pretty >well, and an integrated implementation of fq_codel that tunes itself >(that is, the target delay) to the bandwidth set on the shaper. The >configuration is “as easy as cake”; the intention is that you can just >specify one parameter (the bandwidth to shape at) and leave everything >else at the defaults; there simply aren’t very many visible knobs, >because they aren’t needed. > >We’ve also got Diffserv classification, and that part hasn’t been so >successful. Each class grabs all traffic with some subset of the >codepoints, and stuffs them into a separate shaper+fq_codel instance, >and the higher-priority shapers steal bandwidth from the lower ones to >enforce priority. High-priority classes can only use a limited amount >of bandwidth, exactly as specified in generic Diffserv PHBs. > >It works, perfectly as designed, but the resulting behaviour isn’t >particularly desirable from an end-user perspective. In particular, >people run tests using best-effort traffic to see how much bandwidth >they’re getting, resulting in complaints that cake had to be given a >bigger number to get the correct throughput - which of course also >stops it from functioning correctly when background traffic is added to >the mix. So that needed a rethink. I wonder, are the low priority classes configured with a guaranteed minimum bandwidth to avoid starvation? And will they opportunistically grab all left over bandwidth to fill the pipe? Then speed test should just work as long as there is no competing traffic... > >Incidentally, the existing Diffserv implementation can be disabled by >specifying the “besteffort” keyword. This lumps all traffic into a >single class, handled by a single shaper at the configured rate. Cake >already works pretty well in that mode; sometimes I turn the shaper >down to analogue-modem speeds and note, with some satisfaction, that >everything *still* works. Except YouTube, but that’s only because >streaming video really does need more than analogue-modem bandwidth. > >As for performance, I’m able to make my ancient Pentium-MMX shape at >over 50 Mbps, summing traffic in both directions between two bridged >Fast Ethernet cards. This limitation is probably a combination of >timer latency and context-switch overhead. I don’t expect it to >improve much, unless we find a way to seriously reduce those overheads >(which are already quite low for a modern desktop OS). A faster >machine with better timers gets better performance, of course. I am probably out of my mind, but couldn't it help if cake would allow a fixed cycle mode where it would process 50ms or so worth of packets pass them to the interface, and then sleep until the next 50ms block start. 
This should just be a fallback mode to not degrade badly under overload; I would hope that could help, as it will be far fewer timers to handle and maybe less context switching, as I would guess cake processing happens entirely in kernel space. I probably am overlooking something that makes my idea a non-starter ;) > >So there are two big things I want to change in the next version: > >The easy part (at least in terms of how many unknowns there are) is >adjusting the flow-queueing part so that it uses set-associative >hashing instead of straight hashing when selecting a queue. This >should reduce the incidence of hash collisions considerably for a given >number of flow queues, or conversely provide equivalent collision >performance with a smaller number of queues. > >The more interesting part is to rework the Diffserv prioritiser so that >it behaves more usefully. I think I’ve hit upon the right idea which >should make this work in practice - instead of individually >hard-shaping each class, instead use the shaper logic as a threshold >function between high and low priority, and instead implement a single >shaper to handle all traffic. The priority function can then be >handled by a weighted DRR system - which is already in place, but >doesn’t do much - with just that small modification for changing the >weights based on the shaper state. > >So high-priority traffic gets high priority - but only if it limits >itself to a reasonable bandwidth. Above that bandwidth, it gets low >priority, but is still able to use the full shaped bandwidth if nobody >else contends for it. I think the highest priority band should only get its bandwidth quota, and have no silent priority demotion; but otherwise I think that idea that classes can pick up unused bandwidth sounds sane, especially for best effort and background. And (unlike say HFSC) we need precisely two >parameters per class to do this, both specified as ratios rather than >hard bandwidth numbers: a bandwidth share (which determines both the >shaper setting and the low-priority-mode DRR weighting) and a priority >factor (which determines the high-priority-mode DRR weighting). So if >those knobs end up being exposed to userspace, they’ll be easier to >understand and thus use correctly. > >All of this feeds my main goal with Diffserv, which is to start giving >applications natural incentives to mark their traffic appropriately. >Each class has both an advantage, and a tradeoff which must be accepted >to realise that advantage. If you need absolutely minimal latency, you >can choose a high-priority class, but you’ll have to be frugal about >bandwidth. If you need maximum throughput, you’ll have to put up with >reduced priority compared to latency-sensitive traffic. And if you >want to be altruistic, you can choose to mark your stuff as bulk, >background traffic, and it’ll be treated accordingly. All of this is >in accordance with existing RFCs. > >A small caveat: cake is not designed for wifi. It’s designed for links >that can at least be treated as full-duplex to a close approximation. >Shared-medium links *can* behave like that, if they’re shaped to a >miserly enough degree, but we really need something different for wifi >- although several of cake’s components and ideas could be used in such >a qdisc. > >Roll on cake3. 
> > - Jonathan Morton Many thanks and best regards Sebastian > >_______________________________________________ >Codel mailing list >Codel@lists.bufferbloat.net >https://lists.bufferbloat.net/listinfo/codel -- Sent from my Android device with K-9 Mail. Please excuse my brevity. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-03-18 7:22 ` Sebastian Moeller @ 2015-03-18 8:41 ` Jonathan Morton 2015-03-18 10:39 ` Sebastian Moeller 0 siblings, 1 reply; 22+ messages in thread From: Jonathan Morton @ 2015-03-18 8:41 UTC (permalink / raw) To: Sebastian Moeller; +Cc: codel, cerowrt-devel > I wonder, are the low priority classes configured with a guaranteed minimum bandwidth to avoid starvation? And will they opportunistically grab all left over bandwidth to fill the pipe? Then speed test should just work as long as there is no competing traffic… The problem is that, in the present version, *only* the bulk/background class can use the full pipe. Best effort gets a large fraction as its limit, but it’s not full. Existing speed tests use best-effort traffic, and that’s not likely to change soon. The next version should change that. > I am probably out of my mind, but couldn't it help if cake would allow a fixed cycle mode where it would process 50ms or so worth of packets pass them to the interface, and then sleep until the next 50ms block start. This should just be a fallback mode to not degrade badly under overload… There is already such a mode to cope with limited-resolution timers and the existing overheads. Without it, the Pentium-MMX is limited to a rather low rate (since it then has to wait for a timer interrupt for alternate packets). At 50Mbps+, it’s not too far off what it can bridge without shaping (60Mbps+). For some reason, the little CPE boxes still lose more performance than that to shaping. Note that due to the very nature of shaping, the link is always either effectively idle (in which case an arriving packet is dispatched immediately, without waiting for a timer), or overloaded (in which case packets are delivered according to a schedule). The question is whether the shaping rate also overloads the hardware. In any case, bursting for fifty whole milliseconds at a time would absolutely *destroy* cake’s latency performance. I’m not going to do that. Accommodating timer performance is the only concession to bursting I’m willing to make. > I think the highest priority band should only get its bandwidth quota, and have no silent priority demotion; but otherwise I think that idea that classes can pick up unused bandwidth sounds sane, especially for best effort and background. Let’s see how well it works this way. It should be fairly easy to adjust this aspect of behaviour later on. - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
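As an aside, a rough sketch of the "either idle or scheduled" dequeue behaviour described in the message above, with invented names (shaper_sketch, shaper_dequeue) rather than the real qdisc API:

/* "Idle or scheduled" shaping sketch; not the actual sch_cake code. */
#include <stdint.h>
#include <stddef.h>

struct shaper_sketch {
    uint64_t time_next_packet;  /* earliest departure time for the next packet, ns */
    uint64_t ns_per_byte;       /* derived from the configured shaping rate */
};

/* Returns bytes sent, or 0 if nothing was sent (empty queue, or a timer
 * must fire at s->time_next_packet and call us again). */
static size_t shaper_dequeue(struct shaper_sketch *s, uint64_t now_ns,
                             size_t (*send_next)(void *ctx), void *ctx)
{
    size_t len;

    if (now_ns < s->time_next_packet)
        return 0;               /* overloaded: deliveries follow the schedule */

    len = send_next(ctx);       /* effectively idle: dispatch immediately */
    if (!len)
        return 0;

    /* Schedule the next departure from "now", not from the old deadline,
     * so an idle period earns no burst credit. */
    s->time_next_packet = now_ns + (uint64_t)len * s->ns_per_byte;
    return len;
}

The timer-accommodation concession mentioned above would amount to letting a small batch of packets out per wakeup when the timer granularity is coarse; that detail is deliberately omitted here.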
* Re: [Codel] The next slice of cake 2015-03-18 8:41 ` Jonathan Morton @ 2015-03-18 10:39 ` Sebastian Moeller 0 siblings, 0 replies; 22+ messages in thread From: Sebastian Moeller @ 2015-03-18 10:39 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel, cerowrt-devel Hi Jonathan, On Mar 18, 2015, at 09:41 , Jonathan Morton <chromatix99@gmail.com> wrote: >> I wonder, are the low priority classes configured with a guaranteed minimum bandwidth to avoid starvation? And will they opportunistically grab all left over bandwidth to fill the pipe? Then speed test should just work as long as there is no competing traffic… > > The problem is that, in the present version, *only* the bulk/background class can use the full pipe. Best effort gets a large fraction as its limit, but it’s not full. Existing speed tests use best-effort traffic, and that’s not likely to change soon. > > The next version should change that. Great. > >> I am probably out of my mind, but couldn't it help if cake would allow a fixed cycle mode where it would process 50ms or so worth of packets pass them to the interface, and then sleep until the next 50ms block start. This should just be a fallback mode to not degrade badly under overload… > > There is already such a mode to cope with limited-resolution timers and the existing overheads. Without it, the Pentium-MMX is limited to a rather low rate (since it then has to wait for a timer interrupt for alternate packets). At 50Mbps+, it’s not too far off what it can bridge without shaping (60Mbps+). For some reason, the little CPE boxes still lose more performance than that to shaping. Excellent. > > Note that due to the very nature of shaping, the link is always either effectively idle (in which case an arriving packet is dispatched immediately, without waiting for a timer), or overloaded (in which case packets are delivered according to a schedule). The question is whether the shaping rate also overloads the hardware. > > In any case, bursting for fifty whole milliseconds at a time would absolutely *destroy* cake’s latency performance. I’m not going to do that. Accommodating timer performance is the only concession to bursting I’m willing to make. Well, I should not have put a number in, I know, I was mainly trying to push for a batching mode to distribute the timer and context switching costs over more work done, like the batching patches Jesper got into the kernel. I guess I should look at the cake code and try to understand what is happening ;) I think with HTB latency starts to suffer once the shaped rates exceed the hardware’s capabilities, so I think of this as a trade-off between latency and bandwidth; if cake does not show this behavior but just bounds the effective bandwidth instead of increasing latency the whole configurable tradeoff idea might make not much sense ;) > >> I think the highest priority band should only get its bandwidth quota, and have no silent priority demotion; but otherwise I think that idea that classes can pick up unused bandwidth sounds sane, especially for best effort and background. > > Let’s see how well it works this way. It should be fairly easy to adjust this aspect of behaviour later on. > > - Jonathan Morton > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-03-17 20:08 [Codel] The next slice of cake Jonathan Morton 2015-03-18 7:22 ` Sebastian Moeller @ 2015-03-18 15:10 ` Kathleen Nichols 2015-03-18 15:20 ` Jonathan Morton 1 sibling, 1 reply; 22+ messages in thread From: Kathleen Nichols @ 2015-03-18 15:10 UTC (permalink / raw) To: codel Wow, this seems like great work! I'm curious about the statement about "tunes itself (that is, the target delay) to the bandwidth set on the shaper". How are you relating target delay to bandwidth? thanks, Kathie On 3/17/15 1:08 PM, Jonathan Morton wrote: > After far too long, it looks like I’ll have the opportunity to work > on sch_cake a bit more. So here’s a little bit of a “state of the > union” speech about what we’ve got and what I’m planing to add to > it. > > So far we’ve got a deficit-mode, non-bursting shaper that works > pretty well, and an integrated implementation of fq_codel that tunes > itself (that is, the target delay) to the bandwidth set on the > shaper. The configuration is “as easy as cake”; the intention is > that you can just specify one parameter (the bandwidth to shape at) > and leave everything else at the defaults; there simply aren’t very > many visible knobs, because they aren’t needed. > > We’ve also got Diffserv classification, and that part hasn’t been so > successful. Each class grabs all traffic with some subset of the > codepoints, and stuffs them into a separate shaper+fq_codel instance, > and the higher-priority shapers steal bandwidth from the lower ones > to enforce priority. High-priority classes can only use a limited > amount of bandwidth, exactly as specified in generic Diffserv PHBs. > > It works, perfectly as designed, but the resulting behaviour isn’t > particularly desirable from an end-user perspective. In particular, > people run tests using best-effort traffic to see how much bandwidth > they’re getting, resulting in complaints that cake had to be given a > bigger number to get the correct throughput - which of course also > stops it from functioning correctly when background traffic is added > to the mix. So that needed a rethink. > > Incidentally, the existing Diffserv implementation can be disabled by > specifying the “besteffort” keyword. This lumps all traffic into a > single class, handled by a single shaper at the configured rate. > Cake already works pretty well in that mode; sometimes I turn the > shaper down to analogue-modem speeds and note, with some > satisfaction, that everything *still* works. Except YouTube, but > that’s only because streaming video really does need more than > analogue-modem bandwidth. > > As for performance, I’m able to make my ancient Pentium-MMX shape at > over 50 Mbps, summing traffic in both directions between two bridged > Fast Ethernet cards. This limitation is probably a combination of > timer latency and context-switch overhead. I don’t expect it to > improve much, unless we find a way to seriously reduce those > overheads (which are already quite low for a modern desktop OS). A > faster machine with better timers gets better performance, of > course. > > So there are two big things I want to change in the next version: > > The easy part (at least in terms of how many unknowns there are) is > adjusting the flow-queueing part so that it uses set-associative > hashing instead of straight hashing when selecting a queue. 
This > should reduce the incidence of hash collisions considerably for a > given number of flow queues, or conversely provide equivalent > collision performance with a smaller number of queues. > > The more interesting part is to rework the Diffserv prioritiser so > that it behaves more usefully. I think I’ve hit upon the right idea > which should make this work in practice - instead of individually > hard-shaping each class, instead use the shaper logic as a threshold > function between high and low priority, and instead implement a > single shaper to handle all traffic. The priority function can then > be handled by a weighted DRR system - which is already in place, but > doesn’t do much - with just that small modification for changing the > weights based on the shaper state. > > So high-priority traffic gets high priority - but only if it limits > itself to a reasonable bandwidth. Above that bandwidth, it gets low > priority, but is still able to use the full shaped bandwidth if > nobody else contends for it. And (unlike say HFSC) we need precisely > two parameters per class to do this, both specified as ratios rather > than hard bandwidth numbers: a bandwidth share (which determines both > the shaper setting and the low-priority-mode DRR weighting) and a > priority factor (which determines the high-priority-mode DRR > weighting). So if those knobs end up being exposed to userspace, > they’ll be easier to understand and thus use correctly. > > All of this feeds my main goal with Diffserv, which is to start > giving applications natural incentives to mark their traffic > appropriately. Each class has both an advantage, and a tradeoff > which must be accepted to realise that advantage. If you need > absolutely minimal latency, you can choose a high-priority class, but > you’ll have to be frugal about bandwidth. If you need maximum > throughput, you’ll have to put up with reduced priority compared to > latency-sensitive traffic. And if you want to be altruistic, you can > choose to mark your stuff as bulk, background traffic, and it’ll be > treated accordingly. All of this is in accordance with existing > RFCs. > > A small caveat: cake is not designed for wifi. It’s designed for > links that can at least be treated as full-duplex to a close > approximation. Shared-medium links *can* behave like that, if > they’re shaped to a miserly enough degree, but we really need > something different for wifi - although several of cake’s components > and ideas could be used in such a qdisc. > > Roll on cake3. > > - Jonathan Morton > > _______________________________________________ Codel mailing list > Codel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/codel > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-03-18 15:10 ` Kathleen Nichols @ 2015-03-18 15:20 ` Jonathan Morton 2015-03-18 21:20 ` Dave Taht 0 siblings, 1 reply; 22+ messages in thread From: Jonathan Morton @ 2015-03-18 15:20 UTC (permalink / raw) To: codel > On 18 Mar, 2015, at 17:10, Kathleen Nichols <nichols@pollere.com> wrote: > > How are you relating target delay to bandwidth? Essentially, I use 5ms as a minimum, and increase it if necessary to accommodate a couple of MTU-sized packets at the shaping rate. This keeps things nicely under control at low bandwidths, and I find that cake remains useful and usable even at 64Kbps (without making even the usual adjustments to host or link configuration for such low speeds). I can do this in cake because the shaping rate is known, whereas the pure codel and fq_codel qdiscs do not have reliable link-speed information. - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
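A minimal sketch of that tuning rule, assuming a 5 ms floor and two MTU-sized packets of serialisation time; the exact constants and fixed-point arithmetic in cake may differ:

/* Auto-tune the Codel target from the shaping rate, per the rule above.
 * Illustrative only; not the actual sch_cake code. */
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL
#define NSEC_PER_SEC  1000000000ULL

static uint64_t autotune_target_ns(uint64_t rate_bps, uint32_t mtu_bytes)
{
    uint64_t target = 5 * NSEC_PER_MSEC;   /* the usual Codel default as a floor */

    if (rate_bps) {
        /* serialisation time of roughly two MTU-sized packets at the shaped rate */
        uint64_t two_mtu_ns = (uint64_t)2 * mtu_bytes * 8 * NSEC_PER_SEC / rate_bps;

        if (two_mtu_ns > target)
            target = two_mtu_ns;
    }
    return target;
}

For example, at 1 Mbit/s with a 1500-byte MTU this yields roughly 24 ms, while at 10 Mbit/s and above the 5 ms floor dominates; at 64 kbit/s the target grows to a few hundred milliseconds, which is what keeps very slow links usable.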
* Re: [Codel] The next slice of cake 2015-03-18 15:20 ` Jonathan Morton @ 2015-03-18 21:20 ` Dave Taht 2015-03-21 16:09 ` Dave Taht 0 siblings, 1 reply; 22+ messages in thread From: Dave Taht @ 2015-03-18 21:20 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel [-- Attachment #1: Type: text/plain, Size: 2229 bytes --] On Wed, Mar 18, 2015 at 8:20 AM, Jonathan Morton <chromatix99@gmail.com> wrote: > > > On 18 Mar, 2015, at 17:10, Kathleen Nichols <nichols@pollere.com> wrote: > > > > How are you relating target delay to bandwidth? > > Essentially, I use 5ms as a minimum, and increase it if necessary to > accommodate a couple of MTU-sized packets at the shaping rate. This keeps > things nicely under control at low bandwidths, and I find that cake remains > useful and usable even at 64Kbps (without making even the usual adjustments > to host or link configuration for such low speeds). > In the cake2 (or maybe it was the unpublished cake3) version, I had a lighter weight version of the codel algorithm, that did not have a target parameter at all. Instead it just took the interval parameter and shifted it right 4 (yielding a target of 6.xms from an interval of 100ms) This saves on a memory access (and storage per queue!) , and I felt that any differences in behavior would be unnoticeable. And they were. This is also above the bound for cable-modem media access that greg white (rightly or wrongly) believed existed. So I have no problem in eliminating "target" entirely. Cake (without bandwidth shaping engaged) uses more cpu than fq_codel did and this was one of many optimizations I'd attempted (or successfully added). Cake with shaping is a bit less cpu than sqm-scripts htb + fq_codel + filters. It also looked like cake could be poured into gates, with a bit more research, and testing. > I can do this in cake because the shaping rate is known, whereas the pure > codel and fq_codel qdiscs do not have reliable link-speed information. As for this bit, we seemed to need to account for a MTU's worth of data at the lower speeds, and I did not explore what fiddling with the interval and auto-calc-ing the target did at these speeds, as yet. > - Jonathan Morton > > _______________________________________________ > Codel mailing list > Codel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/codel > -- Dave Täht Let's make wifi fast, less jittery and reliable again! https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb [-- Attachment #2: Type: text/html, Size: 3339 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [Codel] The next slice of cake 2015-03-18 21:20 ` Dave Taht @ 2015-03-21 16:09 ` Dave Taht 2015-03-21 23:55 ` Jonathan Morton 2015-03-22 9:39 ` Sebastian Moeller 0 siblings, 2 replies; 22+ messages in thread From: Dave Taht @ 2015-03-21 16:09 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel In terms of a cake feature request... fq_codel has a maximum number of packets limit, which is set very large (10000) to accommodate 10GigE. It is arbitrarily patched down in openwrt (1000), and reduced still further by the sqm-scripts (also arbitrarily), to reduce the impact of a packet flood on machines with very little memory. I would like cake to have a byte limit instead. Now, per packet overhead in linux is very high, something like 256 extra bytes per packet (4x1 vs the smallest size). However, a packet limit can be much harder on memory than that - overhead be as large as 64k per "packet" on TSO/GSO enabled systems, (dynamic range of 1x1000!), vs using a byte limit which would only have issues with lots of small packets. cake's bandwidth parameter can easily set a desirable max byte limit at (say) 2 or 4x the BDP, and key off of that and not bother to track a per packet limit. It would be nice for cake (without shaping enabled) to be about to automatically sense the actual interface rate and size this outer limit appropriately, but I don't think mechanisms exist to do that. On Wed, Mar 18, 2015 at 2:20 PM, Dave Taht <dave.taht@gmail.com> wrote: > > > On Wed, Mar 18, 2015 at 8:20 AM, Jonathan Morton <chromatix99@gmail.com> > wrote: >> >> >> > On 18 Mar, 2015, at 17:10, Kathleen Nichols <nichols@pollere.com> wrote: >> > >> > How are you relating target delay to bandwidth? >> >> Essentially, I use 5ms as a minimum, and increase it if necessary to >> accommodate a couple of MTU-sized packets at the shaping rate. This keeps >> things nicely under control at low bandwidths, and I find that cake remains >> useful and usable even at 64Kbps (without making even the usual adjustments >> to host or link configuration for such low speeds). > > > In the cake2 (or maybe it was the unpublished cake3) version, I had a > lighter weight version of the codel algorithm, that did not have a target > parameter at all. Instead it just took the interval parameter and shifted it > right 4 (yielding a target of 6.xms from an interval of 100ms) > > This saves on a memory access (and storage per queue!) , and I felt that any > differences in behavior would be unnoticeable. And they were. This is also > above the bound for cable-modem media access that greg white (rightly or > wrongly) believed existed. So I have no problem in eliminating "target" > entirely. > > Cake (without bandwidth shaping engaged) uses more cpu than fq_codel did and > this was one of many optimizations I'd attempted (or successfully added). > Cake with shaping is a bit less cpu than sqm-scripts htb + fq_codel + > filters. > > It also looked like cake could be poured into gates, with a bit more > research, and testing. > >> >> I can do this in cake because the shaping rate is known, whereas the pure >> codel and fq_codel qdiscs do not have reliable link-speed information. > > > As for this bit, we seemed to need to account for a MTU's worth of data at > the lower speeds, and I did not explore what fiddling with the interval and > auto-calc-ing the target did at these speeds, as yet. 
> >> >> - Jonathan Morton >> >> _______________________________________________ >> Codel mailing list >> Codel@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/codel > > > > > -- > Dave Täht > Let's make wifi fast, less jittery and reliable again! > > https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb -- Dave Täht Let's make wifi fast, less jittery and reliable again! https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-03-21 16:09 ` Dave Taht @ 2015-03-21 23:55 ` Jonathan Morton 2015-03-22 9:39 ` Sebastian Moeller 1 sibling, 0 replies; 22+ messages in thread From: Jonathan Morton @ 2015-03-21 23:55 UTC (permalink / raw) To: Dave Taht; +Cc: codel > On 21 Mar, 2015, at 18:09, Dave Taht <dave.taht@gmail.com> wrote: > > I would like cake to have a byte limit instead. That can probably be arranged. > at (say) 2 or 4x the BDP Hmm, we know the bandwidth, but what’s the delay? It probably makes sense to use the Codel interval parameter for that purpose. Four times that should make a reasonable headroom for most situations, and the drop-from-longest-queue behaviour should deal with excursions beyond that. > It would be nice for cake (without shaping enabled) to be about to > automatically sense the actual interface rate and size this outer > limit appropriately, but I don't think mechanisms exist to do that. It may be feasible to guess an initial value, and refine it based on the observed throughput when the queue is busy. - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
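A sketch of that headroom calculation, under the assumption that the Codel interval stands in for the path delay; names and the exact constants are invented, not cake's:

/* Buffer ceiling of roughly 4 x BDP, using the Codel interval as the delay
 * estimate.  Illustrative only; the eventual cake code may differ. */
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

static uint64_t buffer_limit_bytes(uint64_t rate_bps, uint64_t interval_ns)
{
    /* bandwidth-delay product in bytes */
    uint64_t bdp = rate_bps / 8 * interval_ns / NSEC_PER_SEC;

    /* four intervals of headroom; excursions beyond this are handled by
     * dropping from the head of the longest queue */
    return 4 * bdp;
}

For instance, 20 Mbit/s with a 100 ms interval gives a 250 kB BDP and therefore a 1 MB buffer ceiling.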
* Re: [Codel] The next slice of cake 2015-03-21 16:09 ` Dave Taht 2015-03-21 23:55 ` Jonathan Morton @ 2015-03-22 9:39 ` Sebastian Moeller 2015-03-22 10:43 ` Jonathan Morton 1 sibling, 1 reply; 22+ messages in thread From: Sebastian Moeller @ 2015-03-22 9:39 UTC (permalink / raw) To: Dave Täht; +Cc: Jonathan Morton, codel Hi Dave, hi Jonathan, On Mar 21, 2015, at 17:09 , Dave Taht <dave.taht@gmail.com> wrote: > In terms of a cake feature request... > > fq_codel has a maximum number of packets limit, which is set very > large (10000) to accommodate 10GigE. It is arbitrarily patched down in > openwrt (1000), and reduced still further by the sqm-scripts (also > arbitrarily), to reduce the impact of a packet flood on machines with > very little memory. > > I would like cake to have a byte limit instead. Now, per packet > overhead in linux is very high, something like 256 extra bytes per > packet (4x1 vs the smallest size). However, a packet limit can be much > harder on memory than that - overhead can be as large as 64k per "packet" > on TSO/GSO enabled systems (dynamic range of 1x1000!), vs using a > byte limit which would only have issues with lots of small packets. I could be out to lunch here, as usual, but I argue the byte limit should include the kernel overhead (could this be based on skb->truesize?) as this is what counts against real memory. My assumption here is that in normal operation we rarely/never get queues to fill up to the limit anyways (as this would turn the queue into tail-drop effectively), but if we do we really need to account for the worst case (especially on home routers). What do you think? Best Regards Sebastian > > cake's bandwidth parameter can easily set a desirable max byte limit > at (say) 2 or 4x the BDP, and key off of that and not bother to track > a per packet limit. > > It would be nice for cake (without shaping enabled) to be able to > automatically sense the actual interface rate and size this outer > limit appropriately, but I don't think mechanisms exist to do that. > > > On Wed, Mar 18, 2015 at 2:20 PM, Dave Taht <dave.taht@gmail.com> wrote: >> >> >> On Wed, Mar 18, 2015 at 8:20 AM, Jonathan Morton <chromatix99@gmail.com> >> wrote: >>> >>> >>>> On 18 Mar, 2015, at 17:10, Kathleen Nichols <nichols@pollere.com> wrote: >>>> >>>> How are you relating target delay to bandwidth? >>> >>> Essentially, I use 5ms as a minimum, and increase it if necessary to >>> accommodate a couple of MTU-sized packets at the shaping rate. This keeps >>> things nicely under control at low bandwidths, and I find that cake remains >>> useful and usable even at 64Kbps (without making even the usual adjustments >>> to host or link configuration for such low speeds). >> >> >> In the cake2 (or maybe it was the unpublished cake3) version, I had a >> lighter weight version of the codel algorithm, that did not have a target >> parameter at all. Instead it just took the interval parameter and shifted it >> right 4 (yielding a target of 6.xms from an interval of 100ms) >> >> This saves on a memory access (and storage per queue!) , and I felt that any >> differences in behavior would be unnoticeable. And they were. This is also >> above the bound for cable-modem media access that greg white (rightly or >> wrongly) believed existed. So I have no problem in eliminating "target" >> entirely. >> >> Cake (without bandwidth shaping engaged) uses more cpu than fq_codel did and >> this was one of many optimizations I'd attempted (or successfully added). 
>> Cake with shaping is a bit less cpu than sqm-scripts htb + fq_codel + >> filters. >> >> It also looked like cake could be poured into gates, with a bit more >> research, and testing. >> >>> >>> I can do this in cake because the shaping rate is known, whereas the pure >>> codel and fq_codel qdiscs do not have reliable link-speed information. >> >> >> As for this bit, we seemed to need to account for a MTU's worth of data at >> the lower speeds, and I did not explore what fiddling with the interval and >> auto-calc-ing the target did at these speeds, as yet. >> >>> >>> - Jonathan Morton >>> >>> _______________________________________________ >>> Codel mailing list >>> Codel@lists.bufferbloat.net >>> https://lists.bufferbloat.net/listinfo/codel >> >> >> >> >> -- >> Dave Täht >> Let's make wifi fast, less jittery and reliable again! >> >> https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb > > > > -- > Dave Täht > Let's make wifi fast, less jittery and reliable again! > > https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb > _______________________________________________ > Codel mailing list > Codel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/codel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-03-22 9:39 ` Sebastian Moeller @ 2015-03-22 10:43 ` Jonathan Morton 2015-03-22 12:51 ` Sebastian Moeller 0 siblings, 1 reply; 22+ messages in thread From: Jonathan Morton @ 2015-03-22 10:43 UTC (permalink / raw) To: Sebastian Moeller; +Cc: codel > On 22 Mar, 2015, at 11:39, Sebastian Moeller <moeller0@gmx.de> wrote: > > I could be out to lunch here, as usual,;but I argue the byte limit should include the kernel overhead (could this be based on skb->truesize) as this is what counts against real memory. My assumption here is that in normal operation we rarely/never get queues to fill up to the limit anyways Such an argument could certainly be made. Does skb->truesize include the skb header, as well as the buffer space allocated? > (as tho would turn the queue into tail-drop effectively) But fq_codel (and cake) are a little cleverer than that, even when they hit the hard limit. They still drop from the head, and they shoot the longest flow-queue first. - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
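For illustration, a sketch of that byte-counting overflow behaviour, with invented names; it is written as a loop, although as discussed later in the thread the code at this point dropped a single packet per enqueue:

/* Drop from the head of the longest flow-queue when a byte limit is
 * exceeded.  Purely illustrative; not the actual fq_codel/cake code. */
#include <stdint.h>
#include <stddef.h>

struct flow_sketch {
    uint32_t backlog_bytes;    /* sum of skb->truesize in the real thing */
    /* ... packet list omitted ... */
};

static size_t fattest_flow(const struct flow_sketch *flows, size_t nflows)
{
    size_t fat = 0;

    for (size_t i = 1; i < nflows; i++)
        if (flows[i].backlog_bytes > flows[fat].backlog_bytes)
            fat = i;
    return fat;
}

/* Called after enqueuing each packet: while the total backlog exceeds the
 * configured byte limit, drop from the head of the longest queue. */
static void enforce_byte_limit(struct flow_sketch *flows, size_t nflows,
                               uint32_t *total_bytes, uint32_t limit_bytes,
                               uint32_t (*drop_head)(struct flow_sketch *))
{
    while (*total_bytes > limit_bytes) {
        struct flow_sketch *victim = &flows[fattest_flow(flows, nflows)];
        uint32_t freed = drop_head(victim);   /* returns truesize of the dropped packet */

        if (!freed)
            break;                            /* nothing left to drop */
        victim->backlog_bytes -= freed;
        *total_bytes -= freed;
    }
}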
* Re: [Codel] The next slice of cake 2015-03-22 10:43 ` Jonathan Morton @ 2015-03-22 12:51 ` Sebastian Moeller 2015-03-22 15:29 ` Jonathan Morton 2015-03-30 17:28 ` Jonathan Morton 0 siblings, 2 replies; 22+ messages in thread From: Sebastian Moeller @ 2015-03-22 12:51 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel Hi Jonathan, On Mar 22, 2015, at 11:43 , Jonathan Morton <chromatix99@gmail.com> wrote: > >> On 22 Mar, 2015, at 11:39, Sebastian Moeller <moeller0@gmx.de> wrote: >> >> I could be out to lunch here, as usual,;but I argue the byte limit should include the kernel overhead (could this be based on skb->truesize) as this is what counts against real memory. My assumption here is that in normal operation we rarely/never get queues to fill up to the limit anyways > > Such an argument could certainly be made. Does skb->truesize include the skb header, as well as the buffer space allocated? According to http://vger.kernel.org/~davem/skb_sk.html (“We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. This is the total of how large a data buffer we allocated for the packet, plus the size of 'struct sk_buff' itself.") it looks like this should be the right number, but I am not well versed in reading kernel code, so there might be some caveats I am not aware of. > >> (as tho would turn the queue into tail-drop effectively) > > But fq_codel (and cake) are a little cleverer than that, even when they hit the hard limit. They still drop from the head, and they shoot the longest flow-queue first. Excellent, learned something new today; in fq_codel does this come from the per-codel instance 1000 packet limit or from the default fq_codel 102400? packet limit (just in case someone knows off hand, I can try to understand the kernel code myself, given enough time ;) )? Best Regards Sebastian > > - Jonathan Morton > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-03-22 12:51 ` Sebastian Moeller @ 2015-03-22 15:29 ` Jonathan Morton 2015-03-30 17:28 ` Jonathan Morton 1 sibling, 0 replies; 22+ messages in thread From: Jonathan Morton @ 2015-03-22 15:29 UTC (permalink / raw) To: Sebastian Moeller; +Cc: codel >>> On 22 Mar, 2015, at 11:39, Sebastian Moeller <moeller0@gmx.de> wrote: >>> >>> I could be out to lunch here, as usual,;but I argue the byte limit should include the kernel overhead (could this be based on skb->truesize) as this is what counts against real memory. My assumption here is that in normal operation we rarely/never get queues to fill up to the limit anyways >> >> Such an argument could certainly be made. Does skb->truesize include the skb header, as well as the buffer space allocated? > > According to http://vger.kernel.org/~davem/skb_sk.html (“We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. This is the total of how large a data buffer we allocated for the packet, plus the size of 'struct sk_buff' itself.") it looks like this should be the right number, but I am not well versed in reading kernel code, so there might be some caveats I am not aware of. The statement in that commentary seems to be authoritative enough to rely on. >>> (as tho would turn the queue into tail-drop effectively) >> >> But fq_codel (and cake) are a little cleverer than that, even when they hit the hard limit. They still drop from the head, and they shoot the longest flow-queue first. > > Excellent, learned something new today; in fq_codel does this come from the per-codel instance 1000 packet limit or from the default fq_codel 102400? packet limit (just in case someone knows off hand, I can try to understand the kernel code myself, given enough time ;) )? Off the top of my head, most likely both. (If a per-queue limit is hit, then by definition that’s the longest queue.) I should probably re-read the code to be certain of that, though. I’m currently eyeball-deep in running software updates on half a dozen very obsolete machines, and cracking the cases on half a dozen routers to see what hardware they’ve got inside. So far I think *one* of them has actually matched what the OpenWRT database lists, and even then there’s an obvious error. - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-03-22 12:51 ` Sebastian Moeller 2015-03-22 15:29 ` Jonathan Morton @ 2015-03-30 17:28 ` Jonathan Morton 2015-03-30 18:23 ` Sebastian Moeller 2015-03-30 19:28 ` Dave Taht 1 sibling, 2 replies; 22+ messages in thread From: Jonathan Morton @ 2015-03-30 17:28 UTC (permalink / raw) To: Sebastian Moeller; +Cc: codel I made a lot of progress over the weekend, and got cake3 running on a sacrificial box today. No crashes yet! The set-associative hashing is in, currently hard-coded to 8-way, which should be more than adequate for typical home workloads (especially if BitTorrent is marked as background). I’m now adding statistics outputs so that we can all see how well it works under real loads, distinguishing between direct hits (the fast path, where the hash points directly to a correctly allocated queue), indirect hits (where a search through the set locates an allocated queue), misses (where an empty queue was allocated) and collisions (where all the queues were already occupied by other flows). Note that “allocation” in this case simply means updating the tag on the queue to match the packet, not allocating memory. These statistics should help with tuning the number of flow queues in each class. The new Diffserv logic is also in and apparently working well. I’m able to observe both the priority-balanced and bandwidth-balanced behaviours under appropriate circumstances. It is, however, fairly sensitive to the baseline quantum value from which the outer DRR weights are derived. Setting it to the MTU lets the high-priority classes burst too much, temporarily starving the lower classes. It seems to work a lot better using 256 as a baseline, but I hope the overhead of going around the DRR loop several times per packet isn’t too high. It might be sane to vary the baseline quantum based on the bandwidth setting, but we can experiment with that later. > We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. The buffer-occupancy queue limit is in, using the field quoted above. Previously, it would just count packets. As with previous versions of cake (and fq_codel), it checks the limit immediately after enqueuing each packet, and drops one packet from the head of the longest queue (by byte count) if it’s violated. Calculating the appropriate size to use here was an interesting problem. I start by estimating the BDP (from the shaper bandwidth and the Codel interval), quadrupling that and enforcing a lower limit of 64KB. If it’s set to “unlimited bandwidth”, I set the size to an arbitrary 1MB. Then I look at the packet limit, multiply that by the MTU and clamp the resulting size down to that - providing a straightforward mechanism to set a limit on memory usage. This all comes with a *fifth* almost-copy of codel.h, which incorporates the more efficient dequeue callback from codel4.h, but puts back the separate treatment of interval and target. Cake auto-tunes both of them to related values, but the relationship isn’t as simple as scaling one from the other, and I’m not comfortable with the mental contortions involved in the new reasoning. I also added a lookup table for the first few inverse square roots, because I found (using a spreadsheet) that these are very inaccurate if you only do one Newton iteration on them - something like 1.0, 0.500, 0.563, 0.488 instead of 1.0, 0.707, 0.577, 0.500. Nebulous problems with “insufficiently tight control” might possibly be traced to this. 
With the first fifteen entries calculated up-front using four iterations and stored, accuracy is vastly improved and, in practice, almost all square-root calculations will now reduce to a simple table lookup. The memory cost is trivial, and Newton is good enough for exceptionally large drop counts. Along with the hashing statistics, I’m also adding three measures of latency to replace the one from the first version. This no longer relies on data provided by Codel (so there’s no space penalty per flow for collecting it, and the inner loop in the stats dump is also eliminated), but performs the necessary calculation just after dequeuing each packet, storing the aggregated values per class instead of per flow. Each measure is an EWMA (using optimised weights which reduce to shifts and adds), but two of them are biased - one towards high latencies (so it’ll mostly track the fattest flows), and one towards low latencies (so it’ll mostly track sparse flows). Still lots of little things to do… - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
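To make the set-associative lookup and its statistics concrete, here is a rough sketch assuming an 8-way set and invented names (flow_slot, set_stats, flow_lookup); the real sch_cake code differs in detail:

/* 8-way set-associative flow lookup with hit/miss statistics.
 * Illustrative only; assumes nflows is a multiple of WAYS. */
#include <stdint.h>

#define WAYS 8

struct flow_slot {
    uint32_t tag;        /* flow hash currently "owning" this queue */
    uint32_t backlog;    /* bytes queued; 0 means the queue is idle */
};

struct set_stats {
    uint64_t direct_hits, indirect_hits, misses, collisions;
};

/* Map a packet's flow hash to a queue index within its set. */
static uint32_t flow_lookup(struct flow_slot *flows, uint32_t nflows,
                            uint32_t hash, struct set_stats *st)
{
    uint32_t set = (hash % (nflows / WAYS)) * WAYS;  /* first slot of the set */
    uint32_t i, idle = WAYS;

    /* Fast path: the directly indexed slot already carries our tag. */
    if (flows[set + (hash % WAYS)].tag == hash) {
        st->direct_hits++;
        return set + (hash % WAYS);
    }

    /* Search the set for our tag, remembering any idle slot on the way. */
    for (i = 0; i < WAYS; i++) {
        if (flows[set + i].tag == hash) {
            st->indirect_hits++;
            return set + i;
        }
        if (idle == WAYS && flows[set + i].backlog == 0)
            idle = i;
    }

    if (idle != WAYS) {
        /* "Allocation" is just re-tagging an idle queue; no memory changes hands. */
        flows[set + idle].tag = hash;
        st->misses++;
        return set + idle;
    }

    /* Every queue in the set is busy with some other flow: a genuine
     * collision, so fall back to sharing the directly indexed queue. */
    st->collisions++;
    return set + (hash % WAYS);
}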
* Re: [Codel] The next slice of cake 2015-03-30 17:28 ` Jonathan Morton @ 2015-03-30 18:23 ` Sebastian Moeller 2015-03-31 3:22 ` Jonathan Morton 2015-03-30 19:28 ` Dave Taht 1 sibling, 1 reply; 22+ messages in thread From: Sebastian Moeller @ 2015-03-30 18:23 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel On Mar 30, 2015, at 19:28 , Jonathan Morton <chromatix99@gmail.com> wrote: > I made a lot of progress over the weekend, and got cake3 running on a sacrificial box today. No crashes yet! > > The set-associative hashing is in, currently hard-coded to 8-way, which should be more than adequate for typical home workloads (especially if BitTorrent is marked as background). I’m now adding statistics outputs so that we can all see how well it works under real loads, distinguishing between direct hits (the fast path, where the hash points directly to a correctly allocated queue), indirect hits (where a search through the set locates an allocated queue), misses (where an empty queue was allocated) and collisions (where all the queues were already occupied by other flows). Note that “allocation” in this case simply means updating the tag on the queue to match the packet, not allocating memory. These statistics should help with tuning the number of flow queues in each class. > > The new Diffserv logic is also in and apparently working well. I’m able to observe both the priority-balanced and bandwidth-balanced behaviours under appropriate circumstances. It is, however, fairly sensitive to the baseline quantum value from which the outer DRR weights are derived. Setting it to the MTU lets the high-priority classes burst too much, temporarily starving the lower classes. It seems to work a lot better using 256 as a baseline, but I hope the overhead of going around the DRR loop several times per packet isn’t too high. It might be sane to vary the baseline quantum based on the bandwidth setting, but we can experiment with that later. > >> We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. > > The buffer-occupancy queue limit is in, using the field quoted above. Previously, it would just count packets. As with previous versions of cake (and fq_codel), it checks the limit immediately after enqueuing each packet, and drops one packet from the head of the longest queue (by byte count) if it’s violated. Does this mean the largest size the queue actually occupies is limit plus the last packet’s true size temporarily? What about the case a full MTU packet comes in while the eligible packet at the head of the longest queue is say 64 byte? > > Calculating the appropriate size to use here was an interesting problem. I start by estimating the BDP (from the shaper bandwidth and the Codel interval), quadrupling that and enforcing a lower limit of 64KB. If it’s set to “unlimited bandwidth”, I set the size to an arbitrary 1MB. Then I look at the packet limit, multiply that by the MTU and clamp the resulting size down to that - providing a straightforward mechanism to set a limit on memory usage. > > This all comes with a *fifth* almost-copy of codel.h, which incorporates the more efficient dequeue callback from codel4.h, but puts back the separate treatment of interval and target. Cake auto-tunes both of them to related values, but the relationship isn’t as simple as scaling one from the other, and I’m not comfortable with the mental contortions involved in the new reasoning. 
I thought the main argument was that interval needs to be roughly in the right range and target is supposed to be between 5 and 10 percent of interval. > > I also added a lookup table for the first few inverse square roots, because I found (using a spreadsheet) that these are very inaccurate if you only do one Newton iteration on them - something like 1.0, 0.500, 0.563, 0.488 instead of 1.0, 0.707, 0.577, 0.500. Nebulous problems with “insufficiently tight control” might possibly be traced to this. With the first fifteen entries calculated up-front using four iterations and stored, accuracy is vastly improved and, in practice, almost all square-root calculations will now reduce to a simple table lookup. The memory cost is trivial, and Newton is good enough for exceptionally large drop counts. > > Along with the hashing statistics, I’m also adding three measures of latency to replace the one from the first version. This no longer relies on data provided by Codel (so there’s no space penalty per flow for collecting it, and the inner loop in the stats dump is also eliminated), but performs the necessary calculation just after dequeuing each packet, storing the aggregated values per class instead of per flow. Each measure is an EWMA (using optimised weights which reduce to shifts and adds), but two of them are biased - one towards high latencies (so it’ll mostly track the fattest flows), and one towards low latencies (so it’ll mostly track sparse flows). > > Still lots of little things to do… > > - Jonathan Morton Wahoo, this sounds great. I can’t wait to do a "test-drive" with this ;) (once you are looking for testers, let me know) Best Regards Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
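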
* Re: [Codel] The next slice of cake 2015-03-30 18:23 ` Sebastian Moeller @ 2015-03-31 3:22 ` Jonathan Morton 2015-03-31 7:12 ` Sebastian Moeller 0 siblings, 1 reply; 22+ messages in thread From: Jonathan Morton @ 2015-03-31 3:22 UTC (permalink / raw) To: Sebastian Moeller; +Cc: codel > On 30 Mar, 2015, at 21:23, Sebastian Moeller <moeller0@gmx.de> wrote: > >>> We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. >> >> The buffer-occupancy queue limit is in, using the field quoted above. Previously, it would just count packets. As with previous versions of cake (and fq_codel), it checks the limit immediately after enqueuing each packet, and drops one packet from the head of the longest queue (by byte count) if it’s violated. > > Does this mean the largest size the queue actually occupies is limit plus the last packet’s true size temporarily? What about the case a full MTU packet comes in while the eligible packet at the head of the longest queue is say 64 byte? Yes - I could make it capable of dropping multiple packets to cope better with the latter case. The former case is not a problem really, since the packet is already in memory when we’re given it to enqueue; to fix it we’d have to anticipate the largest possible arriving packet and pre-emptively leave space for it. > I thought the main argument was, that interval needs to be roughly in the right range and target is supposed to be between 5 to 10 percent of interval. Yes, but why? I have seen other arguments, both recently and otherwise, which modify target to accommodate the MAC grant latency of some medium, while interval is still described as a rough estimate of the RTT. The existence of both explanations gives me epic cognitive-dissonance headaches, which I can’t resolve *unless* I am able to vary target and interval independently. Since the cost to do so appears to be small, I put that facility back in. >>> We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. >> >> The buffer-occupancy queue limit is in, using the field quoted above. Previously, it would just count packets. As with previous versions of cake (and fq_codel), it checks the limit immediately after enqueuing each packet, and drops one packet from the head of the longest queue (by byte count) if it’s violated. > > did this include the overload protection fallback to dropping rather > than ecn-ing packets? Codel already has such a protection mechanism, but I haven’t yet converted it to byte mode. > We explored the sqrt cache idea in the very first implementations of > codel which you can revisit in the earliest, exciting days of the > existence of the codel mailing list. > > I am not allergic to trying it again. What we had found then was the > count variable was increasing without bound (with the original > published ns2 version), for which we substituted another modification > that works well at low bandwidths but less well at higher ones. > > My sole mathematical contribution to the invsqrt approximation in > linux codel was the discovery that newton's method had much larger > errors than 2% when run in reverse (eg, when decreasing count by more > than 1, as various forms of the algorithm have done while trying to > come up with a better estimate for the the resumption portion of the > algorithm) A quick check in that spreadsheet suggests that this inaccuracy only occurs for low values of count, in basically the same places as running it forward is also inaccurate. 
So having the pre-calculated lookup table for those first values works, and Newton is still used directly for larger counts if they occur. > Note that I tried hard in codel4 (I think it was) to reduce the > size of the codel stats to a single 32byte cache-line on 32 bit arches > - down from 48 or so bytes - and succeeded with one word to spare. I > am not allergic to keeping lots of statistics but we are also aiming > for speed on lame processors here… I’ve managed to add these statistics without increasing the size of the per-flow data, only the per-class data. The weakness of the CPUs involved is why I’m aiming for shifts and adds (which an ARM in particular can do extremely efficiently, but a MIPS isn’t hurt much either) rather than multiplies in the fast path. There’s a stray multiply in the hash function which I still want to excise. It might also be possible, later on, to add a kernel config switch to drop out the stats computations. It probably won’t make very much difference, but when every little helps... - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
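A sketch of the table-plus-Newton approach described above; the kernel's version works in 32-bit fixed point, so plain doubles and sqrt() are used here only to show the structure (the stored entries are described above as being computed with four Newton iterations; sqrt() stands in for that here):

/* Lookup table for small drop counts, Newton-Raphson refinement otherwise.
 * Illustrative only; not the actual codel.h code. */
#include <stdio.h>
#include <math.h>

#define TABLE_SIZE 16

static double invsqrt_table[TABLE_SIZE];

static void invsqrt_init(void)
{
    invsqrt_table[0] = 1.0;    /* a count of 0 never really occurs; 1.0 is a safe value */
    for (int count = 1; count < TABLE_SIZE; count++)
        invsqrt_table[count] = 1.0 / sqrt((double)count);
}

/* One Newton-Raphson step refining x ~ 1/sqrt(count):
 *   x' = x * (3 - count * x * x) / 2
 * Accurate enough when count changes by one between calls, but poor when it
 * jumps around at small counts; hence the table. */
static double invsqrt(double prev, unsigned count)
{
    if (count < TABLE_SIZE)
        return invsqrt_table[count];
    return prev * (3.0 - count * prev * prev) / 2.0;
}

int main(void)
{
    invsqrt_init();
    double x = invsqrt_table[TABLE_SIZE - 1];

    for (unsigned count = 1; count <= 20; count++) {
        x = invsqrt(x, count);
        printf("count=%2u  approx=%.4f  exact=%.4f\n",
               count, x, 1.0 / sqrt((double)count));
    }
    return 0;
}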
* Re: [Codel] The next slice of cake 2015-03-31 3:22 ` Jonathan Morton @ 2015-03-31 7:12 ` Sebastian Moeller 2015-03-31 12:47 ` Jonathan Morton 0 siblings, 1 reply; 22+ messages in thread From: Sebastian Moeller @ 2015-03-31 7:12 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel Hi Jonathan, On Mar 31, 2015, at 05:22 , Jonathan Morton <chromatix99@gmail.com> wrote: > >> On 30 Mar, 2015, at 21:23, Sebastian Moeller <moeller0@gmx.de> wrote: >> >>>> We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. >>> >>> The buffer-occupancy queue limit is in, using the field quoted above. Previously, it would just count packets. As with previous versions of cake (and fq_codel), it checks the limit immediately after enqueuing each packet, and drops one packet from the head of the longest queue (by byte count) if it’s violated. >> >> Does this mean the largest size the queue actually occupies is limit plus the last packet’s true size temporarily? What about the case a full MTU packet comes in while the eligible packet at the head of the longest queue is say 64 byte? > > Yes - I could make it capable of dropping multiple packets to cope better with the latter case. > > The former case is not a problem really, since the packet is already in memory when we’re given it to enqueue; to fix it we’d have to anticipate the largest possible arriving packet and pre-emptively leave space for it. Well, we at least need to drop as much data as we add at the end, otherwise a UDP flood can certainly make us exceed all memory (queue is filled with small packets, then a flood of full MTU packets arrives, boom). So on resource-starved machines it would be nice if we could set a “hard” limit on the memory consumption by cake (and yes, this leaves aside the buffers leading into cake). > >> I thought the main argument was, that interval needs to be roughly in the right range and target is supposed to be between 5 to 10 percent of interval. > > Yes, but why? I think this is explained well in sections 4.3 and 4.4 of https://datatracker.ietf.org/doc/draft-ietf-aqm-codel/?include_text=1 > > I have seen other arguments, both recently and otherwise, which modify target to accommodate the MAC grant latency of some medium, while interval is still described as a rough estimate of the RTT. According to Kathleen Nichols http://pollere.net/Pdfdocs/noteburstymacs.pdf the reasoning of the RFC draft should also apply to the MAC issue. I admit that in sqm-scripts I went the easy way and sized target to allow ~1.5 packets and then just added the target duration to interval, but according to Kathleen Nichols this will basically sacrifice more latency than necessary to reclaim some bandwidth. Adding the slow-link transfer time to the interval appealed to me, as effectively all RTTs are increased by that amount, but empirically it is target we need to fiddle with. I guess I will go and implement a smarter adjustment that first ramps target from 5 to 10% of interval (the RFC argues for target in the 5-10% range), and only then starts to increase interval, so that adjusted_interval = adjusted_target * 10. But increasing target above a full MTU transfer time is still necessary, otherwise people on slow links sacrifice (too?) much of their already scarce bandwidth. 
(I believe Dave argued that this problem does not really exist in the ns version of the ACM paper, but since ns2 is not a full model of the actual Linux kernel, the “do not drop for single packet queues” heuristic in codel does not seem to do the right thing in real life (I could easily have gotten this wrong, though)) I guess I should go measure and only then come back to this discussion. > > The existence of both explanations gives me epic cognitive-dissonance headaches, which I can’t resolve *unless* I am able to vary target and interval independently. Since the cost to do so appears to be small, I put that facility back in. I am all for manual overrides; I could also live with a mode where, if only one is given, the other is calculated according to the above ideas. > >>>> We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. >>> >>> The buffer-occupancy queue limit is in, using the field quoted above. Previously, it would just count packets. As with previous versions of cake (and fq_codel), it checks the limit immediately after enqueuing each packet, and drops one packet from the head of the longest queue (by byte count) if it’s violated. >> >> did this include the overload protection fallback to dropping rather >> than ecn-ing packets? > > Codel already has such a protection mechanism, but I haven’t yet converted it to byte mode. Ah, good, this is one more important step to make smallish routers survive a UDP flood; not that I think a router should work great during a flood/attack, but it certainly should not crash... > >> We explored the sqrt cache idea in the very first implementations of >> codel which you can revisit in the earliest, exciting days of the >> existence of the codel mailing list. >> >> I am not allergic to trying it again. What we had found then was the >> count variable was increasing without bound (with the original >> published ns2 version), for which we substituted another modification >> that works well at low bandwidths but less well at higher ones. >> >> My sole mathematical contribution to the invsqrt approximation in >> linux codel was the discovery that newton's method had much larger >> errors than 2% when run in reverse (eg, when decreasing count by more >> than 1, as various forms of the algorithm have done while trying to >> come up with a better estimate for the the resumption portion of the >> algorithm) > > A quick check in that spreadsheet suggests that this inaccuracy only occurs for low values of count, in basically the same places as running it forward is also inaccurate. So having the pre-calculated lookup table for those first values works, and Newton is still used directly for larger counts if they occur. > >> Note that I tried hard in codel4 (I think it was) to reduce the >> size of the codel stats to a single 32byte cache-line on 32 bit arches >> - down from 48 or so bytes - and succeeded with one word to spare. I >> am not allergic to keeping lots of statistics but we are also aiming >> for speed on lame processors here… > > I’ve managed to add these statistics without increasing the size of the per-flow data, only the per-class data. The weakness of the CPUs involved is why I’m aiming for shifts and adds (which an ARM in particular can do extremely efficiently, but a MIPS isn’t hurt much either) rather than multiplies in the fast path. There’s a stray multiply in the hash function which I still want to excise. 
> > It might also be possible, later on, to add a kernel config switch to drop out the stats computations. It probably won’t make very much difference, but when every little helps... > > - Jonathan Morton > Best Regards Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
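For concreteness, a rough sketch of the ramp Sebastian proposes above - first push target up within the 5-10% band, and only then stretch interval so it stays at ten times target. This is one reading of the proposal, not code from sqm-scripts or cake; the ~1.5-packet floor is borrowed from the sqm-scripts sizing mentioned above, and the names are assumptions.

    #include <stdint.h>

    /* mtu_time_us: time to serialise one full-size packet at the shaped
     * rate, in microseconds.  target_us/interval_us hold the defaults on
     * entry (e.g. 5000 and 100000) and the adjusted values on return. */
    static void adjust_codel_params(uint32_t mtu_time_us,
                                    uint32_t *target_us,
                                    uint32_t *interval_us)
    {
        uint32_t min_target = mtu_time_us + mtu_time_us / 2;   /* ~1.5 packets */

        if (min_target <= *interval_us / 20)
            return;                       /* default 5% target already suffices */

        if (min_target <= *interval_us / 10) {
            *target_us = min_target;      /* ramp within the 5-10% band */
            return;
        }

        /* Only now grow interval, keeping target at 10% of it. */
        *target_us = min_target;
        *interval_us = 10 * min_target;
    }

At 500 kbit/s, for example, a 1500-byte packet takes 24 ms to serialise, so target would become roughly 36 ms and interval 360 ms, rather than leaving interval at 100 ms and losing the 5-10% relationship.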
* Re: [Codel] The next slice of cake 2015-03-31 7:12 ` Sebastian Moeller @ 2015-03-31 12:47 ` Jonathan Morton 0 siblings, 0 replies; 22+ messages in thread From: Jonathan Morton @ 2015-03-31 12:47 UTC (permalink / raw) To: Sebastian Moeller; +Cc: codel > On 31 Mar, 2015, at 10:12, Sebastian Moeller <moeller0@gmx.de> wrote: > >>> I thought the main argument was, that interval needs to be roughly in the right range and target is supposed to be between 5 to 10 percent of interval. >> >> Yes, but why? > > I think this is explained well in sections 4.3 and 4.4 of https://datatracker.ietf.org/doc/draft-ietf-aqm-codel/?include_text=1 > >> I have seen other arguments, both recently and otherwise, which modify target to accommodate the MAC grant latency of some medium, while interval is still described as a rough estimate of the RTT. > > According to Kathleen Nichols http://pollere.net/Pdfdocs/noteburstymacs.pdf the reasoning of the RFC draft should also apply to the MAC issue. Cake's shaper has no significant access grant latency (typical scheduling latency in Linux is on the order of 1ms), so while the discussion of that case is enlightening, it isn’t relevant to my work right now. (Might become relevant in other contexts.) But having just re-read those papers carefully, the argument over target actually boils down to the fact that - at the time of writing - there was a special case which would avoid marking or dropping the final MTU’s worth of the queue, regardless of the target or latency settings. This is actually quoted in the I-D as the mechanism by which Codel scales down to 64Kbps. See "non-starvation drop inhibit”. That special case is *not* present in the more recent versions of Codel, including the version used for the original version of cake. Some versions of Codel have an alternative mechanism, where they special-case the last packet in the queue rather than the last MTU’s worth. But this doesn’t work very well for cake (or for fq_codel) which often delivers the last packet in that sub-queue while there are still other packets in the overall queue - and the latter is what is tested. As a result, I saw truly excessive marking if I selected a very slow shaping rate (eg. 64Kbps) without adjusting the target. The congestion window was already at minimum size, so such aggressive congestion signalling just resulted in reduced throughput and (when talking to servers without ECN) more retransmissions. So what cake does is to adjust the target to roughly match the behaviour of that special case, and then adjusts the interval to be compatible with the revised target and the assumed RTT. Since cake implements sophisticated flow isolation - which is getting more sophisticated with each version - a slightly higher sojourn time in fat flows isn’t a problem, because sparse flows get to bypass them. One of the things I really like about flow-queuing and ECN is that it enables smooth, consistent delivery of the various assets that make up a web page, and those benefits were noticeably compromised with cake2 (which left out those adjustments). >> Yes - I could make it capable of dropping multiple packets to cope better with the latter case. >> Codel already has such a protection mechanism, but I haven’t yet converted it to byte mode. > > Ah, good this is one more important step to make smallish routers survive a udp flood; not that I think a router should work great during a flood/attack, but it certainly should not crash… By the time I read this, I’d already implemented these. 
:-) Now to update iproute2 - again - to deal with the new statistics outputs, and I might be in a position to start sending patches. - Jonathan Morton ^ permalink raw reply [flat|nested] 22+ messages in thread
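As an illustration of the kind of tuning described above - emphatically a sketch of the idea, not sch_cake's actual code; the constants and names are assumptions - raise target to roughly what the old special case would have protected, about one-and-a-half MTUs' worth of serialisation time at the shaped rate, and push interval out by the same amount so the assumed-RTT component is preserved.

    #include <stdint.h>

    /* rate_bps: shaper rate in bits per second; times in microseconds.
     * target/interval start at the defaults (e.g. 5 ms / 100 ms). */
    static void tune_codel_for_rate(uint64_t rate_bps, uint32_t mtu_bytes,
                                    uint32_t *target_us, uint32_t *interval_us)
    {
        uint64_t mtu_time_us;

        if (!rate_bps)
            return;                     /* unlimited: keep the defaults */

        /* Serialisation time of ~1.5 full-size packets at this rate,
         * approximating the old "never signal on the last MTU's worth"
         * behaviour. */
        mtu_time_us = ((uint64_t)mtu_bytes * 8 * 1000000 * 3 / 2) / rate_bps;

        if (mtu_time_us > *target_us) {
            /* Raise target, and move interval out by the same amount so
             * the RTT-derived part of interval is unchanged. */
            *interval_us += (uint32_t)(mtu_time_us - *target_us);
            *target_us = (uint32_t)mtu_time_us;
        }
    }

At 64 kbit/s this gives a target of roughly 280 ms instead of 5 ms, which avoids the excessive marking described above while leaving behaviour at broadband rates untouched.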
* Re: [Codel] The next slice of cake 2015-03-30 17:28 ` Jonathan Morton 2015-03-30 18:23 ` Sebastian Moeller @ 2015-03-30 19:28 ` Dave Taht 2015-04-02 4:48 ` Jonathan Morton 1 sibling, 1 reply; 22+ messages in thread From: Dave Taht @ 2015-03-30 19:28 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel On Mon, Mar 30, 2015 at 10:28 AM, Jonathan Morton <chromatix99@gmail.com> wrote: > I made a lot of progress over the weekend, and got cake3 running on a sacrificial box today. No crashes yet! Preliminary patches always appreciated. I plan to do a build thursday for a test cycle this weekend of various things. > The set-associative hashing is in, currently hard-coded to 8-way, which should be more than adequate for typical home workloads (especially if BitTorrent is marked as background). I’m now adding statistics outputs so that we can all see how well it works under real loads, distinguishing between direct hits (the fast path, where the hash points directly to a correctly allocated queue), indirect hits (where a search through the set locates an allocated queue), misses (where an empty queue was allocated) and collisions (where all the queues were already occupied by other flows). Note that “allocation” in this case simply means updating the tag on the queue to match the packet, not allocating memory. These statistics should help with tuning the number of flow queues in each class. > > The new Diffserv logic is also in and apparently working well. I’m able to observe both the priority-balanced and bandwidth-balanced behaviours under appropriate circumstances. It is, however, fairly sensitive to the baseline quantum value from which the outer DRR weights are derived. Setting it to the MTU lets the high-priority classes burst too much, temporarily starving the lower classes. It seems to work a lot better using 256 as a baseline, but I hope the overhead of going around the DRR loop several times per packet isn’t too high. It might be sane to vary the baseline quantum based on the bandwidth setting, but we can experiment with that later. > >> We keep track of how many bytes of system memory are consumed by a packet in 'skb->truesize'. > > The buffer-occupancy queue limit is in, using the field quoted above. Previously, it would just count packets. As with previous versions of cake (and fq_codel), it checks the limit immediately after enqueuing each packet, and drops one packet from the head of the longest queue (by byte count) if it’s violated. did this include the overload protection fallback to dropping rather than ecn-ing packets? > Calculating the appropriate size to use here was an interesting problem. I start by estimating the BDP (from the shaper bandwidth and the Codel interval), quadrupling that and enforcing a lower limit of 64KB. If it’s set to “unlimited bandwidth”, I set the size to an arbitrary 1MB. Then I look at the packet limit, multiply that by the MTU and clamp the resulting size down to that - providing a straightforward mechanism to set a limit on memory usage. Hmm.. > > This all comes with a *fifth* almost-copy of codel.h, which incorporates the more efficient dequeue callback from codel4.h, but puts back the separate treatment of interval and target. Cake auto-tunes both of them to related values, but the relationship isn’t as simple as scaling one from the other, and I’m not comfortable with the mental contortions involved in the new reasoning. k. will test. 
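Going back to the 8-way set-associative hashing described at the top of this message, here is a minimal sketch of the lookup together with the four counters mentioned (direct hits, indirect hits, misses, collisions). The table layout, sizes and names are assumptions for illustration, not sch_cake's actual data structures.

    #include <stdbool.h>
    #include <stdint.h>

    #define SET_WAYS   8
    #define NUM_QUEUES 1024            /* a multiple of SET_WAYS */

    struct flow_table {
        uint32_t tag[NUM_QUEUES];      /* flow hash that owns each queue */
        bool     busy[NUM_QUEUES];     /* true while the queue holds packets;
                                        * cleared elsewhere when it drains */
        uint64_t hits, indirect_hits, misses, collisions;
    };

    /* Map a flow hash to a queue index, 8-way set-associatively. */
    static uint32_t find_queue(struct flow_table *t, uint32_t flow_hash)
    {
        uint32_t idx  = flow_hash % NUM_QUEUES;
        uint32_t base = idx - (idx % SET_WAYS);

        /* Fast path: the directly indexed queue already belongs to us. */
        if (t->busy[idx] && t->tag[idx] == flow_hash) {
            t->hits++;
            return idx;
        }
        /* Search the rest of the set for a queue tagged with our hash. */
        for (uint32_t i = base; i < base + SET_WAYS; i++) {
            if (t->busy[i] && t->tag[i] == flow_hash) {
                t->indirect_hits++;
                return i;
            }
        }
        /* "Allocate" an idle queue by re-tagging it - no memory is allocated. */
        for (uint32_t i = base; i < base + SET_WAYS; i++) {
            if (!t->busy[i]) {
                t->tag[i]  = flow_hash;
                t->busy[i] = true;
                t->misses++;
                return i;
            }
        }
        /* Whole set is occupied by other flows: fall back to sharing. */
        t->collisions++;
        return idx;
    }

With sizes like these, a collision requires nine or more concurrent flows whose hashes all land in the same 8-queue set, which should be rare for typical home traffic.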
> I also added a lookup table for the first few inverse square roots, because I found (using a spreadsheet) that these are very inaccurate if you only do one Newton iteration on them - something like 1.0, 0.500, 0.563, 0.488 instead of 1.0, 0.707, 0.577, 0.500. Nebulous problems with “insufficiently tight control” might possibly be traced to this. With the first fifteen entries calculated up-front using four iterations and stored, accuracy is vastly improved and, in practice, almost all square-root calculations will now reduce to a simple table lookup. The memory cost is trivial, and Newton is good enough for exceptionally large drop counts. We explored the sqrt cache idea in the very first implementations of codel which you can revisit in the earliest, exciting days of the existence of the codel mailing list. I am not allergic to trying it again. What we had found then was the count variable was increasing without bound (with the original published ns2 version), for which we substituted another modification that works well at low bandwidths but less well at higher ones. My sole mathematical contribution to the invsqrt approximation in linux codel was the discovery that newton's method had much larger errors than 2% when run in reverse (eg, when decreasing count by more than 1, as various forms of the algorithm have done while trying to come up with a better estimate for the the resumption portion of the algorithm) When using alternate forms of recalculating count, I have usually run newton's twice to smooth out that error somewhat, but I am open to other alternatives. One thing I would like us to be looking at much harder this time, instead of the latency of the fq'd measurement flows, is the actual size of cwnd and other tcp behaviors from the packet captures, against modern tcps. Us perpetually publishing just the fq results in rrul is misleading some people I think. It would be nice to be sampling these across the course of a run also from netperf-wrapper but I don't know how to do that without web10g. > Along with the hashing statistics, I’m also adding three measures of latency to replace the one from the first version. This no longer relies on data provided by Codel (so there’s no space penalty per flow for collecting it, and the inner loop in the stats dump is also eliminated), but performs the necessary calculation just after dequeuing each packet, storing the aggregated values per class instead of per flow. Each measure is an EWMA (using optimised weights which reduce to shifts and adds), but two of them are biased - one towards high latencies (so it’ll mostly track the fattest flows), and one towards low latencies (so it’ll mostly track sparse flows). Nifty. Note that I tried hard in codel4 (I think it was) to reduce the size of the codel stats to a single 32byte cache-line on 32 bit arches - down from 48 or so bytes - and succeeded with one word to spare. I am not allergic to keeping lots of statistics but we are also aiming for speed on lame processors here.... > > Still lots of little things to do… > > - Jonathan Morton -- Dave Täht Let's make wifi fast, less jittery and reliable again! https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb ^ permalink raw reply [flat|nested] 22+ messages in thread
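On the latency EWMAs with shift-based weights and a deliberate bias: one simple way to get that effect is to use a different shift depending on which side of the current average a sample falls, as sketched below. The shift values and names are assumptions for illustration, not the weights cake actually uses.

    #include <stdint.h>

    /* Shift-and-add EWMA: avg += (sample - avg) >> shift.  A small shift
     * reacts quickly, a large shift slowly; using different shifts above
     * and below the current average biases the estimate up or down. */
    static uint64_t ewma_biased(uint64_t avg, uint64_t sample,
                                unsigned int up_shift, unsigned int down_shift)
    {
        if (sample > avg)
            avg += (sample - avg) >> up_shift;
        else
            avg -= (avg - sample) >> down_shift;
        return avg;
    }

    /* Biased high: jumps up quickly, decays slowly - tends to track the
     * sojourn times of the fattest flows. */
    static uint64_t ewma_peak(uint64_t avg, uint64_t sample_ns)
    {
        return ewma_biased(avg, sample_ns, 2, 8);
    }

    /* Biased low: the mirror image - tends to track sparse flows. */
    static uint64_t ewma_trough(uint64_t avg, uint64_t sample_ns)
    {
        return ewma_biased(avg, sample_ns, 8, 2);
    }

Each update costs only a subtraction, a shift and an add per packet, which fits the shifts-and-adds constraint mentioned elsewhere in the thread.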
* Re: [Codel] The next slice of cake 2015-03-30 19:28 ` Dave Taht @ 2015-04-02 4:48 ` Jonathan Morton 2015-04-02 5:17 ` Dave Taht 0 siblings, 1 reply; 22+ messages in thread From: Jonathan Morton @ 2015-04-02 4:48 UTC (permalink / raw) To: Dave Taht; +Cc: codel [-- Attachment #1: Type: text/plain, Size: 1168 bytes --] > On 30 Mar, 2015, at 22:28, Dave Taht <dave.taht@gmail.com> wrote: > >> I made a lot of progress over the weekend, and got cake3 running on a sacrificial box today. No crashes yet! > > Preliminary patches always appreciated. I plan to do a build thursday > for a test cycle this weekend of various things. Have a go with this, then. I took the time to do quite a lot of little cleanup things, such as suppressing the statistics output of classes that are not in use. Tests on the Pentium-MMX show similar throughput and latency capabilities to the first version of cake, but this version has more robust behaviour in high-traffic situations - and the Diffserv logic is also far more useful. I’m working towards getting OProfile running on at least one of my desktop-type machines; hopefully it’ll work sufficiently well on the Pentium-MMX, since that’s the easiest one to drive into saturation. The kernel patch applies on top of the previous one, but the iproute2 patch should apply to stock. Note that the current git version of iproute2 seems to have a problem with displaying the stats from an ingress filter. - Jonathan Morton [-- Attachment #2: cake3-take7.patch.gz --] [-- Type: application/x-gzip, Size: 14896 bytes --] [-- Attachment #3: iproute2-cake3.patch.gz --] [-- Type: application/x-gzip, Size: 5756 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-04-02 4:48 ` Jonathan Morton @ 2015-04-02 5:17 ` Dave Taht 2015-04-02 5:19 ` Dave Taht 0 siblings, 1 reply; 22+ messages in thread From: Dave Taht @ 2015-04-02 5:17 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel On Wed, Apr 1, 2015 at 9:48 PM, Jonathan Morton <chromatix99@gmail.com> wrote: > >> On 30 Mar, 2015, at 22:28, Dave Taht <dave.taht@gmail.com> wrote: >> >>> I made a lot of progress over the weekend, and got cake3 running on a sacrificial box today. No crashes yet! >> >> Preliminary patches always appreciated. I plan to do a build thursday >> for a test cycle this weekend of various things. > > Have a go with this, then. I took the time to do quite a lot of little cleanup things, such as suppressing the statistics output of classes that are not in use. > > Tests on the Pentium-MMX show similar throughput and latency capabilities to the first version of cake, but this version has more robust behaviour in high-traffic situations - and the Diffserv logic is also far more useful. I’m working towards getting OProfile running on at least one of my desktop-type machines; hopefully it’ll work sufficiently well on the Pentium-MMX, since that’s the easiest one to drive into saturation. Perf is included in every kernel these days; it is way better than oprofile, and easy to build if you are running natively. In your Linux build dir: cd tools; make # to get more info > The kernel patch applies on top of the previous one, but the iproute2 patch should apply to stock. Note that the current git version of iproute2 seems to have a problem with displaying the stats from an ingress filter. I think for the benefit of the audience a fresh patch would be better. I will see if this applies on top of my old patches. > > - Jonathan Morton > -- Dave Täht Let's make wifi fast, less jittery and reliable again! https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Codel] The next slice of cake 2015-04-02 5:17 ` Dave Taht @ 2015-04-02 5:19 ` Dave Taht 0 siblings, 0 replies; 22+ messages in thread From: Dave Taht @ 2015-04-02 5:19 UTC (permalink / raw) To: Jonathan Morton; +Cc: codel I note the backport to prior to 3.18 is no longer needed at this point, I have no intentions of going back to a kernel that old. Does anyone here care about keeping this working with anything older than 3.18? On Wed, Apr 1, 2015 at 10:17 PM, Dave Taht <dave.taht@gmail.com> wrote: > On Wed, Apr 1, 2015 at 9:48 PM, Jonathan Morton <chromatix99@gmail.com> wrote: >> >>> On 30 Mar, 2015, at 22:28, Dave Taht <dave.taht@gmail.com> wrote: >>> >>>> I made a lot of progress over the weekend, and got cake3 running on a sacrificial box today. No crashes yet! >>> >>> Preliminary patches always appreciated. I plan to do a build thursday >>> for a test cycle this weekend of various things. >> >> Have a go with this, then. I took the time to do quite a lot of little cleanup things, such as suppressing the statistics output of classes that are not in use. >> >> Tests on the Pentium-MMX show similar throughput and latency capabilities to the first version of cake, but this version has more robust behaviour in high-traffic situations - and the Diffserv logic is also far more useful. I’m working towards getting OProfile running on at least one of my desktop-type machines; hopefully it’ll work sufficiently well on the Pentium-MMX, since that’s the easiest one to drive into saturation. > > Perf is included in every kernel these days, it is way better than > oprofile, and easy to build if you are running natively. > in your linux build dir: > > cd tools > make # to get more info > > >> The kernel patch applies on top of the previous one, but the iproute2 patch should apply to stock. Note that the current git version of iproute2 seems to have a problem with displaying the stats from an ingress filter. > > I think for benefit of the audience a fresh patch would be better. I > will see if this applies on top of my old patches. > >> >> - Jonathan Morton >> > > > > -- > Dave Täht > Let's make wifi fast, less jittery and reliable again! > > https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb -- Dave Täht Let's make wifi fast, less jittery and reliable again! https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2015-04-02 5:19 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-03-17 20:08 [Codel] The next slice of cake Jonathan Morton 2015-03-18 7:22 ` Sebastian Moeller 2015-03-18 8:41 ` Jonathan Morton 2015-03-18 10:39 ` Sebastian Moeller 2015-03-18 15:10 ` Kathleen Nichols 2015-03-18 15:20 ` Jonathan Morton 2015-03-18 21:20 ` Dave Taht 2015-03-21 16:09 ` Dave Taht 2015-03-21 23:55 ` Jonathan Morton 2015-03-22 9:39 ` Sebastian Moeller 2015-03-22 10:43 ` Jonathan Morton 2015-03-22 12:51 ` Sebastian Moeller 2015-03-22 15:29 ` Jonathan Morton 2015-03-30 17:28 ` Jonathan Morton 2015-03-30 18:23 ` Sebastian Moeller 2015-03-31 3:22 ` Jonathan Morton 2015-03-31 7:12 ` Sebastian Moeller 2015-03-31 12:47 ` Jonathan Morton 2015-03-30 19:28 ` Dave Taht 2015-04-02 4:48 ` Jonathan Morton 2015-04-02 5:17 ` Dave Taht 2015-04-02 5:19 ` Dave Taht