* [Cerowrt-devel] Ubiquiti QOS @ 2014-05-25 6:17 Dane Medic 2014-05-25 14:23 ` Valdis.Kletnieks ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Dane Medic @ 2014-05-25 6:17 UTC (permalink / raw) To: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 138 bytes --] Is it true that devices with less than 64 MB can't handle QOS? -> https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html [-- Attachment #2: Type: text/html, Size: 258 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-25 6:17 [Cerowrt-devel] Ubiquiti QOS Dane Medic @ 2014-05-25 14:23 ` Valdis.Kletnieks 2014-05-25 15:42 ` Mikael Abrahamsson 2014-05-25 18:39 ` Sebastian Moeller 2 siblings, 0 replies; 30+ messages in thread From: Valdis.Kletnieks @ 2014-05-25 14:23 UTC (permalink / raw) To: Dane Medic; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 549 bytes --] On Sun, 25 May 2014 08:17:47 +0200, Dane Medic said: > Is it true that devices with less than 64 MB can't handle QOS? -> > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html I'm not going to give one post on a list very much credence, especially when it doesn't contain a single actual fact or definitive claim. An explanation of exactly which data structure won't fit in 32M would be ideal. Even some numbers on RAM usage from /proc/slabinfo and a "who ate all the frobozz slabs?" would be better than what's in the post. [-- Attachment #2: Type: application/pgp-signature, Size: 848 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-25 6:17 [Cerowrt-devel] Ubiquiti QOS Dane Medic 2014-05-25 14:23 ` Valdis.Kletnieks @ 2014-05-25 15:42 ` Mikael Abrahamsson 2014-05-25 20:00 ` dpreed 2014-05-25 18:39 ` Sebastian Moeller 2 siblings, 1 reply; 30+ messages in thread From: Mikael Abrahamsson @ 2014-05-25 15:42 UTC (permalink / raw) To: Dane Medic; +Cc: cerowrt-devel On Sun, 25 May 2014, Dane Medic wrote: > Is it true that devices with less than 64 MB can't handle QOS? -> > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = 125 megabytes/s, meaning for 50ms you need 6.25 megabytes of buffer. I also don't see why memory size would be relevant to QOS performance; I'd say forwarding performance has more to do with CPU speed than anything else. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 30+ messages in thread
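The 6.25 megabyte figure above is just bandwidth times delay. A minimal sketch of that arithmetic in Python (the 50 ms value is Mikael's rule of thumb; the other link rates are only illustrative):

    # Bytes of buffer needed to absorb delay_s worth of traffic at line rate.
    def buffer_bytes(link_bps, delay_s):
        return link_bps / 8 * delay_s

    for rate_mbps in (10, 100, 1000):
        b = buffer_bytes(rate_mbps * 1e6, 0.050)   # 50 ms rule of thumb
        print(f"{rate_mbps:5d} Mbit/s -> {b / 1e6:.3f} MB of buffer")
    # 1000 Mbit/s -> 6.250 MB, matching the figure quoted above.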
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-25 15:42 ` Mikael Abrahamsson @ 2014-05-25 20:00 ` dpreed 2014-05-26 0:18 ` Mikael Abrahamsson 2014-05-27 15:23 ` Jim Gettys 0 siblings, 2 replies; 30+ messages in thread From: dpreed @ 2014-05-25 20:00 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 6264 bytes --] Not that it is directly relevant, but there is no essential reason to require 50 ms. of buffering. That might be true of some particular QOS-related router algorithm. 50 ms. is about all one can tolerate in any router between source and destination for today's networks - an upper-bound rather than a minimum. The optimum buffer state for throughput is 1-2 packets worth - in other words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the input queue to the lowest speed link along the path) should have this much actually buffered. Buffering more than this increases end-to-end latency beyond its optimal state. Increased end-to-end latency reduces the effectiveness of control loops, creating more congestion. The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows where the bursts might persist for 50 ms. and then die. One reason for this is that source nodes run operating systems that tend to release packets in bursts. That's a whole other discussion - in an ideal world, source nodes would avoid bursty packet releases by letting the control by the receiver window be "tight" timing-wise. That is, to transmit a packet immediately at the instant an ACK arrives increasing the window. This would pace the flow - current OS's tend (due to scheduling mismatches) to send bursts of packets, "catching up" on sending that could have been spaced out and done earlier if the feedback from the receiver's window advancing were heeded. That is, endpoint network stacks (TCP implementations) can worsen congestion by "dallying". The ideal end-to-end flows occupying a congested router would have their packets paced so that the packets end up being sent in the least bursty manner that an application can support. The effect of this pacing is to move the "backlog" for each flow quickly into the source node for that flow, which then provides back pressure on the application driving the flow, which ultimately is necessary to stanch congestion. The ideal congestion control mechanism slows the sender part of the application to a pace that can go through the network without contributing to buffering. Current network stacks (including Linux's) don't achieve that goal - their pushback on application sources is minimal - instead they accumulate buffering internal to the network implementation. This contributes to end-to-end latency as well. But if you think about it, this is almost as bad as switch-level bufferbloat in terms of degrading user experience. The reason I say "almost" is that there are tools, rarely used in practice, that allow an application to specify that buffering should not build up in the network stack (in the kernel or wherever it is). But the default is not to use those APIs, and to buffer way too much. Remember, the network send stack can act similarly to a congested switch (it is a switch among all the user applications running on that node). IF there is a heavy file transfer, the file transfer's buffering acts to increase latency for all other networked communications on that machine. 
Traditionally this problem has been thought of only as a within-node fairness issue, but in fact it has a big effect on the switches in between source and destination due to the lack of dispersed pacing of the packets at the source - in other words, the current design does nothing to stem the "burst groups" from a single source mentioned above. So we do need the source nodes to implement less "bursty" sending stacks. This is especially true for multiplexed source nodes, such as web servers implementing thousands of flows. A combination of codel-style switch-level buffer management and the stack at the sender being implemented to spread packets in a particular TCP flow out over time would improve things a lot. To achieve best throughput, the optimal way to spread packets out on an end-to-end basis is to update the receive window (sending ACK) at the receive end as quickly as possible, and to respond to the updated receive window as quickly as possible when it increases. Just like the "bufferbloat" issue, the problem is caused by applications like streaming video, file transfers and big web pages that the application programmer sees as not having a latency requirement within the flow, so the application programmer does not have an incentive to control pacing. Thus the operating system has got to push back on the applications' flow somehow, so that the flow ends up paced once it enters the Internet itself. So there's no real problem caused by large buffering in the network stack at the endpoint, as long as the stack's delivery to the Internet is paced by some mechanism, e.g. tight management of receive window control on an end-to-end basis. I don't think this can be fixed by cerowrt, so this is out of place here. It's partially ameliorated by cerowrt, if it aggressively drops packets from flows that burst without pacing. fq_codel does this, if the buffer size it aims for is small - but the problem is that the OS stacks don't respond by pacing... they tend to respond by bursting, not because TCP doesn't provide the mechanisms for pacing, but because the OS stack doesn't transmit as soon as it is allowed to - thus building up a burst unnecessarily. Bursts on a flow are thus bad in general. They make congestion happen when it need not. On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <swmike@swm.pp.se> said: > On Sun, 25 May 2014, Dane Medic wrote: > > > Is it true that devices with less than 64 MB can't handle QOS? -> > > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html > > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. > > I also don't see why performance and memory size would be relevant, I'd > say forwarding performance has more to do with CPU speed than anything > else. > > -- > Mikael Abrahamsson email: swmike@swm.pp.se > _______________________________________________ > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/cerowrt-devel > [-- Attachment #2: Type: text/html, Size: 7754 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
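A rough sketch of the sizing and pacing idea in the message above: the data a flow should keep in flight is its share of the bottleneck times the round-trip time, and a sender that releases one packet per paced interval (rather than in scheduler-driven bursts) stays at that level. The 100 Mbit/s share and 20 ms RTT below are assumed values for illustration, not numbers from the post:

    # Per-flow "pipe size" and a paced send interval under assumed numbers.
    MTU = 1500                # bytes
    rtt_s = 0.020             # assumed physical round-trip time
    flow_share_bps = 100e6    # assumed share of the bottleneck link

    bytes_in_flight = flow_share_bps / 8 * rtt_s     # bandwidth-delay product
    packets_in_flight = bytes_in_flight / MTU
    pacing_interval_s = MTU * 8 / flow_share_bps     # one MTU per interval

    print(f"pipe size ~= {packets_in_flight:.0f} packets "
          f"({bytes_in_flight / 1e3:.0f} kB); one packet every "
          f"{pacing_interval_s * 1e6:.0f} us keeps the flow unbursty")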
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-25 20:00 ` dpreed @ 2014-05-26 0:18 ` Mikael Abrahamsson 2014-05-26 4:49 ` dpreed 2014-05-27 15:23 ` Jim Gettys 1 sibling, 1 reply; 30+ messages in thread From: Mikael Abrahamsson @ 2014-05-26 0:18 UTC (permalink / raw) To: cerowrt-devel On Sun, 25 May 2014, dpreed@reed.com wrote: > The optimum buffer state for throughput is 1-2 packets worth - in other > words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck No, the optimal state for throughput is to have huge buffers and have them filled. The optimal state for interactivity is to have very small buffers. FQ_CODEL tries to strike a balance between the two at 10ms of buffer. PIE does the same around 20ms. In order for PIE to work properly I'd say you need 50ms of buffering as a minimum; otherwise you're going to get 100% tail drop and multiple sequential drops occasionally (which might be desirable to keep interactivity good). My comment about 50ms is that you seldom need a lot more. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 30+ messages in thread
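For readers unfamiliar with the 10 ms/20 ms targets Mikael mentions, here is a toy sketch of the CoDel-style control law: drop or ECN-mark only when the time packets spend in the queue stays above a small target for a whole interval, then tighten the drop spacing by 1/sqrt(count). The constants below follow the 10 ms figure in the post and are illustrative, not the shipping defaults:

    from math import sqrt

    TARGET = 0.010      # 10 ms acceptable standing queue (per the post)
    INTERVAL = 0.100    # observation window before the first drop/mark

    def next_drop_time(now, drop_count):
        # While sojourn time stays above TARGET, successive drops come sooner:
        # INTERVAL, INTERVAL/sqrt(2), INTERVAL/sqrt(3), ...
        return now + INTERVAL / sqrt(drop_count)

    print([round(next_drop_time(0.0, n), 3) for n in (1, 2, 3)])   # [0.1, 0.071, 0.058]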
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-26 0:18 ` Mikael Abrahamsson @ 2014-05-26 4:49 ` dpreed 2014-05-26 13:02 ` Mikael Abrahamsson 0 siblings, 1 reply; 30+ messages in thread From: dpreed @ 2014-05-26 4:49 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 7031 bytes --] Len Kleinrock and his student proved that the "optimal" state for throughput in the internet is the 1-2 buffer case. It's easy to think this through... A simple intuition is that each node that qualifies as a "bottleneck" (meaning that the input rate exceeds the service rate of the outbound queue) will work optimally if it is in a double-buffering state - that is, a complete packet comes in for the outbound link during the time that the current packet goes out. That's topology independent. It's equivalent to saying that the number of packets in flight along a path in an optimal state between two endpoints is equal to the path's share of the bottleneck link's capacity times the physical minimum RTT for the MTU packet - the amount of "pipelining" that can be achieved along that path. Having more buffering can only make the throughput lower or at best the same. In other words, you might get the same throughput with more buffering, but more likely the extra buffering will make things worse. (The rare special case is the "hot rod" scenario of maximum end-to-end throughput with no competing flows.) The real congestion management issue, which I described, is the unnecessary "bunching" of packets in a flow. The bunching can be ameliorated at the source endpoint (or controlled by the receive endpoint transmitting an ack only when it receives a packet in the optimal state, but immediately responding to it to increase the responsiveness of the control loop: analogous to "impedance matching" in complex networks of transmission lines - bunching analogously corresponds to standing waves that reduce power transfer when impedance is not matched approximately. The maximum power transfer does not happen if some intermediate point includes a bad impedance match, storing energy that interferes with future energy transfer). Bunching has many causes, but it's better to remove it at the entry to the network than to allow it to clog up latency of competing flows. I'm deliberately not using queueing theory descriptions, because the queueing theory and control theory associated with networks that can drop packets and have finite buffering with end-to-end feedback congestion control is quite complex, especially for non-Poisson traffic - far beyond what is taught in elementary queueing theory. But if you want, I can dig that up for you. The problem of understanding the network congestion phenomenon as a whole is that one cannot carry over intuitions from a single, multi-hop, linear network of nodes to the global network congestion control problem. [The reason a CSMA/CD (wired) or CSMA/CA (wireless) Ethernet has "collision-driven" exponential-random back off is the same rationale - it's important to de-bunch the various flows that are competing for the Ethernet segment. The right level of randomness creates local de-bunching (or pacing) almost as effectively as a perfect, zero-latency admission control that knows the rates of all incoming queues. That is, when a packet ends, all senders with a packet ready to transmit do so. They all collide, and back off for different times - de-bunching the packet arrivals the next time around.
This may not achieve maximal throughput, however, because there's unnecessary blocking of newly arrived packets on the "backed-off" NICs - but fixing that is a different story, especially when the Ethernet is an internal path in the Internet as a whole - there you need some kind of buffer limiting on each NIC, and ideally to treat each "flow" as distinct "back-off" entity.] The same basic idea - using collision-based back-off to force competing flows to de-bunch themselves - and keeping the end-to-end feedback loops very, very short by avoiding any more than the optimal buffering, leads to a network that can operate at near-optimal throughput *and* near-optimal latency. This is what I've been calling in my own terminology, a "ballistic state" of the network - analogous to, but not the same as, a gaseous rather than a liquid or solid phase of matter. The least congested state that has the most fluidity, and therefore the highest throughput of individual molecules (rather than a liquid which transmits pressure very well, but does not transmit individual tagged molecules very fast at all). That's what Kleinrock and his student showed. Counterintuitive though it may seem. (It doesn't seem counterintuitive to me at all, but many, many network engineers are taught and continue to think that you need lots of buffering to maximize throughput). I conjecture that it's an achievable operating mode of the Internet based solely on end-to-end congestion-control algorithms, probably not very different from TCP, running over a network where each switch signals congestion to all flows passing through it. It's probably the most desirable operating mode, because it maximizes throughput while minimizing latency simultaneously. There's no inherent tradeoff between the two, except when you have control algorithms that don't deal with the key issues of bunching and unnecessarily slow feedback about "collision rate" among paths. It would be instructive to try to prove, in a rigorous way, that this phase of network operation cannot be achieved with an end-to-end control algorithm. The proof of achievability or the contrary proof of non-achievability have not been produced. But there is no doubt that this is an achievable operational state if you have an Oracle who knows all future traffic, and can globally schedule traffic: performance is optimal, with no buffering needed. Attempting to prove or disprove my conjecture would probably produce some really important insights. The insight we already have is that adding buffering cannot increase throughput once the bottleneck links are in "double-buffering" state. That state is "Pareto optimal" in a serious sense. On Sunday, May 25, 2014 8:18pm, "Mikael Abrahamsson" <swmike@swm.pp.se> said: > On Sun, 25 May 2014, dpreed@reed.com wrote: > > > The optimum buffer state for throughput is 1-2 packets worth - in other > > words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck > > No, the optimal state for througbut is to have huge buffers and have them > filled. The optimal state for interactivity is to have very small buffers. > FQ_CODEL tries to strike a balance between the two at 10ms of buffer. PIE > does the same around 20ms. In order for PIE to work properly I'd say you > need 50ms of buffering as a minimum, otherwise you're going to get 100% > tail drop and multiple sequential drops occasionally (which might be > desireable to keep interactivity good). > > My comment about 50ms is that you seldom need a lot more. 
> > -- > Mikael Abrahamsson email: swmike@swm.pp.se > _______________________________________________ > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/cerowrt-devel > [-- Attachment #2: Type: text/html, Size: 8961 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
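Since the message above leans on Ethernet's exponential-random back-off as the de-bunching analogy, here is a minimal sketch of that mechanism; the slot time is the classic 10 Mbit/s value and is used purely for illustration:

    import random

    SLOT_TIME = 51.2e-6     # classic 10 Mbit/s Ethernet slot time
    MAX_EXPONENT = 10

    def backoff_delay(consecutive_collisions):
        # After the k-th collision, wait a random number of slots in [0, 2^k - 1].
        k = min(consecutive_collisions, MAX_EXPONENT)
        return random.randint(0, 2**k - 1) * SLOT_TIME

    # Two colliding senders pick different delays with probability 1/2 after the
    # first collision and 3/4 after the second - the randomness does the pacing.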
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-26 4:49 ` dpreed @ 2014-05-26 13:02 ` Mikael Abrahamsson 2014-05-26 14:01 ` dpreed 0 siblings, 1 reply; 30+ messages in thread From: Mikael Abrahamsson @ 2014-05-26 13:02 UTC (permalink / raw) To: dpreed; +Cc: cerowrt-devel On Mon, 26 May 2014, dpreed@reed.com wrote: > Len Kleinrock and his student proved that the "optimal" state for > throughput in the internet is the 1-2 buffer case. It's easy to think > this through... Yes, but how do we achieve it? If you signal congestion when only a very small buffer depth is in use, TCP will back off and you will drain the buffer, meaning the link will be underutilized. This is great from an interactive point of view, but not so much for keeping the link actually at capacity without incurring excessive buffering latency. So you would like to see ECN drop=1 markings on all packets as soon as they're the 3rd (or deeper) packet in the buffer? Or if the packet doesn't have ECN markings, you'd just drop it? I doubt this will create a beneficial system for the end user; it sounds like it focuses too much on interactivity and too little on throughput. I just don't buy your statement that adding buffers won't increase throughput. If you're optimizing for throughput, then you let a single session use 1 second of buffering, meaning you know for a fact that the link is always going to be used at 100%. This totally kills interactivity, but it's still more throughput-efficient than having 2-packet-deep buffers, where you're very likely to drain these two packets and then have no packets in the buffer, meaning the link will be underutilized. So, I'd agree that a lot of the time you need very little buffering, but stating that you need a buffer 2 packets deep regardless of speed, well, I don't see how that would work. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 30+ messages in thread
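Read literally, the policy Mikael is questioning would look roughly like the sketch below. The two-packet threshold is just one interpretation of "3rd (or deeper) packet", not an established algorithm:

    from collections import deque

    MARK_THRESHOLD = 2          # packets already queued before marking/dropping
    queue = deque()

    def enqueue(packet, ect_capable):
        if len(queue) >= MARK_THRESHOLD:
            if not ect_capable:
                return "dropped"            # no ECN support: drop instead
            queue.append(packet)
            return "queued, CE-marked"      # signal congestion to the sender
        queue.append(packet)
        return "queued"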
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-26 13:02 ` Mikael Abrahamsson @ 2014-05-26 14:01 ` dpreed 2014-05-26 14:11 ` Mikael Abrahamsson 0 siblings, 1 reply; 30+ messages in thread From: dpreed @ 2014-05-26 14:01 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 1171 bytes --] On Monday, May 26, 2014 9:02am, "Mikael Abrahamsson" <swmike@swm.pp.se> said: > So, I'd agree that a lot of the time you need very little buffers, but > stating you need a buffer of 2 packets deep regardless of speed, well, I > don't see how that would work. > My main point is that looking to increased buffering to achieve throughput while maintaining latency is not that helpful, and often causes more harm than good. There are alternatives to buffering that can be managed more dynamically (managing bunching and something I didn't mention - spreading flows or packets within flows across multiple routes when a bottleneck appears - are some of them). I would look to queue minimization rather than "queue management" (which implied queues are often long) as a goal, and think harder about the end-to-end problem of minimizing total end-to-end queueing delay while maximizing throughput. It's clearly a totally false tradeoff between throughput and latency - in the IP framework. There is no such tradeoff for the operating point. There may be such a tradeoff for certain specific implementations of TCP, but that's not fixed in stone. [-- Attachment #2: Type: text/html, Size: 1666 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-26 14:01 ` dpreed @ 2014-05-26 14:11 ` Mikael Abrahamsson 2014-05-26 15:31 ` David P. Reed 0 siblings, 1 reply; 30+ messages in thread From: Mikael Abrahamsson @ 2014-05-26 14:11 UTC (permalink / raw) To: dpreed; +Cc: cerowrt-devel On Mon, 26 May 2014, dpreed@reed.com wrote: > I would look to queue minimization rather than "queue management" (which > implied queues are often long) as a goal, and think harder about the > end-to-end problem of minimizing total end-to-end queueing delay while > maximizing throughput. As far as I can tell, this is exactly what CODEL and PIE tries to do. They try to find a decent tradeoff between having queues to make sure the pipe is filled, and not making these queues big enough to seriously affect interactive performance. The latter part looks like what LEDBAT does? <http://tools.ietf.org/html/rfc6817> Or are you thinking about something else? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-26 14:11 ` Mikael Abrahamsson @ 2014-05-26 15:31 ` David P. Reed 2014-05-27 21:19 ` David Lang 0 siblings, 1 reply; 30+ messages in thread From: David P. Reed @ 2014-05-26 15:31 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 1446 bytes --] Codel and PIE are excellent first steps... but I don't think they are the best eventual approach. I want to see them deployed ASAP in CMTS' s and server load balancing networks... it would be a disaster to not deploy the far better option we have today immediately at the point of most leverage. The best is the enemy of the good. But, the community needs to learn once and for all that throughput and latency do not trade off. We can in principle get far better latency while maintaining high throughput.... and we need to start thinking about that. That means that the framing of the issue as AQM is counterproductive. On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> wrote: >On Mon, 26 May 2014, dpreed@reed.com wrote: > >> I would look to queue minimization rather than "queue management" >(which >> implied queues are often long) as a goal, and think harder about the >> end-to-end problem of minimizing total end-to-end queueing delay >while >> maximizing throughput. > >As far as I can tell, this is exactly what CODEL and PIE tries to do. >They >try to find a decent tradeoff between having queues to make sure the >pipe >is filled, and not making these queues big enough to seriously affect >interactive performance. > >The latter part looks like what LEDBAT does? ><http://tools.ietf.org/html/rfc6817> > >Or are you thinking about something else? -- Sent from my Android device with K-@ Mail. Please excuse my brevity. [-- Attachment #2: Type: text/html, Size: 2044 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-26 15:31 ` David P. Reed @ 2014-05-27 21:19 ` David Lang 2014-05-27 22:00 ` Dave Taht 0 siblings, 1 reply; 30+ messages in thread From: David Lang @ 2014-05-27 21:19 UTC (permalink / raw) To: David P. Reed; +Cc: cerowrt-devel [-- Attachment #1: Type: TEXT/Plain, Size: 2311 bytes --] the problem is that paths change, they mix traffic from streams, and in other ways the utilization of the links can change radically in a short amount of time. If you try to limit things to exactly the ballistic throughput, you are not going to be able to exactly maintain this state, you are either going to overshoot (too much traffic, requiring dropping packets to maintain your minimal buffer), or you are going to undershoot (too little traffic and your connection is idle) Since you can't predict all the competing traffic throughout the Internet, if you want to maximize throughput, you want to buffer as much as you can tolerate for latency reasons. For most apps, this is more than enough to cause problems for other connections. David Lang On Mon, 26 May 2014, David P. Reed wrote: > Codel and PIE are excellent first steps... but I don't think they are the best > eventual approach. I want to see them deployed ASAP in CMTS' s and server > load balancing networks... it would be a disaster to not deploy the far better > option we have today immediately at the point of most leverage. The best is > the enemy of the good. > > But, the community needs to learn once and for all that throughput and latency > do not trade off. We can in principle get far better latency while maintaining > high throughput.... and we need to start thinking about that. That means that > the framing of the issue as AQM is counterproductive. > > On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> wrote: >> On Mon, 26 May 2014, dpreed@reed.com wrote: >> >>> I would look to queue minimization rather than "queue management" >> (which >>> implied queues are often long) as a goal, and think harder about the >>> end-to-end problem of minimizing total end-to-end queueing delay >> while >>> maximizing throughput. >> >> As far as I can tell, this is exactly what CODEL and PIE tries to do. >> They >> try to find a decent tradeoff between having queues to make sure the >> pipe >> is filled, and not making these queues big enough to seriously affect >> interactive performance. >> >> The latter part looks like what LEDBAT does? >> <http://tools.ietf.org/html/rfc6817> >> >> Or are you thinking about something else? > > -- Sent from my Android device with K-@ Mail. Please excuse my brevity. [-- Attachment #2: Type: TEXT/PLAIN, Size: 164 bytes --] _______________________________________________ Cerowrt-devel mailing list Cerowrt-devel@lists.bufferbloat.net https://lists.bufferbloat.net/listinfo/cerowrt-devel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-27 21:19 ` David Lang @ 2014-05-27 22:00 ` Dave Taht 2014-05-27 23:27 ` David Lang 0 siblings, 1 reply; 30+ messages in thread From: Dave Taht @ 2014-05-27 22:00 UTC (permalink / raw) To: David Lang; +Cc: cerowrt-devel There is a phrase in this thread that is beginning to bother me. "Throughput". Everyone assumes that throughput is a big goal - and it certainly is - and latency is also a big goal - and it certainly is - but specifying what you want from "throughput" as a compromise with latency is not the right thing... If what you want is actually "high speed in-order packet delivery" - say, for example, a movie, a video conference, or youtube - excessive latency with high throughput really, really makes in-order packet delivery at high speed tough. You eventually lose a packet, and you have to wait a really long time until a replacement arrives. Stuart and I showed that at the last IETF. And you get the classic "buffering" song playing.... Low latency makes recovery from a loss in an in-order stream much, much faster. Honestly, for most applications on the web, what you want is high speed in-order packet delivery, not "bulk throughput". There is a whole class of apps (bittorrent, file transfer) that don't need that, and we have protocols for those.... On Tue, May 27, 2014 at 2:19 PM, David Lang <david@lang.hm> wrote: > the problem is that paths change, they mix traffic from streams, and in > other ways the utilization of the links can change radically in a short > amount of time. > > If you try to limit things to exactly the ballistic throughput, you are not > going to be able to exactly maintain this state, you are either going to > overshoot (too much traffic, requiring dropping packets to maintain your > minimal buffer), or you are going to undershoot (too little traffic and your > connection is idle) > > Since you can't predict all the competing traffic throughout the Internet, > if you want to maximize throughput, you want to buffer as much as you can > tolerate for latency reasons. For most apps, this is more than enough to > cause problems for other connections. > > David Lang > > > On Mon, 26 May 2014, David P. Reed wrote: > >> Codel and PIE are excellent first steps... but I don't think they are the >> best eventual approach. I want to see them deployed ASAP in CMTS' s and >> server load balancing networks... it would be a disaster to not deploy the >> far better option we have today immediately at the point of most leverage. >> The best is the enemy of the good. >> >> But, the community needs to learn once and for all that throughput and >> latency do not trade off. We can in principle get far better latency while >> maintaining high throughput.... and we need to start thinking about that. >> That means that the framing of the issue as AQM is counterproductive. >> >> On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> wrote: >>> >>> On Mon, 26 May 2014, dpreed@reed.com wrote: >>> >>>> I would look to queue minimization rather than "queue management" >>> >>> (which >>>> >>>> implied queues are often long) as a goal, and think harder about the >>>> end-to-end problem of minimizing total end-to-end queueing delay >>> >>> while >>>> >>>> maximizing throughput. >>> >>> >>> As far as I can tell, this is exactly what CODEL and PIE tries to do.
>>> They >>> try to find a decent tradeoff between having queues to make sure the >>> pipe >>> is filled, and not making these queues big enough to seriously affect >>> interactive performance. >>> >>> The latter part looks like what LEDBAT does? >>> <http://tools.ietf.org/html/rfc6817> >>> >>> Or are you thinking about something else? >> >> >> -- Sent from my Android device with K-@ Mail. Please excuse my brevity. > > > _______________________________________________ > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/cerowrt-devel > > _______________________________________________ > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/cerowrt-devel > -- Dave Täht NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-27 22:00 ` Dave Taht @ 2014-05-27 23:27 ` David Lang 2014-05-28 2:12 ` Dave Taht 0 siblings, 1 reply; 30+ messages in thread From: David Lang @ 2014-05-27 23:27 UTC (permalink / raw) To: Dave Taht; +Cc: cerowrt-devel On Tue, 27 May 2014, Dave Taht wrote: > There is a phrase in this thread that is begging to bother me. > > "Throughput". Everyone assumes that throughput is a big goal - and it > certainly is - and latency is also a big goal - and it certainly is - > but by specifying what you want from "throughput" as a compromise with > latency is not the right thing... > > If what you want is actually "high speed in-order packet delivery" - > say, for example a movie, > or a video conference, youtube, or a video conference - excessive > latency with high throughput, really, really makes in-order packet > delivery at high speed tough. the key word here is "excessive", that's why I said that for max throughput you want to buffer as much as your latency budget will allow you to. > You eventually lose a packet, and you have to wait a really long time > until a replacement arrives. Stuart and I showed that at last ietf. > And you get the classic "buffering" song playing.... Yep, and if you buffer too much, your "lost packet" is actually still in flight and eating bandwidth. David Lang > low latency makes recovery from a loss in an in-order stream much, much faster. > > Honestly, for most applications on the web, what you want is high > speed in-order packet delivery, not > "bulk throughput". There is a whole class of apps (bittorrent, file > transfer) that don't need that, and we > have protocols for those.... > > > > On Tue, May 27, 2014 at 2:19 PM, David Lang <david@lang.hm> wrote: >> the problem is that paths change, they mix traffic from streams, and in >> other ways the utilization of the links can change radically in a short >> amount of time. >> >> If you try to limit things to exactly the ballistic throughput, you are not >> going to be able to exactly maintain this state, you are either going to >> overshoot (too much traffic, requiring dropping packets to maintain your >> minimal buffer), or you are going to undershoot (too little traffic and your >> connection is idle) >> >> Since you can't predict all the competing traffic throughout the Internet, >> if you want to maximize throughput, you want to buffer as much as you can >> tolerate for latency reasons. For most apps, this is more than enough to >> cause problems for other connections. >> >> David Lang >> >> >> On Mon, 26 May 2014, David P. Reed wrote: >> >>> Codel and PIE are excellent first steps... but I don't think they are the >>> best eventual approach. I want to see them deployed ASAP in CMTS' s and >>> server load balancing networks... it would be a disaster to not deploy the >>> far better option we have today immediately at the point of most leverage. >>> The best is the enemy of the good. >>> >>> But, the community needs to learn once and for all that throughput and >>> latency do not trade off. We can in principle get far better latency while >>> maintaining high throughput.... and we need to start thinking about that. >>> That means that the framing of the issue as AQM is counterproductive. 
>>> >>> On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> wrote: >>>> >>>> On Mon, 26 May 2014, dpreed@reed.com wrote: >>>> >>>>> I would look to queue minimization rather than "queue management" >>>> >>>> (which >>>>> >>>>> implied queues are often long) as a goal, and think harder about the >>>>> end-to-end problem of minimizing total end-to-end queueing delay >>>> >>>> while >>>>> >>>>> maximizing throughput. >>>> >>>> >>>> As far as I can tell, this is exactly what CODEL and PIE tries to do. >>>> They >>>> try to find a decent tradeoff between having queues to make sure the >>>> pipe >>>> is filled, and not making these queues big enough to seriously affect >>>> interactive performance. >>>> >>>> The latter part looks like what LEDBAT does? >>>> <http://tools.ietf.org/html/rfc6817> >>>> >>>> Or are you thinking about something else? >>> >>> >>> -- Sent from my Android device with K-@ Mail. Please excuse my brevity. >> >> >> _______________________________________________ >> Cerowrt-devel mailing list >> Cerowrt-devel@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/cerowrt-devel >> >> _______________________________________________ >> Cerowrt-devel mailing list >> Cerowrt-devel@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/cerowrt-devel >> > > > > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-27 23:27 ` David Lang @ 2014-05-28 2:12 ` Dave Taht 2014-05-28 3:21 ` David Lang 0 siblings, 1 reply; 30+ messages in thread From: Dave Taht @ 2014-05-28 2:12 UTC (permalink / raw) To: David Lang; +Cc: cerowrt-devel On Tue, May 27, 2014 at 4:27 PM, David Lang <david@lang.hm> wrote: > On Tue, 27 May 2014, Dave Taht wrote: > >> There is a phrase in this thread that is begging to bother me. >> >> "Throughput". Everyone assumes that throughput is a big goal - and it >> certainly is - and latency is also a big goal - and it certainly is - >> but by specifying what you want from "throughput" as a compromise with >> latency is not the right thing... >> >> If what you want is actually "high speed in-order packet delivery" - >> say, for example a movie, >> or a video conference, youtube, or a video conference - excessive >> latency with high throughput, really, really makes in-order packet >> delivery at high speed tough. > > > the key word here is "excessive", that's why I said that for max throughput > you want to buffer as much as your latency budget will allow you to. Again I'm trying to make a distinction between "throughput", and "packets delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think) The buffering should not be in-the-network, it can be in the application. Take our hypothetical video stream for example. I am 20ms RTT from netflix. If I artificially inflate that by adding 50ms of in-network buffering, that means a loss can take 120ms to recover from. If instead, I keep a 3*RTT buffer in my application, and expect that I have 5ms worth of network-buffering, instead, I recover from a loss in 40ms. (please note, it's late, I might not have got the math entirely right) As physical RTTs grow shorter, the advantages of smaller buffers grow larger. You don't need 50ms queueing delay on a 100us path. Many applications buffer for seconds due to needing to be at least 2*(actual buffering+RTT) on the path. > >> You eventually lose a packet, and you have to wait a really long time >> until a replacement arrives. Stuart and I showed that at last ietf. >> And you get the classic "buffering" song playing.... > > > Yep, and if you buffer too much, your "lost packet" is actually still in > flight and eating bandwidth. > > David Lang > > >> low latency makes recovery from a loss in an in-order stream much, much >> faster. >> >> Honestly, for most applications on the web, what you want is high >> speed in-order packet delivery, not >> "bulk throughput". There is a whole class of apps (bittorrent, file >> transfer) that don't need that, and we >> have protocols for those.... >> >> >> >> On Tue, May 27, 2014 at 2:19 PM, David Lang <david@lang.hm> wrote: >>> >>> the problem is that paths change, they mix traffic from streams, and in >>> other ways the utilization of the links can change radically in a short >>> amount of time. >>> >>> If you try to limit things to exactly the ballistic throughput, you are >>> not >>> going to be able to exactly maintain this state, you are either going to >>> overshoot (too much traffic, requiring dropping packets to maintain your >>> minimal buffer), or you are going to undershoot (too little traffic and >>> your >>> connection is idle) >>> >>> Since you can't predict all the competing traffic throughout the >>> Internet, >>> if you want to maximize throughput, you want to buffer as much as you can >>> tolerate for latency reasons. 
For most apps, this is more than enough to >>> cause problems for other connections. >>> >>> David Lang >>> >>> >>> On Mon, 26 May 2014, David P. Reed wrote: >>> >>>> Codel and PIE are excellent first steps... but I don't think they are >>>> the >>>> best eventual approach. I want to see them deployed ASAP in CMTS' s and >>>> server load balancing networks... it would be a disaster to not deploy >>>> the >>>> far better option we have today immediately at the point of most >>>> leverage. >>>> The best is the enemy of the good. >>>> >>>> But, the community needs to learn once and for all that throughput and >>>> latency do not trade off. We can in principle get far better latency >>>> while >>>> maintaining high throughput.... and we need to start thinking about >>>> that. >>>> That means that the framing of the issue as AQM is counterproductive. >>>> >>>> On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> wrote: >>>>> >>>>> >>>>> On Mon, 26 May 2014, dpreed@reed.com wrote: >>>>> >>>>>> I would look to queue minimization rather than "queue management" >>>>> >>>>> >>>>> (which >>>>>> >>>>>> >>>>>> implied queues are often long) as a goal, and think harder about the >>>>>> end-to-end problem of minimizing total end-to-end queueing delay >>>>> >>>>> >>>>> while >>>>>> >>>>>> >>>>>> maximizing throughput. >>>>> >>>>> >>>>> >>>>> As far as I can tell, this is exactly what CODEL and PIE tries to do. >>>>> They >>>>> try to find a decent tradeoff between having queues to make sure the >>>>> pipe >>>>> is filled, and not making these queues big enough to seriously affect >>>>> interactive performance. >>>>> >>>>> The latter part looks like what LEDBAT does? >>>>> <http://tools.ietf.org/html/rfc6817> >>>>> >>>>> Or are you thinking about something else? >>>> >>>> >>>> >>>> -- Sent from my Android device with K-@ Mail. Please excuse my brevity. >>> >>> >>> >>> _______________________________________________ >>> Cerowrt-devel mailing list >>> Cerowrt-devel@lists.bufferbloat.net >>> https://lists.bufferbloat.net/listinfo/cerowrt-devel >>> >>> _______________________________________________ >>> Cerowrt-devel mailing list >>> Cerowrt-devel@lists.bufferbloat.net >>> https://lists.bufferbloat.net/listinfo/cerowrt-devel >>> >> >> >> >> > -- Dave Täht NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article ^ permalink raw reply [flat|nested] 30+ messages in thread
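To put rough numbers on the effect described above (hedged: the exact recovery time depends on how the loss is detected, and the post's own figures are approximate):

    base_rtt = 0.020                 # 20 ms physical RTT, from the post
    for queue_delay in (0.050, 0.005):
        effective_rtt = base_rtt + queue_delay
        print(f"{queue_delay * 1e3:.0f} ms standing queue -> effective RTT "
              f"{effective_rtt * 1e3:.0f} ms, {effective_rtt / base_rtt:.1f}x "
              f"the physical path; every RTT-proportional step of loss "
              f"recovery is stretched by the same factor")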
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-28 2:12 ` Dave Taht @ 2014-05-28 3:21 ` David Lang 2014-05-28 15:52 ` dpreed 0 siblings, 1 reply; 30+ messages in thread From: David Lang @ 2014-05-28 3:21 UTC (permalink / raw) To: Dave Taht; +Cc: cerowrt-devel On Tue, 27 May 2014, Dave Taht wrote: > On Tue, May 27, 2014 at 4:27 PM, David Lang <david@lang.hm> wrote: >> On Tue, 27 May 2014, Dave Taht wrote: >> >>> There is a phrase in this thread that is begging to bother me. >>> >>> "Throughput". Everyone assumes that throughput is a big goal - and it >>> certainly is - and latency is also a big goal - and it certainly is - >>> but by specifying what you want from "throughput" as a compromise with >>> latency is not the right thing... >>> >>> If what you want is actually "high speed in-order packet delivery" - >>> say, for example a movie, >>> or a video conference, youtube, or a video conference - excessive >>> latency with high throughput, really, really makes in-order packet >>> delivery at high speed tough. >> >> >> the key word here is "excessive", that's why I said that for max throughput >> you want to buffer as much as your latency budget will allow you to. > > Again I'm trying to make a distinction between "throughput", and "packets > delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think) > > The buffering should not be in-the-network, it can be in the application. > > Take our hypothetical video stream for example. I am 20ms RTT from netflix. > If I artificially inflate that by adding 50ms of in-network buffering, > that means a loss can > take 120ms to recover from. > > If instead, I keep a 3*RTT buffer in my application, and expect that I have 5ms > worth of network-buffering, instead, I recover from a loss in 40ms. > > (please note, it's late, I might not have got the math entirely right) But you aren't going to be tuning the retry wait time per connection. What is the retry time that is set in your stack? It's something huge to survive international connections with satellite paths (so several seconds worth). If your server-to-eyeball buffering is shorter than this, you will get a window where you aren't fully utilizing the connection. So yes, I do think that if your purpose is to get the maximum possible in-order packets delivered, you end up making different decisions than if you are just trying to stream an HD video, or do other normal things. The problem is thinking that this absolute throughput is representative of normal use. > As physical RTTs grow shorter, the advantages of smaller buffers grow larger. > > You don't need 50ms queueing delay on a 100us path. > > Many applications buffer for seconds due to needing to be at least > 2*(actual buffering+RTT) on the path. For something like streaming video, there's nothing wrong with the application buffering aggressively (assuming you have the space to do so on the client side); the more you have gotten transmitted to the client, the longer it can survive a disruption of its network. There's nothing wrong with having an hour of buffered data between the server and the viewer's eyes. Now, this buffering should not be in the network devices; it should be in the client app, but this isn't because there's something wrong with buffering, it's just because the client device has so much more available space to hold stuff. David Lang >> >>> You eventually lose a packet, and you have to wait a really long time >>> until a replacement arrives. Stuart and I showed that at last ietf.
>>> And you get the classic "buffering" song playing.... >> >> >> Yep, and if you buffer too much, your "lost packet" is actually still in >> flight and eating bandwidth. >> >> David Lang >> >> >>> low latency makes recovery from a loss in an in-order stream much, much >>> faster. >>> >>> Honestly, for most applications on the web, what you want is high >>> speed in-order packet delivery, not >>> "bulk throughput". There is a whole class of apps (bittorrent, file >>> transfer) that don't need that, and we >>> have protocols for those.... >>> >>> >>> >>> On Tue, May 27, 2014 at 2:19 PM, David Lang <david@lang.hm> wrote: >>>> >>>> the problem is that paths change, they mix traffic from streams, and in >>>> other ways the utilization of the links can change radically in a short >>>> amount of time. >>>> >>>> If you try to limit things to exactly the ballistic throughput, you are >>>> not >>>> going to be able to exactly maintain this state, you are either going to >>>> overshoot (too much traffic, requiring dropping packets to maintain your >>>> minimal buffer), or you are going to undershoot (too little traffic and >>>> your >>>> connection is idle) >>>> >>>> Since you can't predict all the competing traffic throughout the >>>> Internet, >>>> if you want to maximize throughput, you want to buffer as much as you can >>>> tolerate for latency reasons. For most apps, this is more than enough to >>>> cause problems for other connections. >>>> >>>> David Lang >>>> >>>> >>>> On Mon, 26 May 2014, David P. Reed wrote: >>>> >>>>> Codel and PIE are excellent first steps... but I don't think they are >>>>> the >>>>> best eventual approach. I want to see them deployed ASAP in CMTS' s and >>>>> server load balancing networks... it would be a disaster to not deploy >>>>> the >>>>> far better option we have today immediately at the point of most >>>>> leverage. >>>>> The best is the enemy of the good. >>>>> >>>>> But, the community needs to learn once and for all that throughput and >>>>> latency do not trade off. We can in principle get far better latency >>>>> while >>>>> maintaining high throughput.... and we need to start thinking about >>>>> that. >>>>> That means that the framing of the issue as AQM is counterproductive. >>>>> >>>>> On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> wrote: >>>>>> >>>>>> >>>>>> On Mon, 26 May 2014, dpreed@reed.com wrote: >>>>>> >>>>>>> I would look to queue minimization rather than "queue management" >>>>>> >>>>>> >>>>>> (which >>>>>>> >>>>>>> >>>>>>> implied queues are often long) as a goal, and think harder about the >>>>>>> end-to-end problem of minimizing total end-to-end queueing delay >>>>>> >>>>>> >>>>>> while >>>>>>> >>>>>>> >>>>>>> maximizing throughput. >>>>>> >>>>>> >>>>>> >>>>>> As far as I can tell, this is exactly what CODEL and PIE tries to do. >>>>>> They >>>>>> try to find a decent tradeoff between having queues to make sure the >>>>>> pipe >>>>>> is filled, and not making these queues big enough to seriously affect >>>>>> interactive performance. >>>>>> >>>>>> The latter part looks like what LEDBAT does? >>>>>> <http://tools.ietf.org/html/rfc6817> >>>>>> >>>>>> Or are you thinking about something else? >>>>> >>>>> >>>>> >>>>> -- Sent from my Android device with K-@ Mail. Please excuse my brevity. 
>>>> >>>> >>>> >>>> _______________________________________________ >>>> Cerowrt-devel mailing list >>>> Cerowrt-devel@lists.bufferbloat.net >>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel >>>> >>>> _______________________________________________ >>>> Cerowrt-devel mailing list >>>> Cerowrt-devel@lists.bufferbloat.net >>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel >>>> >>> >>> >>> >>> >> > > > > ^ permalink raw reply [flat|nested] 30+ messages in thread
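On the "retry wait time" point raised above: a TCP stack's retransmission timeout is not a fixed per-connection knob; it tracks the measured RTT, roughly along the lines of RFC 6298. A compressed sketch (the 1-second floor is the RFC's recommendation; real stacks differ, e.g. Linux uses a smaller minimum):

    ALPHA, BETA = 1 / 8, 1 / 4       # RFC 6298 smoothing gains
    MIN_RTO = 1.0                    # RFC 6298 recommends at least 1 second

    def update_rto(srtt, rttvar, sample):
        if srtt is None:             # first RTT measurement
            srtt, rttvar = sample, sample / 2
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
            srtt = (1 - ALPHA) * srtt + ALPHA * sample
        return srtt, rttvar, max(MIN_RTO, srtt + 4 * rttvar)

    # A satellite path with ~600 ms samples settles to an RTO of a second or two;
    # a 20 ms terrestrial path sits at the configured floor.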
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-28 3:21 ` David Lang @ 2014-05-28 15:52 ` dpreed 2014-05-28 16:34 ` David Lang 0 siblings, 1 reply; 30+ messages in thread From: dpreed @ 2014-05-28 15:52 UTC (permalink / raw) To: David Lang; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 9623 bytes --] Interesting conversation. A particular switch has no idea of the "latency budget" of a particular flow - so it cannot have its *own* latency budget. The switch designer has no choice but to assume that his latency budget is near zero. The number of packets that should be sustained in flight to maintain maximum throughput between the source (entry) switch and destination (exit) switch of the flow need be no higher than the flow's share of bandwidth of the bottleneck multiplied by the end-to-end delay (including packet forwarding, but not queueing). All buffering needed for isochrony ("jitter buffer") and "alternative path selection" can be moved to either before the entry switch or after the exit switch. If you have multiple simultaneous paths, the number of packets in flight involves replacing "bandwidth of the bottleneck" with "aggregate bandwidth across the minimum cut-set of the chosen paths used for the flow". Of course, these are dynamic - "the flow's share" and "paths used for the flow" change over short time scales. That's why you have a control loop that needs to measure them. The whole point of minimizing buffering is to make the measurements more timely and the control inputs more timely. This is not about convergence to an asymptote.... A network where every internal buffer is driven hard toward zero makes it possible to handle multiple paths, alternate paths, etc. more *easily*. That's partly because you allow endpoints to see what is happening to their flows more quickly so they can compensate. And of course for shared wireless resources, things change more quickly because of new factors - more sharing, more competition for collision-free slots, varying transmission rates, etc. The last thing you want is long-term standing waves caused by large buffers and very loose control. On Tuesday, May 27, 2014 11:21pm, "David Lang" <david@lang.hm> said: > On Tue, 27 May 2014, Dave Taht wrote: > > > On Tue, May 27, 2014 at 4:27 PM, David Lang <david@lang.hm> wrote: > >> On Tue, 27 May 2014, Dave Taht wrote: > >> > >>> There is a phrase in this thread that is begging to bother me. > >>> > >>> "Throughput". Everyone assumes that throughput is a big goal - and > it > >>> certainly is - and latency is also a big goal - and it certainly is > - > >>> but by specifying what you want from "throughput" as a compromise > with > >>> latency is not the right thing... > >>> > >>> If what you want is actually "high speed in-order packet delivery" - > >>> say, for example a movie, > >>> or a video conference, youtube, or a video conference - excessive > >>> latency with high throughput, really, really makes in-order packet > >>> delivery at high speed tough. > >> > >> > >> the key word here is "excessive", that's why I said that for max > throughput > >> you want to buffer as much as your latency budget will allow you to. > > > > Again I'm trying to make a distinction between "throughput", and "packets > > delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think) > > > > The buffering should not be in-the-network, it can be in the application. > > > > Take our hypothetical video stream for example. I am 20ms RTT from netflix. 
> > If I artificially inflate that by adding 50ms of in-network buffering, > > that means a loss can > > take 120ms to recover from. > > > > If instead, I keep a 3*RTT buffer in my application, and expect that I have > 5ms > > worth of network-buffering, instead, I recover from a loss in 40ms. > > > > (please note, it's late, I might not have got the math entirely right) > > but you aren't going to be tuning the retry wait time per connection. what is > the retry time that is set in your stack? It's something huge to survive > international connections with satellite paths (so several seconds worth). If > your server-to-eyeball buffering is shorter than this, you will get a window > where you aren't fully utilizing the connection. > > so yes, I do think that if your purpose is to get the maximum possible in-order > packets delivered, you end up making different decisions than if you are just > trying to stream a HD video, or do other normal things. > > The problem is thinking that this absolute throughput is representitive of > normal use. > > > As physical RTTs grow shorter, the advantages of smaller buffers grow > larger. > > > > You don't need 50ms queueing delay on a 100us path. > > > > Many applications buffer for seconds due to needing to be at least > > 2*(actual buffering+RTT) on the path. > > For something like streaming video, there's nothing wrong with the application > buffering aggressivly (assuming you have the space to do so on the client side), > the more you have gotten transmitted to the client, the longer it can survive a > disruption of it's network. > > There's nothing wrong with having an hour of buffered data between the server > and the viewer's eyes.now, this buffering should not be in the network devices, it > should be in the > client app, but this isn't because there's something wrong with bufferng, it's > just because the client device has so much more available space to hold stuff. > > David Lang > > >> > >>> You eventually lose a packet, and you have to wait a really long > time > >>> until a replacement arrives. Stuart and I showed that at last ietf. > >>> And you get the classic "buffering" song playing.... > >> > >> > >> Yep, and if you buffer too much, your "lost packet" is actually still in > >> flight and eating bandwidth. > >> > >> David Lang > >> > >> > >>> low latency makes recovery from a loss in an in-order stream much, > much > >>> faster. > >>> > >>> Honestly, for most applications on the web, what you want is high > >>> speed in-order packet delivery, not > >>> "bulk throughput". There is a whole class of apps (bittorrent, file > >>> transfer) that don't need that, and we > >>> have protocols for those.... > >>> > >>> > >>> > >>> On Tue, May 27, 2014 at 2:19 PM, David Lang <david@lang.hm> > wrote: > >>>> > >>>> the problem is that paths change, they mix traffic from streams, > and in > >>>> other ways the utilization of the links can change radically in a > short > >>>> amount of time. 
> >>>> > >>>> If you try to limit things to exactly the ballistic throughput, > you are > >>>> not > >>>> going to be able to exactly maintain this state, you are either > going to > >>>> overshoot (too much traffic, requiring dropping packets to > maintain your > >>>> minimal buffer), or you are going to undershoot (too little > traffic and > >>>> your > >>>> connection is idle) > >>>> > >>>> Since you can't predict all the competing traffic throughout the > >>>> Internet, > >>>> if you want to maximize throughput, you want to buffer as much as > you can > >>>> tolerate for latency reasons. For most apps, this is more than > enough to > >>>> cause problems for other connections. > >>>> > >>>> David Lang > >>>> > >>>> > >>>> On Mon, 26 May 2014, David P. Reed wrote: > >>>> > >>>>> Codel and PIE are excellent first steps... but I don't think > they are > >>>>> the > >>>>> best eventual approach. I want to see them deployed ASAP in > CMTS' s and > >>>>> server load balancing networks... it would be a disaster to > not deploy > >>>>> the > >>>>> far better option we have today immediately at the point of > most > >>>>> leverage. > >>>>> The best is the enemy of the good. > >>>>> > >>>>> But, the community needs to learn once and for all that > throughput and > >>>>> latency do not trade off. We can in principle get far better > latency > >>>>> while > >>>>> maintaining high throughput.... and we need to start thinking > about > >>>>> that. > >>>>> That means that the framing of the issue as AQM is > counterproductive. > >>>>> > >>>>> On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> > wrote: > >>>>>> > >>>>>> > >>>>>> On Mon, 26 May 2014, dpreed@reed.com wrote: > >>>>>> > >>>>>>> I would look to queue minimization rather than "queue > management" > >>>>>> > >>>>>> > >>>>>> (which > >>>>>>> > >>>>>>> > >>>>>>> implied queues are often long) as a goal, and think > harder about the > >>>>>>> end-to-end problem of minimizing total end-to-end > queueing delay > >>>>>> > >>>>>> > >>>>>> while > >>>>>>> > >>>>>>> > >>>>>>> maximizing throughput. > >>>>>> > >>>>>> > >>>>>> > >>>>>> As far as I can tell, this is exactly what CODEL and PIE > tries to do. > >>>>>> They > >>>>>> try to find a decent tradeoff between having queues to > make sure the > >>>>>> pipe > >>>>>> is filled, and not making these queues big enough to > seriously affect > >>>>>> interactive performance. > >>>>>> > >>>>>> The latter part looks like what LEDBAT does? > >>>>>> <http://tools.ietf.org/html/rfc6817> > >>>>>> > >>>>>> Or are you thinking about something else? > >>>>> > >>>>> > >>>>> > >>>>> -- Sent from my Android device with K-@ Mail. Please excuse > my brevity. > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> Cerowrt-devel mailing list > >>>> Cerowrt-devel@lists.bufferbloat.net > >>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel > >>>> > >>>> _______________________________________________ > >>>> Cerowrt-devel mailing list > >>>> Cerowrt-devel@lists.bufferbloat.net > >>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel > >>>> > >>> > >>> > >>> > >>> > >> > > > > > > > > > [-- Attachment #2: Type: text/html, Size: 14213 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
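The sizing rule in the message above reduces to a small formula; a sketch, with the multipath case handled as described (the link shares and forwarding delay are example values, not measurements):

    MTU = 1500  # bytes

    def packets_in_flight(path_shares_bps, forwarding_delay_s, mtu=MTU):
        # With several simultaneous paths, "bandwidth of the bottleneck" becomes
        # the aggregate share across the minimum cut-set of the chosen paths.
        aggregate_bps = sum(path_shares_bps)
        return aggregate_bps / 8 * forwarding_delay_s / mtu

    print(round(packets_in_flight([200e6], 0.010)))        # one path: ~167 packets
    print(round(packets_in_flight([100e6, 50e6], 0.010)))  # two paths: ~125 packets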
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-28 15:52 ` dpreed @ 2014-05-28 16:34 ` David Lang 0 siblings, 0 replies; 30+ messages in thread From: David Lang @ 2014-05-28 16:34 UTC (permalink / raw) To: dpreed; +Cc: cerowrt-devel Ok, I am not understanding your proposal then. I thought you were claiming that since the optimum buffer length is 1-2 packets, the endpoints should be adjusting their sending speeds to try and make that happen on all switches and routers in the path. The endpoints do know what their latency budget is and can have up to that much data in flight. They don't know if that data is sitting in router buffers, or is in transit on a high-speed-high-latency link (high speed satellite links can have a LOT of data that's left the transmitter but not yet arrived at the receiver; this will look exactly like data sitting in a buffer to the endpoints). The endpoints don't know the state of all the intermediate connections, so unless they get feedback (ECN or dropped packets) they have to assume that there is no congestion. David Lang On Wed, 28 May 2014, dpreed@reed.com wrote: > Interesting conversation. A particular switch has no idea of the "latency budget" of a particular flow - so it cannot have its *own* latency budget. The switch designer has no choice but to assume that his latency budget is near zero. > > The number of packets that should be sustained in flight to maintain maximum throughput between the source (entry) switch and destination (exit) switch of the flow need be no higher than > > the flow's share of bandwidth of the bottleneck > > multiplied by > > the end-to-end delay (including packet forwarding, but not queueing). > > All buffering needed for isochrony ("jitter buffer") and "alternative path selection" can be moved to either before the entry switch or after the exit switch. > > If you have multiple simultaneous paths, the number of packets in flight involves replacing "bandwidth of the bottleneck" with "aggregate bandwidth across the minimum cut-set of the chosen paths used for the flow". > > Of course, these are dynamic - "the flow's share" and "paths used for the flow" change over short time scales. That's why you have a control loop that needs to measure them. > > The whole point of minimizing buffering is to make the measurements more timely and the control inputs more timely. This is not about convergence to an asymptote.... > > A network where every internal buffer is driven hard toward zero makes it possible to handle multiple paths, alternate paths, etc. more *easily*. That's partly because you allow endpoints to see what is happening to their flows more quickly so they can compensate. > > And of course for shared wireless resources, things change more quickly because of new factors - more sharing, more competition for collision-free slots, varying transmission rates, etc. > > The last thing you want is long-term standing waves caused by large buffers and very loose control. > > > > On Tuesday, May 27, 2014 11:21pm, "David Lang" <david@lang.hm> said: > > > >> On Tue, 27 May 2014, Dave Taht wrote: >> >> > On Tue, May 27, 2014 at 4:27 PM, David Lang <david@lang.hm> wrote: >> >> On Tue, 27 May 2014, Dave Taht wrote: >> >> >> >>> There is a phrase in this thread that is begging to bother me. >> >>> >> >>> "Throughput". 
Everyone assumes that throughput is a big goal - and >> it >> >>> certainly is - and latency is also a big goal - and it certainly is >> - >> >>> but by specifying what you want from "throughput" as a compromise >> with >> >>> latency is not the right thing... >> >>> >> >>> If what you want is actually "high speed in-order packet delivery" - >> >>> say, for example a movie, >> >>> or a video conference, youtube, or a video conference - excessive >> >>> latency with high throughput, really, really makes in-order packet >> >>> delivery at high speed tough. >> >> >> >> >> >> the key word here is "excessive", that's why I said that for max >> throughput >> >> you want to buffer as much as your latency budget will allow you to. >> > >> > Again I'm trying to make a distinction between "throughput", and "packets >> > delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think) >> > >> > The buffering should not be in-the-network, it can be in the application. >> > >> > Take our hypothetical video stream for example. I am 20ms RTT from netflix. >> > If I artificially inflate that by adding 50ms of in-network buffering, >> > that means a loss can >> > take 120ms to recover from. >> > >> > If instead, I keep a 3*RTT buffer in my application, and expect that I have >> 5ms >> > worth of network-buffering, instead, I recover from a loss in 40ms. >> > >> > (please note, it's late, I might not have got the math entirely right) >> >> but you aren't going to be tuning the retry wait time per connection. what is >> the retry time that is set in your stack? It's something huge to survive >> international connections with satellite paths (so several seconds worth). If >> your server-to-eyeball buffering is shorter than this, you will get a window >> where you aren't fully utilizing the connection. >> >> so yes, I do think that if your purpose is to get the maximum possible in-order >> packets delivered, you end up making different decisions than if you are just >> trying to stream a HD video, or do other normal things. >> >> The problem is thinking that this absolute throughput is representitive of >> normal use. >> >> > As physical RTTs grow shorter, the advantages of smaller buffers grow >> larger. >> > >> > You don't need 50ms queueing delay on a 100us path. >> > >> > Many applications buffer for seconds due to needing to be at least >> > 2*(actual buffering+RTT) on the path. >> >> For something like streaming video, there's nothing wrong with the application >> buffering aggressivly (assuming you have the space to do so on the client side), >> the more you have gotten transmitted to the client, the longer it can survive a >> disruption of it's network. >> >> There's nothing wrong with having an hour of buffered data between the server >> and the viewer's eyes.now, this buffering should not be in the network devices, it >> should be in the >> client app, but this isn't because there's something wrong with bufferng, it's >> just because the client device has so much more available space to hold stuff. >> >> David Lang >> >> >> >> >>> You eventually lose a packet, and you have to wait a really long >> time >> >>> until a replacement arrives. Stuart and I showed that at last ietf. >> >>> And you get the classic "buffering" song playing.... >> >> >> >> >> >> Yep, and if you buffer too much, your "lost packet" is actually still in >> >> flight and eating bandwidth. >> >> >> >> David Lang >> >> >> >> >> >>> low latency makes recovery from a loss in an in-order stream much, >> much >> >>> faster. 
>> >>> >> >>> Honestly, for most applications on the web, what you want is high >> >>> speed in-order packet delivery, not >> >>> "bulk throughput". There is a whole class of apps (bittorrent, file >> >>> transfer) that don't need that, and we >> >>> have protocols for those.... >> >>> >> >>> >> >>> >> >>> On Tue, May 27, 2014 at 2:19 PM, David Lang <david@lang.hm> >> wrote: >> >>>> >> >>>> the problem is that paths change, they mix traffic from streams, >> and in >> >>>> other ways the utilization of the links can change radically in a >> short >> >>>> amount of time. >> >>>> >> >>>> If you try to limit things to exactly the ballistic throughput, >> you are >> >>>> not >> >>>> going to be able to exactly maintain this state, you are either >> going to >> >>>> overshoot (too much traffic, requiring dropping packets to >> maintain your >> >>>> minimal buffer), or you are going to undershoot (too little >> traffic and >> >>>> your >> >>>> connection is idle) >> >>>> >> >>>> Since you can't predict all the competing traffic throughout the >> >>>> Internet, >> >>>> if you want to maximize throughput, you want to buffer as much as >> you can >> >>>> tolerate for latency reasons. For most apps, this is more than >> enough to >> >>>> cause problems for other connections. >> >>>> >> >>>> David Lang >> >>>> >> >>>> >> >>>> On Mon, 26 May 2014, David P. Reed wrote: >> >>>> >> >>>>> Codel and PIE are excellent first steps... but I don't think >> they are >> >>>>> the >> >>>>> best eventual approach. I want to see them deployed ASAP in >> CMTS' s and >> >>>>> server load balancing networks... it would be a disaster to >> not deploy >> >>>>> the >> >>>>> far better option we have today immediately at the point of >> most >> >>>>> leverage. >> >>>>> The best is the enemy of the good. >> >>>>> >> >>>>> But, the community needs to learn once and for all that >> throughput and >> >>>>> latency do not trade off. We can in principle get far better >> latency >> >>>>> while >> >>>>> maintaining high throughput.... and we need to start thinking >> about >> >>>>> that. >> >>>>> That means that the framing of the issue as AQM is >> counterproductive. >> >>>>> >> >>>>> On May 26, 2014, Mikael Abrahamsson <swmike@swm.pp.se> >> wrote: >> >>>>>> >> >>>>>> >> >>>>>> On Mon, 26 May 2014, dpreed@reed.com wrote: >> >>>>>> >> >>>>>>> I would look to queue minimization rather than "queue >> management" >> >>>>>> >> >>>>>> >> >>>>>> (which >> >>>>>>> >> >>>>>>> >> >>>>>>> implied queues are often long) as a goal, and think >> harder about the >> >>>>>>> end-to-end problem of minimizing total end-to-end >> queueing delay >> >>>>>> >> >>>>>> >> >>>>>> while >> >>>>>>> >> >>>>>>> >> >>>>>>> maximizing throughput. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> As far as I can tell, this is exactly what CODEL and PIE >> tries to do. >> >>>>>> They >> >>>>>> try to find a decent tradeoff between having queues to >> make sure the >> >>>>>> pipe >> >>>>>> is filled, and not making these queues big enough to >> seriously affect >> >>>>>> interactive performance. >> >>>>>> >> >>>>>> The latter part looks like what LEDBAT does? >> >>>>>> <http://tools.ietf.org/html/rfc6817> >> >>>>>> >> >>>>>> Or are you thinking about something else? >> >>>>> >> >>>>> >> >>>>> >> >>>>> -- Sent from my Android device with K-@ Mail. Please excuse >> my brevity. 
>> >>>> >> >>>> >> >>>> >> >>>> _______________________________________________ >> >>>> Cerowrt-devel mailing list >> >>>> Cerowrt-devel@lists.bufferbloat.net >> >>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel >> >>>> >> >>>> _______________________________________________ >> >>>> Cerowrt-devel mailing list >> >>>> Cerowrt-devel@lists.bufferbloat.net >> >>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel >> >>>> >> >>> >> >>> >> >>> >> >>> >> >> >> > >> > >> > >> > >> ^ permalink raw reply [flat|nested] 30+ messages in thread
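The "in flight versus buffered" distinction in the message above can be made concrete with the quoted formula (the flow's share of the bottleneck bandwidth multiplied by the end-to-end delay). A minimal sketch, assuming a 1500-byte MTU; the 100 Mbit/s share and the two delays are illustrative numbers, not figures from the thread:

MTU = 1500  # bytes, assumed for the packet-count conversion

def packets_in_flight(share_bits_per_sec, delay_sec):
    # flow's share of bottleneck bandwidth (bytes/sec) * end-to-end delay,
    # expressed in MTU-sized packets
    return share_bits_per_sec / 8.0 * delay_sec / MTU

for name, delay in (("20 ms terrestrial path", 0.020),
                    ("600 ms GEO satellite path", 0.600)):
    print("%s, 100 Mbit/s share: ~%.0f packets in flight"
          % (name, packets_in_flight(100e6, delay)))

# 20 ms terrestrial path, 100 Mbit/s share: ~167 packets in flight
# 600 ms GEO satellite path, 100 Mbit/s share: ~5000 packets in flight

To the endpoints, those ~5000 satellite packets "somewhere in the network" look no different from packets parked in a bloated buffer; only feedback (ECN marks or drops) distinguishes congestion from propagation delay.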
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-25 20:00 ` dpreed 2014-05-26 0:18 ` Mikael Abrahamsson @ 2014-05-27 15:23 ` Jim Gettys 2014-05-27 17:31 ` Dave Taht 2014-05-28 15:20 ` dpreed 1 sibling, 2 replies; 30+ messages in thread From: Jim Gettys @ 2014-05-27 15:23 UTC (permalink / raw) To: David P Reed; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 8676 bytes --] On Sun, May 25, 2014 at 4:00 PM, <dpreed@reed.com> wrote: > Not that it is directly relevant, but there is no essential reason to > require 50 ms. of buffering. That might be true of some particular > QOS-related router algorithm. 50 ms. is about all one can tolerate in any > router between source and destination for today's networks - an upper-bound > rather than a minimum. > > > > The optimum buffer state for throughput is 1-2 packets worth - in other > words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck > buffer (the input queue to the lowest speed link along the path) should > have this much actually buffered. Buffering more than this increases > end-to-end latency beyond its optimal state. Increased end-to-end latency > reduces the effectiveness of control loops, creating more congestion. > > > > The rationale for having 50 ms. of buffering is probably to avoid > disruption of bursty mixed flows where the bursts might persist for 50 ms. > and then die. One reason for this is that source nodes run operating > systems that tend to release packets in bursts. That's a whole other > discussion - in an ideal world, source nodes would avoid bursty packet > releases by letting the control by the receiver window be "tight" > timing-wise. That is, to transmit a packet immediately at the instant an > ACK arrives increasing the window. This would pace the flow - current OS's > tend (due to scheduling mismatches) to send bursts of packets, "catching > up" on sending that could have been spaced out and done earlier if the > feedback from the receiver's window advancing were heeded. > > > > That is, endpoint network stacks (TCP implementations) can worsen > congestion by "dallying". The ideal end-to-end flows occupying a congested > router would have their packets paced so that the packets end up being sent > in the least bursty manner that an application can support. The effect of > this pacing is to move the "backlog" for each flow quickly into the source > node for that flow, which then provides back pressure on the application > driving the flow, which ultimately is necessary to stanch congestion. The > ideal congestion control mechanism slows the sender part of the application > to a pace that can go through the network without contributing to buffering. > Pacing is in Linux 3.12(?). How long it will take to see widespread deployment is another question, and as for other operating systems, who knows. See: https://lwn.net/Articles/564978/ > > > Current network stacks (including Linux's) don't achieve that goal - their > pushback on application sources is minimal - instead they accumulate > buffering internal to the network implementation. > This is much, much less true than it once was. There have been substantial changes in the Linux TCP stack in the last year or two, to avoid generating packets before necessary. Again, how long it will take for people to deploy this on Linux (and implement on other OS's) is a question. > This contributes to end-to-end latency as well. But if you think about > it, this is almost as bad as switch-level bufferbloat in terms of degrading > user experience. 
The reason I say "almost" is that there are tools, rarely > used in practice, that allow an application to specify that buffering > should not build up in the network stack (in the kernel or wherever it is). > But the default is not to use those APIs, and to buffer way too much. > > > > Remember, the network send stack can act similarly to a congested switch > (it is a switch among all the user applications running on that node). IF > there is a heavy file transfer, the file transfer's buffering acts to > increase latency for all other networked communications on that machine. > > > > Traditionally this problem has been thought of only as a within-node > fairness issue, but in fact it has a big effect on the switches in between > source and destination due to the lack of dispersed pacing of the packets > at the source - in other words, the current design does nothing to stem the > "burst groups" from a single source mentioned above. > > > > So we do need the source nodes to implement less "bursty" sending stacks. > This is especially true for multiplexed source nodes, such as web servers > implementing thousands of flows. > > > > A combination of codel-style switch-level buffer management and the stack > at the sender being implemented to spread packets in a particular TCP flow > out over time would improve things a lot. To achieve best throughput, the > optimal way to spread packets out on an end-to-end basis is to update the > receive window (sending ACK) at the receive end as quickly as possible, and > to respond to the updated receive window as quickly as possible when it > increases. > > > > Just like the "bufferbloat" issue, the problem is caused by applications > like streaming video, file transfers and big web pages that the application > programmer sees as not having a latency requirement within the flow, so the > application programmer does not have an incentive to control pacing. Thus > the operating system has got to push back on the applications' flow > somehow, so that the flow ends up paced once it enters the Internet itself. > So there's no real problem caused by large buffering in the network stack > at the endpoint, as long as the stack's delivery to the Internet is paced > by some mechanism, e.g. tight management of receive window control on an > end-to-end basis. > > > > I don't think this can be fixed by cerowrt, so this is out of place here. > It's partially ameliorated by cerowrt, if it aggressively drops packets > from flows that burst without pacing. fq_codel does this, if the buffer > size it aims for is small - but the problem is that the OS stacks don't > respond by pacing... they tend to respond by bursting, not because TCP > doesn't provide the mechanisms for pacing, but because the OS stack doesn't > transmit as soon as it is allowed to - thus building up a burst > unnecessarily. > > > > Bursts on a flow are thus bad in general. They make congestion happen > when it need not. > By far the biggest headache is what the Web does to the network. It has turned the web into a burst generator. A typical web page may have 10 (or even more images). See the "connections per page" plot in the link below. A browser downloads the base page, and then, over N connections, essentially simultaneously downloads those embedded objects. Many/most of them are small in size (4-10 packets). You never even get near slow start. So you get an IW amount of data/TCP connection, with no pacing, and no congestion avoidance. 
It is easy to observe 50-100 packets (or more) back to back at the bottleneck. This is (in practice) the amount you have to buffer today: that burst of packets from a web page. Without flow queuing, you are screwed. With it, it's annoying, but can be tolerated. I go over this in detail in: http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/ So far, I don't believe anyone has tried pacing the IW burst of packets. I'd certainly like to see that, but pacing needs to be across TCP connections (host pairs) to possibly be effective at outwitting the gaming the web has done to the network. - Jim > > > > > > > > On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <swmike@swm.pp.se> > said: > > > On Sun, 25 May 2014, Dane Medic wrote: > > > > > Is it true that devices with less than 64 MB can't handle QOS? -> > > > > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html > > > > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = > > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. > > > > I also don't see why performance and memory size would be relevant, I'd > > say forwarding performance has more to do with CPU speed than anything > > else. > > > > -- > > Mikael Abrahamsson email: swmike@swm.pp.se > > _______________________________________________ > > Cerowrt-devel mailing list > > Cerowrt-devel@lists.bufferbloat.net > > https://lists.bufferbloat.net/listinfo/cerowrt-devel > > > > _______________________________________________ > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/cerowrt-devel > > [-- Attachment #2: Type: text/html, Size: 13314 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
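A small illustration of the burst being described: each of N parallel connections can emit its full initial window with no pacing, so the bottleneck can see on the order of N times IW segments arrive back to back. IW10 and a 1460-byte MSS are the usual defaults (RFC 6928); the connection counts are just examples:

IW_SEGMENTS = 10   # RFC 6928 initial window
MSS = 1460         # bytes, typical Ethernet MSS

def web_page_burst(connections, iw=IW_SEGMENTS, mss=MSS):
    packets = connections * iw
    return packets, packets * mss

for n in (6, 10, 15):
    pkts, size_bytes = web_page_burst(n)
    print("%2d parallel connections -> up to %d packets (~%d kB) back to back"
          % (n, pkts, size_bytes // 1000))

#  6 parallel connections -> up to 60 packets (~87 kB) back to back
# 10 parallel connections -> up to 100 packets (~146 kB) back to back
# 15 parallel connections -> up to 150 packets (~219 kB) back to back

A dozen or so connections already produces the 50-100+ packet bursts described above, and most of those short transfers finish before congestion avoidance ever engages.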
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-27 15:23 ` Jim Gettys @ 2014-05-27 17:31 ` Dave Taht 2014-05-28 15:33 ` dpreed 2014-05-28 15:20 ` dpreed 1 sibling, 1 reply; 30+ messages in thread From: Dave Taht @ 2014-05-27 17:31 UTC (permalink / raw) To: Jim Gettys, bloat; +Cc: cerowrt-devel This has been a good thread, and I'm sorry it was mostly on cerowrt-devel rather than the main list... It is not clear from observing google's deployment that pacing of the IW is not in use. I see clear 1ms boundaries for individual flows on much lower than iw10 boundaries. (e.g. I see 1-4 packets at a time arrive at 1ms intervals - but this could be an artifact of the capture, intermediate devices, etc) sch_fq comes with explicit support for spreading out the initial window, (by default it allows a full iw10 burst however) and tcp small queues and pacing-aware tcps and the tso fixes and stuff we don't know about all are collaborating to reduce the web burst size... sch_fq_codel used as the host/router qdisc basically does spread out any flow if there is a bottleneck on the link. The pacing stuff spreads flow delivery out across an estimate of srtt by clock tick... It makes tremendous sense to pace out a flow if you are hitting the wire at 10gbit and know you are stepping down to 100mbit or less on the end device - that 100x difference in rate is meaningful... and at the same time to get full throughput out of 10gbit some level of tso offloads is needed... and the initial guess at the right pace is hard to get right before a couple RTTs go by. I look forward to learning what's up. On Tue, May 27, 2014 at 8:23 AM, Jim Gettys <jg@freedesktop.org> wrote: > > > > On Sun, May 25, 2014 at 4:00 PM, <dpreed@reed.com> wrote: >> >> Not that it is directly relevant, but there is no essential reason to >> require 50 ms. of buffering. That might be true of some particular >> QOS-related router algorithm. 50 ms. is about all one can tolerate in any >> router between source and destination for today's networks - an upper-bound >> rather than a minimum. >> >> >> >> The optimum buffer state for throughput is 1-2 packets worth - in other >> words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck >> buffer (the input queue to the lowest speed link along the path) should have >> this much actually buffered. Buffering more than this increases end-to-end >> latency beyond its optimal state. Increased end-to-end latency reduces the >> effectiveness of control loops, creating more congestion. This misses an important facet of modern macs (wifi, wireless, cable, and gpon), which which can aggregate 32k or more in packets. So the ideal size in those cases is much larger than a MTU, and has additional factors governing the ideal - such as the probability of a packet loss inducing a retransmit.... Ethernet, sure. >> >> >> >> The rationale for having 50 ms. of buffering is probably to avoid >> disruption of bursty mixed flows where the bursts might persist for 50 ms. >> and then die. One reason for this is that source nodes run operating systems >> that tend to release packets in bursts. That's a whole other discussion - in >> an ideal world, source nodes would avoid bursty packet releases by letting >> the control by the receiver window be "tight" timing-wise. That is, to >> transmit a packet immediately at the instant an ACK arrives increasing the >> window. 
This would pace the flow - current OS's tend (due to scheduling >> mismatches) to send bursts of packets, "catching up" on sending that could >> have been spaced out and done earlier if the feedback from the receiver's >> window advancing were heeded. This loop has got ever tighter since linux 3.3, to where it's really as tight as a modern cpu scheduler can get it. (or so I keep thinking - but successive improvements in linux tcp keep proving me wrong. :) I am really in awe of linux tcp these days. Recently I was benchmarking windows and macos. Windows only got 60% of the throughput linux tcp did at gigE speeds, and osx had a lot of issues at 10mbit and below, stretch acks and holding the window too high for the path) I keep hoping better ethernet hardware will arrive that can mix flows even more. >> >> >> >> That is, endpoint network stacks (TCP implementations) can worsen >> congestion by "dallying". The ideal end-to-end flows occupying a congested >> router would have their packets paced so that the packets end up being sent >> in the least bursty manner that an application can support. The effect of >> this pacing is to move the "backlog" for each flow quickly into the source >> node for that flow, which then provides back pressure on the application >> driving the flow, which ultimately is necessary to stanch congestion. The >> ideal congestion control mechanism slows the sender part of the application >> to a pace that can go through the network without contributing to buffering. > > > Pacing is in Linux 3.12(?). How long it will take to see widespread > deployment is another question, and as for other operating systems, who > knows. > > See: https://lwn.net/Articles/564978/ Steinar drove some of this with persistence and results... http://www.linux-support.com/cms/steinar-h-gunderson-paced-tcp-and-the-fq-scheduler/ >> >> >> >> Current network stacks (including Linux's) don't achieve that goal - their >> pushback on application sources is minimal - instead they accumulate >> buffering internal to the network implementation. > > > This is much, much less true than it once was. There have been substantial > changes in the Linux TCP stack in the last year or two, to avoid generating > packets before necessary. Again, how long it will take for people to deploy > this on Linux (and implement on other OS's) is a question. The data centers I'm in (linode, isc, google cloud) seem to be tracking modern kernels pretty good... >> >> This contributes to end-to-end latency as well. But if you think about >> it, this is almost as bad as switch-level bufferbloat in terms of degrading >> user experience. The reason I say "almost" is that there are tools, rarely >> used in practice, that allow an application to specify that buffering should >> not build up in the network stack (in the kernel or wherever it is). But >> the default is not to use those APIs, and to buffer way too much. >> >> >> >> Remember, the network send stack can act similarly to a congested switch >> (it is a switch among all the user applications running on that node). IF >> there is a heavy file transfer, the file transfer's buffering acts to >> increase latency for all other networked communications on that machine. 
>> >> >> >> Traditionally this problem has been thought of only as a within-node >> fairness issue, but in fact it has a big effect on the switches in between >> source and destination due to the lack of dispersed pacing of the packets at >> the source - in other words, the current design does nothing to stem the >> "burst groups" from a single source mentioned above. >> >> >> >> So we do need the source nodes to implement less "bursty" sending stacks. >> This is especially true for multiplexed source nodes, such as web servers >> implementing thousands of flows. >> >> >> >> A combination of codel-style switch-level buffer management and the stack >> at the sender being implemented to spread packets in a particular TCP flow >> out over time would improve things a lot. To achieve best throughput, the >> optimal way to spread packets out on an end-to-end basis is to update the >> receive window (sending ACK) at the receive end as quickly as possible, and >> to respond to the updated receive window as quickly as possible when it >> increases. >> >> >> >> Just like the "bufferbloat" issue, the problem is caused by applications >> like streaming video, file transfers and big web pages that the application >> programmer sees as not having a latency requirement within the flow, so the >> application programmer does not have an incentive to control pacing. Thus >> the operating system has got to push back on the applications' flow somehow, >> so that the flow ends up paced once it enters the Internet itself. So >> there's no real problem caused by large buffering in the network stack at >> the endpoint, as long as the stack's delivery to the Internet is paced by >> some mechanism, e.g. tight management of receive window control on an >> end-to-end basis. >> >> >> >> I don't think this can be fixed by cerowrt, so this is out of place here. >> It's partially ameliorated by cerowrt, if it aggressively drops packets from >> flows that burst without pacing. fq_codel does this, if the buffer size it >> aims for is small - but the problem is that the OS stacks don't respond by >> pacing... they tend to respond by bursting, not because TCP doesn't provide >> the mechanisms for pacing, but because the OS stack doesn't transmit as soon >> as it is allowed to - thus building up a burst unnecessarily. >> >> >> >> Bursts on a flow are thus bad in general. They make congestion happen >> when it need not. > > > By far the biggest headache is what the Web does to the network. It has > turned the web into a burst generator. > > A typical web page may have 10 (or even more images). See the "connections > per page" plot in the link below. > > A browser downloads the base page, and then, over N connections, essentially > simultaneously downloads those embedded objects. Many/most of them are > small in size (4-10 packets). You never even get near slow start. > > So you get an IW amount of data/TCP connection, with no pacing, and no > congestion avoidance. It is easy to observe 50-100 packets (or more) back > to back at the bottleneck. > > This is (in practice) the amount you have to buffer today: that burst of > packets from a web page. Without flow queuing, you are screwed. With it, > it's annoying, but can be tolerated. > > > I go over this is detail in: > > http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/ > > So far, I don't believe anyone has tried pacing the IW burst of packets. 
> I'd certainly like to see that, but pacing needs to be across TCP > connections (host pairs) to be possibly effective to outwit the gaming the > web has done to the network. > - Jim > >> >> >> >> >> >> >> >> >> On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <swmike@swm.pp.se> >> said: >> >> > On Sun, 25 May 2014, Dane Medic wrote: >> > >> > > Is it true that devices with less than 64 MB can't handle QOS? -> >> > > >> > > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html >> > >> > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = >> > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. >> > >> > I also don't see why performance and memory size would be relevant, I'd >> > say forwarding performance has more to do with CPU speed than anything >> > else. >> > >> > -- >> > Mikael Abrahamsson email: swmike@swm.pp.se >> > _______________________________________________ >> > Cerowrt-devel mailing list >> > Cerowrt-devel@lists.bufferbloat.net >> > https://lists.bufferbloat.net/listinfo/cerowrt-devel >> > >> >> >> _______________________________________________ >> Cerowrt-devel mailing list >> Cerowrt-devel@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/cerowrt-devel >> > > > _______________________________________________ > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/cerowrt-devel > -- Dave Täht NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article ^ permalink raw reply [flat|nested] 30+ messages in thread
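For concreteness, the per-socket knob that sch_fq honours is SO_MAX_PACING_RATE (value 47 in the Linux headers; older Python builds may not expose it by name). A minimal sketch of an application capping its own sending rate - the host, port, and rate below are made-up example values:

import socket

# Linux-only option; fall back to the raw header value if this Python build
# does not define the symbolic constant.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

def open_paced_connection(host, port, rate_bytes_per_sec):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask sch_fq (if installed as the qdisc) to spread this flow's packets
    # out so it never exceeds rate_bytes_per_sec.
    s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, rate_bytes_per_sec)
    s.connect((host, port))
    return s

# Example: cap a bulk transfer at ~12.5 MB/s (100 Mbit/s) toward a
# hypothetical server.
# conn = open_paced_connection("example.net", 12345, 12500000)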
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-27 17:31 ` Dave Taht @ 2014-05-28 15:33 ` dpreed 0 siblings, 0 replies; 30+ messages in thread From: dpreed @ 2014-05-28 15:33 UTC (permalink / raw) To: Dave Taht; +Cc: cerowrt-devel, bloat [-- Attachment #1: Type: text/plain, Size: 13987 bytes --] Same concern I mentioned with Jim's message. I was not clear what I meant by "pacing" in the context of optimization of latency while preserving throughput. It is NOT just a matter of spreading packets out in time that I was talking about. It is a matter of doing so without reducing throughput. That means transmitting as *early* as possible while avoiding congestion. Building a "backlog" and then artificially spreading it out by "add-on pacing" will definitely reduce throughput below the flow's fair share of the bottleneck resource. It is pretty clear to me that you can't get to a minimal latency, optimal throughput control algorithm by a series of "add ons" in LART. It requires rethinking of the control discipline, and changes to get more information about congestion earlier, without ever allowing a buffer queue to build up in intermediate nodes - since that destroys latency by definition. As long as you require buffers to grow at bottleneck links in order to get measurements of congestion, you probably are stuck with long-time-constant control loops, and as long as you encourage buffering at OS send stacks you are even worse off at the application layer. The problem is in the assumption that buffer queueing is the only possible answer. The "pacing" being included in Linux is just another way to build bigger buffers (on the sending host), by taking control away from the TCP control loop. On Tuesday, May 27, 2014 1:31pm, "Dave Taht" <dave.taht@gmail.com> said: > This has been a good thread, and I'm sorry it was mostly on > cerowrt-devel rather than the main list... > > It is not clear from observing google's deployment that pacing of the > IW is not in use. I see > clear 1ms boundaries for individual flows on much lower than iw10 > boundaries. (e.g. I see 1-4 > packets at a time arrive at 1ms intervals - but this could be an > artifact of the capture, intermediate > devices, etc) > > sch_fq comes with explicit support for spreading out the initial > window, (by default it allows a full iw10 burst however) and tcp small > queues and pacing-aware tcps and the tso fixes and stuff we don't know > about all are collaborating to reduce the web burst size... > > sch_fq_codel used as the host/router qdisc basically does spread out > any flow if there is a bottleneck on the link. The pacing stuff > spreads flow delivery out across an estimate of srtt by clock tick... > > It makes tremendous sense to pace out a flow if you are hitting the > wire at 10gbit and know you are stepping down to 100mbit or less on > the end device - that 100x difference in rate is meaningful... and at > the same time to get full throughput out of 10gbit some level of tso > offloads is needed... and the initial guess > at the right pace is hard to get right before a couple RTTs go by. > > I look forward to learning what's up. > > On Tue, May 27, 2014 at 8:23 AM, Jim Gettys <jg@freedesktop.org> wrote: > > > > > > > > On Sun, May 25, 2014 at 4:00 PM, <dpreed@reed.com> wrote: > >> > >> Not that it is directly relevant, but there is no essential reason to > >> require 50 ms. of buffering. That might be true of some particular > >> QOS-related router algorithm. 50 ms. 
is about all one can tolerate in > any > >> router between source and destination for today's networks - an > upper-bound > >> rather than a minimum. > >> > >> > >> > >> The optimum buffer state for throughput is 1-2 packets worth - in other > >> words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck > >> buffer (the input queue to the lowest speed link along the path) should > have > >> this much actually buffered. Buffering more than this increases > end-to-end > >> latency beyond its optimal state. Increased end-to-end latency reduces > the > >> effectiveness of control loops, creating more congestion. > > This misses an important facet of modern macs (wifi, wireless, cable, and gpon), > which which can aggregate 32k or more in packets. > > So the ideal size in those cases is much larger than a MTU, and has additional > factors governing the ideal - such as the probability of a packet loss inducing > a retransmit.... > > Ethernet, sure. > > >> > >> > >> > >> The rationale for having 50 ms. of buffering is probably to avoid > >> disruption of bursty mixed flows where the bursts might persist for 50 > ms. > >> and then die. One reason for this is that source nodes run operating > systems > >> that tend to release packets in bursts. That's a whole other discussion - > in > >> an ideal world, source nodes would avoid bursty packet releases by > letting > >> the control by the receiver window be "tight" timing-wise. That is, to > >> transmit a packet immediately at the instant an ACK arrives increasing > the > >> window. This would pace the flow - current OS's tend (due to scheduling > >> mismatches) to send bursts of packets, "catching up" on sending that > could > >> have been spaced out and done earlier if the feedback from the > receiver's > >> window advancing were heeded. > > This loop has got ever tighter since linux 3.3, to where it's really as tight > as a modern cpu scheduler can get it. (or so I keep thinking - > but successive improvements in linux tcp keep proving me wrong. :) > > I am really in awe of linux tcp these days. Recently I was benchmarking > windows and macos. Windows only got 60% of the throughput linux tcp > did at gigE speeds, and osx had a lot of issues at 10mbit and below, > stretch acks and holding the window too high for the path) > > I keep hoping better ethernet hardware will arrive that can mix flows > even more. > > >> > >> > >> > >> That is, endpoint network stacks (TCP implementations) can worsen > >> congestion by "dallying". The ideal end-to-end flows occupying a > congested > >> router would have their packets paced so that the packets end up being > sent > >> in the least bursty manner that an application can support. The effect > of > >> this pacing is to move the "backlog" for each flow quickly into the > source > >> node for that flow, which then provides back pressure on the application > >> driving the flow, which ultimately is necessary to stanch congestion. > The > >> ideal congestion control mechanism slows the sender part of the > application > >> to a pace that can go through the network without contributing to > buffering. > > > > > > Pacing is in Linux 3.12(?). How long it will take to see widespread > > deployment is another question, and as for other operating systems, who > > knows. > > > > See: https://lwn.net/Articles/564978/ > > Steinar drove some of this with persistence and results... 
> > http://www.linux-support.com/cms/steinar-h-gunderson-paced-tcp-and-the-fq-scheduler/ > > >> > >> > >> > >> Current network stacks (including Linux's) don't achieve that goal - > their > >> pushback on application sources is minimal - instead they accumulate > >> buffering internal to the network implementation. > > > > > > This is much, much less true than it once was. There have been substantial > > changes in the Linux TCP stack in the last year or two, to avoid generating > > packets before necessary. Again, how long it will take for people to deploy > > this on Linux (and implement on other OS's) is a question. > > The data centers I'm in (linode, isc, google cloud) seem to be > tracking modern kernels pretty good... > > >> > >> This contributes to end-to-end latency as well. But if you think about > >> it, this is almost as bad as switch-level bufferbloat in terms of > degrading > >> user experience. The reason I say "almost" is that there are tools, > rarely > >> used in practice, that allow an application to specify that buffering > should > >> not build up in the network stack (in the kernel or wherever it is). > But > >> the default is not to use those APIs, and to buffer way too much. > >> > >> > >> > >> Remember, the network send stack can act similarly to a congested switch > >> (it is a switch among all the user applications running on that node). > IF > >> there is a heavy file transfer, the file transfer's buffering acts to > >> increase latency for all other networked communications on that machine. > >> > >> > >> > >> Traditionally this problem has been thought of only as a within-node > >> fairness issue, but in fact it has a big effect on the switches in > between > >> source and destination due to the lack of dispersed pacing of the packets > at > >> the source - in other words, the current design does nothing to stem the > >> "burst groups" from a single source mentioned above. > >> > >> > >> > >> So we do need the source nodes to implement less "bursty" sending > stacks. > >> This is especially true for multiplexed source nodes, such as web > servers > >> implementing thousands of flows. > >> > >> > >> > >> A combination of codel-style switch-level buffer management and the > stack > >> at the sender being implemented to spread packets in a particular TCP > flow > >> out over time would improve things a lot. To achieve best throughput, > the > >> optimal way to spread packets out on an end-to-end basis is to update > the > >> receive window (sending ACK) at the receive end as quickly as possible, > and > >> to respond to the updated receive window as quickly as possible when it > >> increases. > >> > >> > >> > >> Just like the "bufferbloat" issue, the problem is caused by applications > >> like streaming video, file transfers and big web pages that the > application > >> programmer sees as not having a latency requirement within the flow, so > the > >> application programmer does not have an incentive to control pacing. > Thus > >> the operating system has got to push back on the applications' flow > somehow, > >> so that the flow ends up paced once it enters the Internet itself. So > >> there's no real problem caused by large buffering in the network stack > at > >> the endpoint, as long as the stack's delivery to the Internet is paced > by > >> some mechanism, e.g. tight management of receive window control on an > >> end-to-end basis. > >> > >> > >> > >> I don't think this can be fixed by cerowrt, so this is out of place > here. 
> >> It's partially ameliorated by cerowrt, if it aggressively drops packets > from > >> flows that burst without pacing. fq_codel does this, if the buffer size > it > >> aims for is small - but the problem is that the OS stacks don't respond > by > >> pacing... they tend to respond by bursting, not because TCP doesn't > provide > >> the mechanisms for pacing, but because the OS stack doesn't transmit as > soon > >> as it is allowed to - thus building up a burst unnecessarily. > >> > >> > >> > >> Bursts on a flow are thus bad in general. They make congestion happen > >> when it need not. > > > > > > By far the biggest headache is what the Web does to the network. It has > > turned the web into a burst generator. > > > > A typical web page may have 10 (or even more images). See the "connections > > per page" plot in the link below. > > > > A browser downloads the base page, and then, over N connections, essentially > > simultaneously downloads those embedded objects. Many/most of them are > > small in size (4-10 packets). You never even get near slow start. > > > > So you get an IW amount of data/TCP connection, with no pacing, and no > > congestion avoidance. It is easy to observe 50-100 packets (or more) back > > to back at the bottleneck. > > > > This is (in practice) the amount you have to buffer today: that burst of > > packets from a web page. Without flow queuing, you are screwed. With it, > > it's annoying, but can be tolerated. > > > > > > I go over this is detail in: > > > > > http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/ > > > > So far, I don't believe anyone has tried pacing the IW burst of packets. > > I'd certainly like to see that, but pacing needs to be across TCP > > connections (host pairs) to be possibly effective to outwit the gaming the > > web has done to the network. > > - Jim > > > >> > >> > >> > >> > >> > >> > >> > >> > >> On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" > <swmike@swm.pp.se> > >> said: > >> > >> > On Sun, 25 May 2014, Dane Medic wrote: > >> > > >> > > Is it true that devices with less than 64 MB can't handle QOS? > -> > >> > > > >> > > > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html > >> > > >> > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s > = > >> > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. > >> > > >> > I also don't see why performance and memory size would be relevant, > I'd > >> > say forwarding performance has more to do with CPU speed than > anything > >> > else. > >> > > >> > -- > >> > Mikael Abrahamsson email: swmike@swm.pp.se > >> > _______________________________________________ > >> > Cerowrt-devel mailing list > >> > Cerowrt-devel@lists.bufferbloat.net > >> > https://lists.bufferbloat.net/listinfo/cerowrt-devel > >> > > >> > >> > >> _______________________________________________ > >> Cerowrt-devel mailing list > >> Cerowrt-devel@lists.bufferbloat.net > >> https://lists.bufferbloat.net/listinfo/cerowrt-devel > >> > > > > > > _______________________________________________ > > Cerowrt-devel mailing list > > Cerowrt-devel@lists.bufferbloat.net > > https://lists.bufferbloat.net/listinfo/cerowrt-devel > > > > > > -- > Dave Täht > > NSFW: > https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article > [-- Attachment #2: Type: text/html, Size: 18073 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
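A toy contrast (my own sketch, not code from any TCP stack) of the two things being called "pacing" in this exchange: an add-on pacer holds an already-queued packet until a timer fires, while an ack-clocked sender transmits the instant an arriving ACK opens the window and never dallies. Names such as window_advanced are purely illustrative:

import time

def add_on_paced_send(packets, interval_s, wire_send):
    # "Add-on" pacing: every packet waits in a local queue for its release
    # time, adding sender-side delay even when the window would allow
    # sending right now.
    for pkt in packets:
        time.sleep(interval_s)
        wire_send(pkt)

def ack_clocked_send(acks, next_packet, wire_send):
    # Ack clocking: each arriving ACK that advances the window releases the
    # next segment immediately - transmit as early as allowed, avoiding
    # bunching without building a backlog at the sender.
    for ack in acks:                      # acks: any stream of window updates
        if ack.window_advanced:           # illustrative attribute name
            pkt = next_packet()
            if pkt is None:
                return
            wire_send(pkt)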
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-27 15:23 ` Jim Gettys 2014-05-27 17:31 ` Dave Taht @ 2014-05-28 15:20 ` dpreed 2014-05-28 18:33 ` David Lang 1 sibling, 1 reply; 30+ messages in thread From: dpreed @ 2014-05-28 15:20 UTC (permalink / raw) To: Jim Gettys; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 9893 bytes --] I did not mean that "pacing". Sorry I used a generic term. I meant what my longer description described - a specific mechanism for reducing bunching that is essentially "cooperative" among all active flows through a bottlenecked link. That's part of a "closed loop" control system driving each TCP endpoint into a cooperative mode. The thing you call "pacing" is something quite different. It is disconnected from the TCP control loops involved, which basically means it is flying blind. Introducing that kind of "pacing" almost certainly reduces throughput, because it *delays* packets. The thing I called "pacing" is in no version of Linux that I know of. Give it a different name: "anti-bunching cooperation" or "timing phase management for congestion reduction". Rather than *delaying* packets, it tries to get packets to avoid bunching only when reducing window size, and doing so by tightening the control loop so that the sender transmits as *soon* as it can, not by delaying sending after the sender dallies around not sending when it can. On Tuesday, May 27, 2014 11:23am, "Jim Gettys" <jg@freedesktop.org> said: On Sun, May 25, 2014 at 4:00 PM, <[dpreed@reed.com](mailto:dpreed@reed.com)> wrote: Not that it is directly relevant, but there is no essential reason to require 50 ms. of buffering. That might be true of some particular QOS-related router algorithm. 50 ms. is about all one can tolerate in any router between source and destination for today's networks - an upper-bound rather than a minimum. The optimum buffer state for throughput is 1-2 packets worth - in other words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the input queue to the lowest speed link along the path) should have this much actually buffered. Buffering more than this increases end-to-end latency beyond its optimal state. Increased end-to-end latency reduces the effectiveness of control loops, creating more congestion. The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows where the bursts might persist for 50 ms. and then die. One reason for this is that source nodes run operating systems that tend to release packets in bursts. That's a whole other discussion - in an ideal world, source nodes would avoid bursty packet releases by letting the control by the receiver window be "tight" timing-wise. That is, to transmit a packet immediately at the instant an ACK arrives increasing the window. This would pace the flow - current OS's tend (due to scheduling mismatches) to send bursts of packets, "catching up" on sending that could have been spaced out and done earlier if the feedback from the receiver's window advancing were heeded. That is, endpoint network stacks (TCP implementations) can worsen congestion by "dallying". The ideal end-to-end flows occupying a congested router would have their packets paced so that the packets end up being sent in the least bursty manner that an application can support. 
The effect of this pacing is to move the "backlog" for each flow quickly into the source node for that flow, which then provides back pressure on the application driving the flow, which ultimately is necessary to stanch congestion. The ideal congestion control mechanism slows the sender part of the application to a pace that can go through the network without contributing to buffering. Pacing is in Linux 3.12(?). How long it will take to see widespread deployment is another question, and as for other operating systems, who knows. See: [https://lwn.net/Articles/564978/](https://lwn.net/Articles/564978/) Current network stacks (including Linux's) don't achieve that goal - their pushback on application sources is minimal - instead they accumulate buffering internal to the network implementation. This is much, much less true than it once was. There have been substantial changes in the Linux TCP stack in the last year or two, to avoid generating packets before necessary. Again, how long it will take for people to deploy this on Linux (and implement on other OS's) is a question. This contributes to end-to-end latency as well. But if you think about it, this is almost as bad as switch-level bufferbloat in terms of degrading user experience. The reason I say "almost" is that there are tools, rarely used in practice, that allow an application to specify that buffering should not build up in the network stack (in the kernel or wherever it is). But the default is not to use those APIs, and to buffer way too much. Remember, the network send stack can act similarly to a congested switch (it is a switch among all the user applications running on that node). IF there is a heavy file transfer, the file transfer's buffering acts to increase latency for all other networked communications on that machine. Traditionally this problem has been thought of only as a within-node fairness issue, but in fact it has a big effect on the switches in between source and destination due to the lack of dispersed pacing of the packets at the source - in other words, the current design does nothing to stem the "burst groups" from a single source mentioned above. So we do need the source nodes to implement less "bursty" sending stacks. This is especially true for multiplexed source nodes, such as web servers implementing thousands of flows. A combination of codel-style switch-level buffer management and the stack at the sender being implemented to spread packets in a particular TCP flow out over time would improve things a lot. To achieve best throughput, the optimal way to spread packets out on an end-to-end basis is to update the receive window (sending ACK) at the receive end as quickly as possible, and to respond to the updated receive window as quickly as possible when it increases. Just like the "bufferbloat" issue, the problem is caused by applications like streaming video, file transfers and big web pages that the application programmer sees as not having a latency requirement within the flow, so the application programmer does not have an incentive to control pacing. Thus the operating system has got to push back on the applications' flow somehow, so that the flow ends up paced once it enters the Internet itself. So there's no real problem caused by large buffering in the network stack at the endpoint, as long as the stack's delivery to the Internet is paced by some mechanism, e.g. tight management of receive window control on an end-to-end basis. 
I don't think this can be fixed by cerowrt, so this is out of place here. It's partially ameliorated by cerowrt, if it aggressively drops packets from flows that burst without pacing. fq_codel does this, if the buffer size it aims for is small - but the problem is that the OS stacks don't respond by pacing... they tend to respond by bursting, not because TCP doesn't provide the mechanisms for pacing, but because the OS stack doesn't transmit as soon as it is allowed to - thus building up a burst unnecessarily. Bursts on a flow are thus bad in general. They make congestion happen when it need not. By far the biggest headache is what the Web does to the network. It has turned the web into a burst generator. A typical web page may have 10 (or even more images). See the "connections per page" plot in the link below. A browser downloads the base page, and then, over N connections, essentially simultaneously downloads those embedded objects. Many/most of them are small in size (4-10 packets). You never even get near slow start. So you get an IW amount of data/TCP connection, with no pacing, and no congestion avoidance. It is easy to observe 50-100 packets (or more) back to back at the bottleneck. This is (in practice) the amount you have to buffer today: that burst of packets from a web page. Without flow queuing, you are screwed. With it, it's annoying, but can be tolerated. I go over this is detail in: [http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/](http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/) So far, I don't believe anyone has tried pacing the IW burst of packets. I'd certainly like to see that, but pacing needs to be across TCP connections (host pairs) to be possibly effective to outwit the gaming the web has done to the network. - Jim On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <[swmike@swm.pp.se](mailto:swmike@swm.pp.se)> said: > On Sun, 25 May 2014, Dane Medic wrote: > > > Is it true that devices with less than 64 MB can't handle QOS? -> > > [https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html](https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html) > > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. > > I also don't see why performance and memory size would be relevant, I'd > say forwarding performance has more to do with CPU speed than anything > else. > > -- > Mikael Abrahamsson email: [swmike@swm.pp.se](mailto:swmike@swm.pp.se) > _______________________________________________ > Cerowrt-devel mailing list > [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) > [https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) > _______________________________________________ Cerowrt-devel mailing list [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) [https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) [-- Attachment #2: Type: text/html, Size: 15178 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
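Since flow queuing keeps coming up (fq_codel above, and "without flow queuing, you are screwed" earlier in the thread), here is a deliberately tiny sketch of the flow-queuing idea alone - per-flow queues served round-robin - with none of fq_codel's CoDel dropping or new/old-flow logic; an illustration, not the real algorithm:

from collections import defaultdict, deque

class ToyFlowQueue:
    def __init__(self):
        self.queues = defaultdict(deque)   # flow id -> queued packets
        self.rotation = deque()            # round-robin order of backlogged flows

    def enqueue(self, flow_id, packet):
        if not self.queues[flow_id]:
            self.rotation.append(flow_id)
        self.queues[flow_id].append(packet)

    def dequeue(self):
        while self.rotation:
            flow = self.rotation.popleft()
            q = self.queues[flow]
            if q:
                packet = q.popleft()
                if q:                      # still backlogged: rejoin the rotation
                    self.rotation.append(flow)
                return packet
        return None

# A 100-packet unpaced web burst on one flow and a single packet on another:
fq = ToyFlowQueue()
for i in range(100):
    fq.enqueue("web", ("web", i))
fq.enqueue("voip", ("voip", 0))
print(fq.dequeue(), fq.dequeue())   # ('web', 0) ('voip', 0) - the voip packet
                                    # waits behind one packet, not a hundred.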
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-28 15:20 ` dpreed @ 2014-05-28 18:33 ` David Lang 2014-05-29 12:11 ` David P. Reed 0 siblings, 1 reply; 30+ messages in thread From: David Lang @ 2014-05-28 18:33 UTC (permalink / raw) To: dpreed; +Cc: cerowrt-devel [-- Attachment #1: Type: TEXT/Plain, Size: 10975 bytes --] On Wed, 28 May 2014, dpreed@reed.com wrote: > I did not mean that "pacing". Sorry I used a generic term. I meant what my > longer description described - a specific mechanism for reducing bunching that > is essentially "cooperative" among all active flows through a bottlenecked > link. That's part of a "closed loop" control system driving each TCP endpoint > into a cooperative mode. How do you think we can get feedback from the bottleneck node to all the different senders? What happens to the ones who try to play nice if one doesn't - including what happens if one isn't just ignorant of the new cooperative mode, but actively tries to cheat? (as I understand it, this is the fatal flaw in many of the past buffering improvement proposals) While the in-house router is the first bottleneck that user's traffic hits, the bigger problems happen when the bottleneck is in the peering between ISPs, many hops away from any sender, with many different senders competing for the available bandwidth. This is where the new buffering approaches win. If the traffic is below the congestion level, they add very close to zero overhead, but when congestion happens, they manage the resulting buffers in a way that works better for people (allowing short, fast connections to be fast with only a small impact on very long connections). David Lang > The thing you call "pacing" is something quite different. It is disconnected > from the TCP control loops involved, which basically means it is flying blind. > Introducing that kind of "pacing" almost certainly reduces throughput, because > it *delays* packets. > > The thing I called "pacing" is in no version of Linux that I know of. Give it > a different name: "anti-bunching cooperation" or "timing phase management for > congestion reduction". Rather than *delaying* packets, it tries to get packets > to avoid bunching only when reducing window size, and doing so by tightening > the control loop so that the sender transmits as *soon* as it can, not by > delaying sending after the sender dallies around not sending when it can. > > > > > > > > On Tuesday, May 27, 2014 11:23am, "Jim Gettys" <jg@freedesktop.org> said: > > > > > > > > On Sun, May 25, 2014 at 4:00 PM, <[dpreed@reed.com](mailto:dpreed@reed.com)> wrote: > > Not that it is directly relevant, but there is no essential reason to require 50 ms. of buffering. That might be true of some particular QOS-related router algorithm. 50 ms. is about all one can tolerate in any router between source and destination for today's networks - an upper-bound rather than a minimum. > > The optimum buffer state for throughput is 1-2 packets worth - in other words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the input queue to the lowest speed link along the path) should have this much actually buffered. Buffering more than this increases end-to-end latency beyond its optimal state. Increased end-to-end latency reduces the effectiveness of control loops, creating more congestion. > > The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows where the bursts might persist for 50 ms. and then die. 
One reason for this is that source nodes run operating systems that tend to release packets in bursts. That's a whole other discussion - in an ideal world, source nodes would avoid bursty packet releases by letting the control by the receiver window be "tight" timing-wise. That is, to transmit a packet immediately at the instant an ACK arrives increasing the window. This would pace the flow - current OS's tend (due to scheduling mismatches) to send bursts of packets, "catching up" on sending that could have been spaced out and done earlier if the feedback from the receiver's window advancing were heeded. > > > > That is, endpoint network stacks (TCP implementations) can worsen congestion by "dallying". The ideal end-to-end flows occupying a congested router would have their packets paced so that the packets end up being sent in the least bursty manner that an application can support. The effect of this pacing is to move the "backlog" for each flow quickly into the source node for that flow, which then provides back pressure on the application driving the flow, which ultimately is necessary to stanch congestion. The ideal congestion control mechanism slows the sender part of the application to a pace that can go through the network without contributing to buffering. > > Pacing is in Linux 3.12(?). How long it will take to see widespread deployment is another question, and as for other operating systems, who knows. > See: [https://lwn.net/Articles/564978/](https://lwn.net/Articles/564978/) > > > Current network stacks (including Linux's) don't achieve that goal - their pushback on application sources is minimal - instead they accumulate buffering internal to the network implementation. > This is much, much less true than it once was. There have been substantial changes in the Linux TCP stack in the last year or two, to avoid generating packets before necessary. Again, how long it will take for people to deploy this on Linux (and implement on other OS's) is a question. > > This contributes to end-to-end latency as well. But if you think about it, this is almost as bad as switch-level bufferbloat in terms of degrading user experience. The reason I say "almost" is that there are tools, rarely used in practice, that allow an application to specify that buffering should not build up in the network stack (in the kernel or wherever it is). But the default is not to use those APIs, and to buffer way too much. > > Remember, the network send stack can act similarly to a congested switch (it is a switch among all the user applications running on that node). IF there is a heavy file transfer, the file transfer's buffering acts to increase latency for all other networked communications on that machine. > > Traditionally this problem has been thought of only as a within-node fairness issue, but in fact it has a big effect on the switches in between source and destination due to the lack of dispersed pacing of the packets at the source - in other words, the current design does nothing to stem the "burst groups" from a single source mentioned above. > > So we do need the source nodes to implement less "bursty" sending stacks. This is especially true for multiplexed source nodes, such as web servers implementing thousands of flows. > > A combination of codel-style switch-level buffer management and the stack at the sender being implemented to spread packets in a particular TCP flow out over time would improve things a lot. 
To achieve best throughput, the optimal way to spread packets out on an end-to-end basis is to update the receive window (sending ACK) at the receive end as quickly as possible, and to respond to the updated receive window as quickly as possible when it increases. > > Just like the "bufferbloat" issue, the problem is caused by applications like streaming video, file transfers and big web pages that the application programmer sees as not having a latency requirement within the flow, so the application programmer does not have an incentive to control pacing. Thus the operating system has got to push back on the applications' flow somehow, so that the flow ends up paced once it enters the Internet itself. So there's no real problem caused by large buffering in the network stack at the endpoint, as long as the stack's delivery to the Internet is paced by some mechanism, e.g. tight management of receive window control on an end-to-end basis. > > I don't think this can be fixed by cerowrt, so this is out of place here. It's partially ameliorated by cerowrt, if it aggressively drops packets from flows that burst without pacing. fq_codel does this, if the buffer size it aims for is small - but the problem is that the OS stacks don't respond by pacing... they tend to respond by bursting, not because TCP doesn't provide the mechanisms for pacing, but because the OS stack doesn't transmit as soon as it is allowed to - thus building up a burst unnecessarily. > > Bursts on a flow are thus bad in general. They make congestion happen when it need not. > By far the biggest headache is what the Web does to the network. It has turned the web into a burst generator. > A typical web page may have 10 (or even more images). See the "connections per page" plot in the link below. > A browser downloads the base page, and then, over N connections, essentially simultaneously downloads those embedded objects. Many/most of them are small in size (4-10 packets). You never even get near slow start. > So you get an IW amount of data/TCP connection, with no pacing, and no congestion avoidance. It is easy to observe 50-100 packets (or more) back to back at the bottleneck. > This is (in practice) the amount you have to buffer today: that burst of packets from a web page. Without flow queuing, you are screwed. With it, it's annoying, but can be tolerated. > I go over this is detail in: > > [http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/](http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/) > So far, I don't believe anyone has tried pacing the IW burst of packets. I'd certainly like to see that, but pacing needs to be across TCP connections (host pairs) to be possibly effective to outwit the gaming the web has done to the network. > - Jim > > > > > > > > > On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <[swmike@swm.pp.se](mailto:swmike@swm.pp.se)> said: > > > >> On Sun, 25 May 2014, Dane Medic wrote: >> >> > Is it true that devices with less than 64 MB can't handle QOS? -> >> > [https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html](https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html) > > >> At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = >> 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. 
>> >> I also don't see why performance and memory size would be relevant, I'd > > say forwarding performance has more to do with CPU speed than anything >> else. >> >> -- >> Mikael Abrahamsson email: [swmike@swm.pp.se](mailto:swmike@swm.pp.se) > > _______________________________________________ >> Cerowrt-devel mailing list >> [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) >> [https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) > > > _______________________________________________ > Cerowrt-devel mailing list > [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) > [https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) [-- Attachment #2: Type: TEXT/PLAIN, Size: 164 bytes --] _______________________________________________ Cerowrt-devel mailing list Cerowrt-devel@lists.bufferbloat.net https://lists.bufferbloat.net/listinfo/cerowrt-devel ^ permalink raw reply [flat|nested] 30+ messages in thread
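The property Lang describes above, short flows staying fast while long flows share what is left, comes from per-flow queueing rather than from any particular AQM. A toy sketch of just that flow-isolation step (this is not fq_codel itself, and the flow labels are invented for the example):

```python
from collections import defaultdict, deque

# Toy per-flow round-robin dequeue: packets sit in per-flow queues and
# one packet per active flow is released each round.
def fq_dequeue_order(arrivals):
    queues = defaultdict(deque)   # flow label -> queued packets
    order = []                    # flows in order of first appearance
    for flow in arrivals:
        if flow not in queues:
            order.append(flow)
        queues[flow].append(flow)
    out = []
    while any(queues.values()):
        for flow in order:        # one packet per flow per round
            if queues[flow]:
                out.append(queues[flow].popleft())
    return "".join(out)

# A 20-packet bulk flow "B" arrives ahead of a 3-packet web fetch "w":
print(fq_dequeue_order("B" * 20 + "w" * 3))
# -> BwBwBwBBB... : the short flow drains after 6 departures instead of 23.
```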
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-28 18:33 ` David Lang @ 2014-05-29 12:11 ` David P. Reed 2014-05-29 15:29 ` dpreed 2014-05-29 23:40 ` Michael Richardson 0 siblings, 2 replies; 30+ messages in thread From: David P. Reed @ 2014-05-29 12:11 UTC (permalink / raw) To: David Lang; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 12745 bytes --] ECN-style signaling has the right properties ... just like TTL it can provide valid and current sampling of the packet ' s environment as it travels. The idea is to sample what is happening at a bottleneck for the packet ' s flow. The bottleneck is the link with the most likelihood of a collision from flows sharing that link. A control - theoretic estimator of recent collision likelihood is easy to do at each queue. All active flows would receive that signal, with the busiest ones getting it most quickly. Also it is reasonable to count all potentially colliding flows at all outbound queues, and report that. The estimator can then provide the signal that each flow responds to. The problem of "defectors" is best dealt with by punishment... An aggressive packet drop policy that makes causing congestion reduce the cause's throughput and increases latency is the best kind of answer. Since the router can remember recent flow behavior, it can penalize recent flows. A Bloom style filter can remember flow statistics for both of these local policies. A great use for the memory no longer misapplied to buffering.... Simple? On May 28, 2014, David Lang <david@lang.hm> wrote: >On Wed, 28 May 2014, dpreed@reed.com wrote: > >> I did not mean that "pacing". Sorry I used a generic term. I meant >what my >> longer description described - a specific mechanism for reducing >bunching that >> is essentially "cooperative" among all active flows through a >bottlenecked >> link. That's part of a "closed loop" control system driving each TCP >endpoint >> into a cooperative mode. > >how do you think we can get feedback from the bottleneck node to all >the >different senders? > >what happens to the ones who try to play nice if one doesn't?, >including what >happens if one isn't just ignorant of the new cooperative mode, but >activly >tries to cheat? (as I understand it, this is the fatal flaw in many of >the past >buffering improvement proposals) > >While the in-house router is the first bottleneck that user's traffic >hits, the >bigger problems happen when the bottleneck is in the peering between >ISPs, many >hops away from any sender, with many different senders competing for >the >avialable bandwidth. > >This is where the new buffering approaches win. If the traffic is below >the >congestion level, they add very close to zero overhead, but when >congestion >happens, they manage the resulting buffers in a way that's works better >for >people (allowing short, fast connections to be fast with only a small >impact on >very long connections) > >David Lang > >> The thing you call "pacing" is something quite different. It is >disconnected >> from the TCP control loops involved, which basically means it is >flying blind. >> Introducing that kind of "pacing" almost certainly reduces >throughput, because >> it *delays* packets. >> >> The thing I called "pacing" is in no version of Linux that I know of. > Give it >> a different name: "anti-bunching cooperation" or "timing phase >management for >> congestion reduction". 
Rather than *delaying* packets, it tries to >get packets >> to avoid bunching only when reducing window size, and doing so by >tightening >> the control loop so that the sender transmits as *soon* as it can, >not by >> delaying sending after the sender dallies around not sending when it >can. >> >> >> >> >> >> >> >> On Tuesday, May 27, 2014 11:23am, "Jim Gettys" <jg@freedesktop.org> >said: >> >> >> >> >> >> >> >> On Sun, May 25, 2014 at 4:00 PM, ><[dpreed@reed.com](mailto:dpreed@reed.com)> wrote: >> >> Not that it is directly relevant, but there is no essential reason to >require 50 ms. of buffering. That might be true of some particular >QOS-related router algorithm. 50 ms. is about all one can tolerate in >any router between source and destination for today's networks - an >upper-bound rather than a minimum. >> >> The optimum buffer state for throughput is 1-2 packets worth - in >other words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the >bottleneck buffer (the input queue to the lowest speed link along the >path) should have this much actually buffered. Buffering more than this >increases end-to-end latency beyond its optimal state. Increased >end-to-end latency reduces the effectiveness of control loops, creating >more congestion. >> >> The rationale for having 50 ms. of buffering is probably to avoid >disruption of bursty mixed flows where the bursts might persist for 50 >ms. and then die. One reason for this is that source nodes run >operating systems that tend to release packets in bursts. That's a >whole other discussion - in an ideal world, source nodes would avoid >bursty packet releases by letting the control by the receiver window be >"tight" timing-wise. That is, to transmit a packet immediately at the >instant an ACK arrives increasing the window. This would pace the flow >- current OS's tend (due to scheduling mismatches) to send bursts of >packets, "catching up" on sending that could have been spaced out and >done earlier if the feedback from the receiver's window advancing were >heeded. >> >> >> >> That is, endpoint network stacks (TCP implementations) can worsen >congestion by "dallying". The ideal end-to-end flows occupying a >congested router would have their packets paced so that the packets end >up being sent in the least bursty manner that an application can >support. The effect of this pacing is to move the "backlog" for each >flow quickly into the source node for that flow, which then provides >back pressure on the application driving the flow, which ultimately is >necessary to stanch congestion. The ideal congestion control mechanism >slows the sender part of the application to a pace that can go through >the network without contributing to buffering. >> >> Pacing is in Linux 3.12(?). How long it will take to see widespread >deployment is another question, and as for other operating systems, who >knows. >> See: >[https://lwn.net/Articles/564978/](https://lwn.net/Articles/564978/) >> >> >> Current network stacks (including Linux's) don't achieve that goal - >their pushback on application sources is minimal - instead they >accumulate buffering internal to the network implementation. >> This is much, much less true than it once was. There have been >substantial changes in the Linux TCP stack in the last year or two, to >avoid generating packets before necessary. Again, how long it will >take for people to deploy this on Linux (and implement on other OS's) >is a question. >> >> This contributes to end-to-end latency as well. 
But if you think >about it, this is almost as bad as switch-level bufferbloat in terms of >degrading user experience. The reason I say "almost" is that there are >tools, rarely used in practice, that allow an application to specify >that buffering should not build up in the network stack (in the kernel >or wherever it is). But the default is not to use those APIs, and to >buffer way too much. >> >> Remember, the network send stack can act similarly to a congested >switch (it is a switch among all the user applications running on that >node). IF there is a heavy file transfer, the file transfer's >buffering acts to increase latency for all other networked >communications on that machine. >> >> Traditionally this problem has been thought of only as a within-node >fairness issue, but in fact it has a big effect on the switches in >between source and destination due to the lack of dispersed pacing of >the packets at the source - in other words, the current design does >nothing to stem the "burst groups" from a single source mentioned >above. >> >> So we do need the source nodes to implement less "bursty" sending >stacks. This is especially true for multiplexed source nodes, such as >web servers implementing thousands of flows. >> >> A combination of codel-style switch-level buffer management and the >stack at the sender being implemented to spread packets in a particular >TCP flow out over time would improve things a lot. To achieve best >throughput, the optimal way to spread packets out on an end-to-end >basis is to update the receive window (sending ACK) at the receive end >as quickly as possible, and to respond to the updated receive window as >quickly as possible when it increases. >> >> Just like the "bufferbloat" issue, the problem is caused by >applications like streaming video, file transfers and big web pages >that the application programmer sees as not having a latency >requirement within the flow, so the application programmer does not >have an incentive to control pacing. Thus the operating system has got >to push back on the applications' flow somehow, so that the flow ends >up paced once it enters the Internet itself. So there's no real >problem caused by large buffering in the network stack at the endpoint, >as long as the stack's delivery to the Internet is paced by some >mechanism, e.g. tight management of receive window control on an >end-to-end basis. >> >> I don't think this can be fixed by cerowrt, so this is out of place >here. It's partially ameliorated by cerowrt, if it aggressively drops >packets from flows that burst without pacing. fq_codel does this, if >the buffer size it aims for is small - but the problem is that the OS >stacks don't respond by pacing... they tend to respond by bursting, not >because TCP doesn't provide the mechanisms for pacing, but because the >OS stack doesn't transmit as soon as it is allowed to - thus building >up a burst unnecessarily. >> >> Bursts on a flow are thus bad in general. They make congestion >happen when it need not. >> By far the biggest headache is what the Web does to the network. It >has turned the web into a burst generator. >> A typical web page may have 10 (or even more images). See the >"connections per page" plot in the link below. >> A browser downloads the base page, and then, over N connections, >essentially simultaneously downloads those embedded objects. Many/most >of them are small in size (4-10 packets). You never even get near slow >start. 
>> So you get an IW amount of data/TCP connection, with no pacing, and >no congestion avoidance. It is easy to observe 50-100 packets (or >more) back to back at the bottleneck. >> This is (in practice) the amount you have to buffer today: that burst >of packets from a web page. Without flow queuing, you are screwed. >With it, it's annoying, but can be tolerated. >> I go over this is detail in: >> >> >[http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/](http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/) >> So far, I don't believe anyone has tried pacing the IW burst of >packets. I'd certainly like to see that, but pacing needs to be across >TCP connections (host pairs) to be possibly effective to outwit the >gaming the web has done to the network. >> - Jim >> >> >> >> >> >> >> >> >> On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" ><[swmike@swm.pp.se](mailto:swmike@swm.pp.se)> said: >> >> >> >>> On Sun, 25 May 2014, Dane Medic wrote: >>> >>> > Is it true that devices with less than 64 MB can't handle QOS? -> >>> > >[https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html](https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html) >> > >>> At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = >>> 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. >>> >>> I also don't see why performance and memory size would be relevant, >I'd >> > say forwarding performance has more to do with CPU speed than >anything >>> else. >>> >>> -- >>> Mikael Abrahamsson email: >[swmike@swm.pp.se](mailto:swmike@swm.pp.se) >> > _______________________________________________ >>> Cerowrt-devel mailing list >>> >[Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) >>> >[https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) >> > >> _______________________________________________ >> Cerowrt-devel mailing list >> >[Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) >> >[https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) > >------------------------------------------------------------------------ > >_______________________________________________ >Cerowrt-devel mailing list >Cerowrt-devel@lists.bufferbloat.net >https://lists.bufferbloat.net/listinfo/cerowrt-devel -- Sent from my Android device with K-@ Mail. Please excuse my brevity. [-- Attachment #2: Type: text/html, Size: 14624 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
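One way to read the "Bloom style filter" suggestion above is a small counting sketch indexed by a hash of the flow: congested arrivals increment a flow's slots, a periodic decay forgets old behavior, and the minimum slot value drives a drop probability. Everything below (table size, hash count, decay rate, scaling) is invented for illustration and is not a design from the thread.

```python
import hashlib

SLOTS, HASHES, DECAY = 4096, 3, 0.99      # illustrative parameters only
counts = [0.0] * SLOTS                    # fixed-size memory for recent flow behavior

def _slots(flow_key):
    d = hashlib.sha256(flow_key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % SLOTS for i in range(HASHES)]

def note_congestion(flow_key):
    """Charge a flow when one of its packets arrives at a congested queue."""
    for i in _slots(flow_key):
        counts[i] += 1.0

def drop_probability(flow_key, scale=50.0):
    """The minimum over a flow's slots bounds its recent congestion share."""
    return min(1.0, min(counts[i] for i in _slots(flow_key)) / scale)

def decay():
    """Run periodically so only recent behavior is penalized."""
    for i in range(SLOTS):
        counts[i] *= DECAY
```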
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-29 12:11 ` David P. Reed @ 2014-05-29 15:29 ` dpreed 2014-05-29 19:30 ` David Lang 2014-05-29 23:40 ` Michael Richardson 1 sibling, 1 reply; 30+ messages in thread From: dpreed @ 2014-05-29 15:29 UTC (permalink / raw) To: David P. Reed; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 14006 bytes --] Note: this is all about "how to achieve and sustain the ballistic phase that is optimal for Internet transport" in an end-to-end based control system like TCP. I think those who have followed this know that, but I want to make it clear that I'm proposing a significant improvement that requires changes at the OS stacks and changes in the switches' approach to congestion signaling. There are ways to phase it in gradually. In "meshes", etc. it could probably be developed and deployed more quickly - but my thoughts on co-existence with the current TCP stacks and current IP routers are far less precisely worked out. I am way too busy with my day job to do what needs to be done ... but my sense is that the folks who reduce this to practice will make a HUGE difference to Internet performance. Bigger than getting bloat fixed, and to me that is a major, major potential triumph. On Thursday, May 29, 2014 8:11am, "David P. Reed" <dpreed@reed.com> said: ECN-style signaling has the right properties ... just like TTL it can provide valid and current sampling of the packet's environment as it travels. The idea is to sample what is happening at a bottleneck for the packet's flow. The bottleneck is the link with the most likelihood of a collision from flows sharing that link. A control-theoretic estimator of recent collision likelihood is easy to do at each queue. All active flows would receive that signal, with the busiest ones getting it most quickly. Also it is reasonable to count all potentially colliding flows at all outbound queues, and report that. The estimator can then provide the signal that each flow responds to. The problem of "defectors" is best dealt with by punishment... An aggressive packet drop policy that makes causing congestion reduce the cause's throughput and increases latency is the best kind of answer. Since the router can remember recent flow behavior, it can penalize recent flows. A Bloom style filter can remember flow statistics for both of these local policies. A great use for the memory no longer misapplied to buffering.... Simple? On May 28, 2014, David Lang <david@lang.hm> wrote: On Wed, 28 May 2014, dpreed@reed.com wrote: I did not mean that "pacing". Sorry I used a generic term. I meant what my longer description described - a specific mechanism for reducing bunching that is essentially "cooperative" among all active flows through a bottlenecked link. That's part of a "closed loop" control system driving each TCP endpoint into a cooperative mode. how do you think we can get feedback from the bottleneck node to all the different senders? what happens to the ones who try to play nice if one doesn't, including what happens if one isn't just ignorant of the new cooperative mode, but actively tries to cheat? (as I understand it, this is the fatal flaw in many of the past buffering improvement proposals) While the in-house router is the first bottleneck that a user's traffic hits, the bigger problems happen when the bottleneck is in the peering between ISPs, many hops away from any sender, with many different senders competing for the available bandwidth. This is where the new buffering approaches win. If the traffic is below the congestion level, they add very close to zero overhead, but when congestion happens, they manage the resulting buffers in a way that works better for people (allowing short, fast connections to be fast with only a small impact on very long connections) David Lang The thing you call "pacing" is something quite different. It is disconnected from the TCP control loops involved, which basically means it is flying blind. Introducing that kind of "pacing" almost certainly reduces throughput, because it *delays* packets. The thing I called "pacing" is in no version of Linux that I know of. Give it a different name: "anti-bunching cooperation" or "timing phase management for congestion reduction". Rather than *delaying* packets, it tries to get packets to avoid bunching only when reducing window size, and doing so by tightening the control loop so that the sender transmits as *soon* as it can, not by delaying sending after the sender dallies around not sending when it can. On Tuesday, May 27, 2014 11:23am, "Jim Gettys" <jg@freedesktop.org> said: On Sun, May 25, 2014 at 4:00 PM, <[dpreed@reed.com](mailto:dpreed@reed.com)> wrote: Not that it is directly relevant, but there is no essential reason to require 50 ms. of buffering. That might be true of some particular QOS-related router algorithm. 50 ms. is about all one can tolerate in any router between source and destination for today's networks - an upper-bound rather than a minimum. The optimum buffer state for throughput is 1-2 packets worth - in other words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the input queue to the lowest speed link along the path) should have this much actually buffered. Buffering more than this increases end-to-end latency beyond its optimal state. Increased end-to-end latency reduces the effectiveness of control loops, creating more congestion. The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows where the bursts might persist for 50 ms. and then die. One reason for this is that source nodes run operating systems that tend to release packets in bursts. That's a whole other discussion - in an ideal world, source nodes would avoid bursty packet releases by letting the control by the receiver window be "tight" timing-wise. That is, to transmit a packet immediately at the instant an ACK arrives increasing the window. This would pace the flow - current OS's tend (due to scheduling mismatches) to send bursts of packets, "catching up" on sending that could have been spaced out and done earlier if the feedback from the receiver's window advancing were heeded. That is, endpoint network stacks (TCP implementations) can worsen congestion by "dallying". The ideal end-to-end flows occupying a congested router would have their packets paced so that the packets end up being sent in the least bursty manner that an application can support. The effect of this pacing is to move the "backlog" for each flow quickly into the source node for that flow, which then provides back pressure on the application driving the flow, which ultimately is necessary to stanch congestion. The ideal congestion control mechanism slows the sender part of the application to a pace that can go through the network without contributing to buffering. Pacing is in Linux 3.12(?). How long it will take to see widespread deployment is another question, and as for other operating systems, who knows. See: [https://lwn.net/Articles/564978/](https://lwn.net/Articles/564978/) Current network stacks (including Linux's) don't achieve that goal - their pushback on application sources is minimal - instead they accumulate buffering internal to the network implementation. This is much, much less true than it once was. There have been substantial changes in the Linux TCP stack in the last year or two, to avoid generating packets before necessary. Again, how long it will take for people to deploy this on Linux (and implement on other OS's) is a question. This contributes to end-to-end latency as well. But if you think about it, this is almost as bad as switch-level bufferbloat in terms of degrading user experience. The reason I say "almost" is that there are tools, rarely used in practice, that allow an application to specify that buffering should not build up in the network stack (in the kernel or wherever it is). But the default is not to use those APIs, and to buffer way too much. Remember, the network send stack can act similarly to a congested switch (it is a switch among all the user applications running on that node). IF there is a heavy file transfer, the file transfer's buffering acts to increase latency for all other networked communications on that machine. Traditionally this problem has been thought of only as a within-node fairness issue, but in fact it has a big effect on the switches in between source and destination due to the lack of dispersed pacing of the packets at the source - in other words, the current design does nothing to stem the "burst groups" from a single source mentioned above. So we do need the source nodes to implement less "bursty" sending stacks. This is especially true for multiplexed source nodes, such as web servers implementing thousands of flows. A combination of codel-style switch-level buffer management and the stack at the sender being implemented to spread packets in a particular TCP flow out over time would improve things a lot. To achieve best throughput, the optimal way to spread packets out on an end-to-end basis is to update the receive window (sending ACK) at the receive end as quickly as possible, and to respond to the updated receive window as quickly as possible when it increases. Just like the "bufferbloat" issue, the problem is caused by applications like streaming video, file transfers and big web pages that the application programmer sees as not having a latency requirement within the flow, so the application programmer does not have an incentive to control pacing. Thus the operating system has got to push back on the applications' flow somehow, so that the flow ends up paced once it enters the Internet itself. So there's no real problem caused by large buffering in the network stack at the endpoint, as long as the stack's delivery to the Internet is paced by some mechanism, e.g. tight management of receive window control on an end-to-end basis. I don't think this can be fixed by cerowrt, so this is out of place here. It's partially ameliorated by cerowrt, if it aggressively drops packets from flows that burst without pacing. fq_codel does this, if the buffer size it aims for is small - but the problem is that the OS stacks don't respond by pacing... they tend to respond by bursting, not because TCP doesn't provide the mechanisms for pacing, but because the OS stack doesn't transmit as soon as it is allowed to - thus building up a burst unnecessarily. Bursts on a flow are thus bad in general. They make congestion happen when it need not. By far the biggest headache is what the Web does to the network. It has turned the web into a burst generator. A typical web page may have 10 (or even more) images. See the "connections per page" plot in the link below. A browser downloads the base page, and then, over N connections, essentially simultaneously downloads those embedded objects. Many/most of them are small in size (4-10 packets). You never even get near slow start. So you get an IW amount of data/TCP connection, with no pacing, and no congestion avoidance. It is easy to observe 50-100 packets (or more) back to back at the bottleneck. This is (in practice) the amount you have to buffer today: that burst of packets from a web page. Without flow queuing, you are screwed. With it, it's annoying, but can be tolerated. I go over this in detail in: [http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/](http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/) So far, I don't believe anyone has tried pacing the IW burst of packets. I'd certainly like to see that, but pacing needs to be across TCP connections (host pairs) to be possibly effective to outwit the gaming the web has done to the network. - Jim On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <[swmike@swm.pp.se](mailto:swmike@swm.pp.se)> said: On Sun, 25 May 2014, Dane Medic wrote: Is it true that devices with less than 64 MB can't handle QOS? -> [https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html](https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html) At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. I also don't see why performance and memory size would be relevant, I'd say forwarding performance has more to do with CPU speed than anything else. -- Mikael Abrahamsson email: [swmike@swm.pp.se](mailto:swmike@swm.pp.se) Cerowrt-devel mailing list [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) [https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) Cerowrt-devel mailing list [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) [https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) Cerowrt-devel mailing list Cerowrt-devel@lists.bufferbloat.net [https://lists.bufferbloat.net/listinfo/cerowrt-devel](https://lists.bufferbloat.net/listinfo/cerowrt-devel) -- Sent from my Android device with [K-@ Mail](https://play.google.com/store/apps/details?id=com.onegravity.k10.pro2). Please excuse my brevity.
[-- Attachment #2: Type: text/html, Size: 16145 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-29 15:29 ` dpreed @ 2014-05-29 19:30 ` David Lang 0 siblings, 0 replies; 30+ messages in thread From: David Lang @ 2014-05-29 19:30 UTC (permalink / raw) To: David P. Reed; +Cc: cerowrt-devel [-- Attachment #1: Type: TEXT/PLAIN, Size: 14602 bytes --] The problem is that without co-existing well with existing stacks (and especially misbehaving stacks), you are not talking about something that will ever be able to be used in real life. Unless I am mixing things up, RED and its variants are a perfect example of this. If everyone on the network is using them, they work pretty well, but when someone isn't (or decides to "cheat"), it becomes very unfair in favor of the non-complying system. David Lang On Thu, 29 May 2014, dpreed@reed.com wrote: > Note: this is all about "how to achieve and sustain the ballistic phase that is optimal for Internet transport" in an end-to-end based control system like TCP. > > I think those who have followed this know that, but I want to make it clear that I'm proposing a significant improvement that requires changes at the OS stacks and changes in the switches' approach to congestion signaling. There are ways to phase it in gradually. In "meshes", etc. it could probably be developed and deployed more quickly - but my thoughts on co-existence with the current TCP stacks and current IP routers are far less precisely worked out. > > I am way too busy with my day job to do what needs to be done ... but my sense is that the folks who reduce this to practice will make a HUGE difference to Internet performance. Bigger than getting bloat fixed, and to me that is a major, major potential triumph. > > > > On Thursday, May 29, 2014 8:11am, "David P. Reed" <dpreed@reed.com> said: > > > ECN-style signaling has the right properties ... just like TTL it can provide valid and current sampling of the packet's environment as it travels. The idea is to sample what is happening at a bottleneck for the packet's flow. The bottleneck is the link with the most likelihood of a collision from flows sharing that link. > > A control-theoretic estimator of recent collision likelihood is easy to do at each queue. All active flows would receive that signal, with the busiest ones getting it most quickly. Also it is reasonable to count all potentially colliding flows at all outbound queues, and report that. > > The estimator can then provide the signal that each flow responds to. > > The problem of "defectors" is best dealt with by punishment... An aggressive packet drop policy that makes causing congestion reduce the cause's throughput and increases latency is the best kind of answer. Since the router can remember recent flow behavior, it can penalize recent flows. > > A Bloom style filter can remember flow statistics for both of these local policies. A great use for the memory no longer misapplied to buffering.... > > Simple? > > > On May 28, 2014, David Lang <david@lang.hm> wrote: > On Wed, 28 May 2014, dpreed@reed.com wrote: > > I did not mean that "pacing". Sorry I used a generic term. I meant what my > longer description described - a specific mechanism for reducing bunching that > is essentially "cooperative" among all active flows through a bottlenecked > link. That's part of a "closed loop" control system driving each TCP endpoint > into a cooperative mode. > how do you think we can get feedback from the bottleneck node to all the > different senders?
> > what happens to the ones who try to play nice if one doesn't?, including what > happens if one isn't just ignorant of the new cooperative mode, but activly > tries to cheat? (as I understand it, this is the fatal flaw in many of the past > buffering improvement proposals) > > While the in-h ouserouter is the first bottleneck that user's traffic hits, the > bigger problems happen when the bottleneck is in the peering between ISPs, many > hops away from any sender, with many different senders competing for the > avialable bandwidth. > > This is where the new buffering approaches win. If the traffic is below the > congestion level, they add very close to zero overhead, but when congestion > happens, they manage the resulting buffers in a way that's works better for > people (allowing short, fast connections to be fast with only a small impact on > very long connections) > > David Lang > > The thing you call "pacing" is something quite different. It is disconnected > from the TCP control loops involved, which basically means it is flying blind. > Introducing that kind of "pacing" almost certainly reducesthroughput, because > it *delays* packets. > > The thing I called "pacing" is in no version of Linux that I know of. Give it > a different name: "anti-bunching cooperation" or "timing phase management for > congestion reduction". Rather than *delaying* packets, it tries to get packets > to avoid bunching only when reducing window size, and doing so by tightening > the control loop so that the sender transmits as *soon* as it can, not by > delaying sending after the sender dallies around not sending when it can. > > > > > > > > On Tuesday, May 27, 2014 11:23am, "Jim Gettys" <jg@freedesktop.org> said: > > > > > > > > On Sun, May 25, 2014 at 4:00 PM, <[dpreed@reed.com](mailto:dpreed@reed.com)> wrote: > > Not that it is directly relevant, but there is no essential reason to require 50 ms. of buffering. That might be true of some particular QOS-related router algorith m. 50ms. is about all one can tolerate in any router between source and destination for today's networks - an upper-bound rather than a minimum. > > The optimum buffer state for throughput is 1-2 packets worth - in other words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the input queue to the lowest speed link along the path) should have this much actually buffered. Buffering more than this increases end-to-end latency beyond its optimal state. Increased end-to-end latency reduces the effectiveness of control loops, creating more congestion. > > The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows where the bursts might persist for 50 ms. and then die. One reason for this is that source nodes run operating systems that tend to release packets in bursts. That's a whole other discussion - in an ideal world, source nodes would avoid bursty packet releases by letting the control by the receiver windowbe "tight" timing-wise. That is, to transmit a packet immediately at the instant an ACK arrives increasing the window. This would pace the flow - current OS's tend (due to scheduling mismatches) to send bursts of packets, "catching up" on sending that could have been spaced out and done earlier if the feedback from the receiver's window advancing were heeded. > > > > That is, endpoint network stacks (TCP implementations) can worsen congestion by "dallying". 
The ideal end-to-end flows occupying a congested router would have their packets paced so that the packets end up being sent in the least bursty manner that an application can support. The effect of this pacing is to move the "backlog" for each flow quickly into the source node for that flow, which then provides back pressure on the application driving the flow, which ultimately is necessary to stanch congestion. The ideal congestion control mechanism slows the sender part of the application to a pac e thatcan go through the network without contributing to buffering. > > Pacing is in Linux 3.12(?). How long it will take to see widespread deployment is another question, and as for other operating systems, who knows. > See: [[ https://lwn.net/Articles/564978 ]( https://lwn.net/Articles/564978 )/]([ https://lwn.net/Articles/564978 ]( https://lwn.net/Articles/564978 )/) > > > Current network stacks (including Linux's) don't achieve that goal - their pushback on application sources is minimal - instead they accumulate buffering internal to the network implementation. > This is much, much less true than it once was. There have been substantial changes in the Linux TCP stack in the last year or two, to avoid generating packets before necessary. Again, how long it will take for people to deploy this on Linux (and implement on other OS's) is a question. > > This contributes to end-to-end latency as well. But if you th inkabout it, this is almost as bad as switch-level bufferbloat in terms of degrading user experience. The reason I say "almost" is that there are tools, rarely used in practice, that allow an application to specify that buffering should not build up in the network stack (in the kernel or wherever it is). But the default is not to use those APIs, and to buffer way too much. > > Remember, the network send stack can act similarly to a congested switch (it is a switch among all the user applications running on that node). IF there is a heavy file transfer, the file transfer's buffering acts to increase latency for all other networked communications on that machine. > > Traditionally this problem has been thought of only as a within-node fairness issue, but in fact it has a big effect on the switches in between source and destination due to the lack of dispersed pacing of the packets at the source - in other words, the current design does nothing to stem the "burst g roups"from a single source mentioned above. > > So we do need the source nodes to implement less "bursty" sending stacks. This is especially true for multiplexed source nodes, such as web servers implementing thousands of flows. > > A combination of codel-style switch-level buffer management and the stack at the sender being implemented to spread packets in a particular TCP flow out over time would improve things a lot. To achieve best throughput, the optimal way to spread packets out on an end-to-end basis is to update the receive window (sending ACK) at the receive end as quickly as possible, and to respond to the updated receive window as quickly as possible when it increases. > > Just like the "bufferbloat" issue, the problem is caused by applications like streaming video, file transfers and big web pages that the application programmer sees as not having a latency requirement within the flow, so the application programmer does not have an incentive to co ntrolpacing. Thus the operating system has got to push back on the applications' flow somehow, so that the flow ends up paced once it enters the Internet itself. 
So there's no real problem caused by large buffering in the network stack at the endpoint, as long as the stack's delivery to the Internet is paced by some mechanism, e.g. tight management of receive window control on an end-to-end basis. > > I don't think this can be fixed by cerowrt, so this is out of place here. It's partially ameliorated by cerowrt, if it aggressively drops packets from flows that burst without pacing. fq_codel does this, if the buffer size it aims for is small - but the problem is that the OS stacks don't respond by pacing... they tend to respond by bursting, not because TCP doesn't provide the mechanisms for pacing, but because the OS stack doesn't transmit as soon as it is allowed to - thus building up a burst unnecessarily. > > Bursts on a flow are thus bad in general. They makecongestion happen when it need not. > By far the biggest headache is what the Web does to the network. It has turned the web into a burst generator. > A typical web page may have 10 (or even more images). See the "connections per page" plot in the link below. > A browser downloads the base page, and then, over N connections, essentially simultaneously downloads those embedded objects. Many/most of them are small in size (4-10 packets). You never even get near slow start. > So you get an IW amount of data/TCP connection, with no pacing, and no congestion avoidance. It is easy to observe 50-100 packets (or more) back to back at the bottleneck. > This is (in practice) the amount you have to buffer today: that burst of packets from a web page. Without flow queuing, you are screwed. With it, it's annoying, but can be tolerated. > I go over this is detail in: > > [[ http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough ]( http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough )/]([ http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough ]( http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough )/) > So far, I don't believe anyone has tried pacing the IW burst of packets. I'd certainly like to see that, but pacing needs to be across TCP connections (host pairs) to be possibly effective to outwit the gaming the web has done to the network. > - Jim > > > > > > > > > On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson" <[swmike@swm.pp.se](mailto:swmike@swm.pp.se)> said: > > > > On Sun, 25 May 2014, Dane Medic wrote: > > Is it true that devices with less than 64 MB can't handle QOS? -> > [[ https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html ]( https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html )]([ https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html ]( https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html )) > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer. > > I also don't see why performance and memory size would be relevant, I'd > say forwarding performance has more to do with CPU speed than anything > else. 
> > -- > Mikael Abrahamsson email:[swmike@swm.pp.se](mailto:swmike@swm.pp.se) > > Cerowrt-devel mailing list > [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) > [[ https://lists.bufferbloat.net/listinfo/cerowrt-devel ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel )]([ https://lists.bufferbloat.net/listinfo/cerowrt-devel ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel )) > > Cerowrt-devel mailing list > [Cerowrt-devel@lists.bufferbloat.net](mailto:Cerowrt-devel@lists.bufferbloat.net) > [[ https://lists.bufferbloat.net/listinfo/cerowrt-devel ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel )]([ https://lists.bufferbloat.net/listinfo/cerowrt-devel ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel )) > > > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > [ https://lists.bufferbloat.net/listinfo/cerowrt-devel ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel ) > > -- Sent from my Android device with [ K-@ Mail ]( https://play.google.com/store/apps/details?id=com.onegravity.k10.pro2 ). Please excuse my brevity. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-29 12:11 ` David P. Reed 2014-05-29 15:29 ` dpreed @ 2014-05-29 23:40 ` Michael Richardson 2014-05-30 0:32 ` David P. Reed 2014-05-30 0:36 ` Dave Taht 1 sibling, 2 replies; 30+ messages in thread From: Michael Richardson @ 2014-05-29 23:40 UTC (permalink / raw) To: David P. Reed; +Cc: cerowrt-devel David P. Reed <dpreed@reed.com> wrote: > ECN-style signaling has the right properties ... just like TTL it can > provide How would you send these signals? > A Bloom style filter can remember flow statistics for both of these local > policies. A great use for the memory no longer misapplied to > buffering.... Well. On the higher speed dataflow equipment, the buffer is general purpose memory, so reuse like this is particularly possible. On routers built around general purpose architectures, the limiting factor in performance is often memory throughput; adding memory rarely increases total throughput. Packet I/O is generally quite sequential and so makes good use of wide memory data paths and multiple accesses per address cycle. Updating of tables such as a Bloom filter or any other hash has a big impact due to the RMW and random access nature. All I'm saying is that quantity of memory is seldom the problem, but access to it is. I do like the entire idea; it seems that it has to be implemented at the places where the flows converge, which is often in the DSL line card, or the CMTS... -- ] Never tell me the odds! | ipv6 mesh networks [ ] Michael Richardson, Sandelman Software Works | network architect [ ] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-29 23:40 ` Michael Richardson @ 2014-05-30 0:32 ` David P. Reed 2014-05-30 0:36 ` Dave Taht 1 sibling, 0 replies; 30+ messages in thread From: David P. Reed @ 2014-05-30 0:32 UTC (permalink / raw) To: Michael Richardson; +Cc: cerowrt-devel [-- Attachment #1: Type: text/plain, Size: 1551 bytes --] Good points... On May 29, 2014, Michael Richardson <mcr@sandelman.ca> wrote: > >David P. Reed <dpreed@reed.com> wrote: >> ECN-style signaling has the right properties ... just like TTL it can > > provide > >How would you send these signals? > >> A Bloom style filter can remember flow statistics for both of these >local > > policies. A great use for the memory no longer misapplied to > > buffering.... > >Well. > >On the higher speed dataflow equipment, the buffer is general purpose >memory, >so reuse like this is particularly possible. > >On routers built around general purpose architectures, the limiting >factor >in performance is often memory throughput; adding memory rarely >increases >total throughput. Packet I/O is generally quiet sequential and so >makes >good use of wide memory data paths and multiple accesses per address >cycle. >Updating of tables such as Bloom filter or any other hash has a big >impact >due to the RMW and random access nature. > >All I'm saying is that quantity of memory is seldom the problem, but >access >to it, is. > >I do like the entire idea; it seems that it has to be implemented at >the >places where the flow converge, which is often in the DSL line card, or >CTMS... > >-- >] Never tell me the odds! | ipv6 mesh >networks [ >] Michael Richardson, Sandelman Software Works | network >architect [ >] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on >rails [ -- Sent from my Android device with K-@ Mail. Please excuse my brevity. [-- Attachment #2: Type: text/html, Size: 2303 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [Cerowrt-devel] Ubiquiti QOS 2014-05-29 23:40 ` Michael Richardson 2014-05-30 0:32 ` David P. Reed @ 2014-05-30 0:36 ` Dave Taht 1 sibling, 0 replies; 30+ messages in thread From: Dave Taht @ 2014-05-30 0:36 UTC (permalink / raw) To: Michael Richardson; +Cc: cerowrt-devel On Thu, May 29, 2014 at 4:40 PM, Michael Richardson <mcr@sandelman.ca> wrote: > > David P. Reed <dpreed@reed.com> wrote: > > ECN-style signaling has the right properties ... just like TTL it can > > provide > > How would you send these signals? > > > A Bloom style filter can remember flow statistics for both of these local > > policies. A great use for the memory no longer misapplied to > > buffering.... > > Well. > > On the higher speed dataflow equipment, the buffer is general purpose memory, > so reuse like this is particularly possible. > > On routers built around general purpose architectures, the limiting factor > in performance is often memory throughput; adding memory rarely increases > total throughput. Packet I/O is generally quiet sequential and so makes > good use of wide memory data paths and multiple accesses per address cycle. > Updating of tables such as Bloom filter or any other hash has a big impact > due to the RMW and random access nature. In hardware using a parallel memory layout makes sense. I had always envisioned the per flow fq_codel table to be on a lookaside cache, much like how mac and route lookups happen today in hw. In a general purpose architecture with fat amounts of cache (like ivy bridge) you can set aside some main cache if you like. It needent be big (64k for 1024 flows but you can shrink the structure some if you want) - and it needent be fast, just fast enough to be accessed on a per packet basis. There are other ways to do it of course. you could set it up as 1024 8 32 bit register register banks, for example, in the asic or fpga, and eliminate the concept of using ram for it entirely. This is not a lot of gates (quite a lot when you consider the invsqrt dependency in codel alone is 3k gates or so - or "free" in a FPGA with dsp multipliers) I've never thought that pure "drop head" was possible in high speed hardware - the various operations need to be pipelined, the timestamp needs to go at the head of the packet for codel to operate on them, etc, etc... > All I'm saying is that quantity of memory is seldom the problem, but access > to it, is. Concur. I keep hoping my parallela arrives. You can write your own ethernet device with that... > I do like the entire idea; it seems that it has to be implemented at the > places where the flow converge, which is often in the DSL line card, or > CTMS... The elephant in the room on those devices is the per user rate shaper. In software this accounts for 95% of the cpu time and scheduling headaches, and that's without dealing with ipv6 pools. > -- > ] Never tell me the odds! | ipv6 mesh networks [ > ] Michael Richardson, Sandelman Software Works | network architect [ > ] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [ > > _______________________________________________ > Cerowrt-devel mailing list > Cerowrt-devel@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/cerowrt-devel -- Dave Täht NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article ^ permalink raw reply [flat|nested] 30+ messages in thread
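The "64k for 1024 flows" figure above works out to roughly 64 bytes of state per flow, and the invsqrt dependency mentioned comes from CoDel's control law: after the n-th consecutive drop, the next drop is scheduled interval/sqrt(n) later (100 ms is CoDel's default interval). A small sketch of both calculations:

```python
import math

# Per-flow state implied by "64k for 1024 flows".
flows, table_bytes = 1024, 64 * 1024
print(table_bytes // flows, "bytes of per-flow state")   # 64

# CoDel control law: drop spacing shrinks as interval / sqrt(drop count).
interval_ms = 100.0                                      # CoDel default interval
for n in (1, 2, 4, 16):
    print(f"drop #{n}: next drop scheduled {interval_ms / math.sqrt(n):5.1f} ms later")
```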
* Re: [Cerowrt-devel] Ubiquiti QOS
  2014-05-25 6:17 [Cerowrt-devel] Ubiquiti QOS Dane Medic
  2014-05-25 14:23 ` Valdis.Kletnieks
  2014-05-25 15:42 ` Mikael Abrahamsson
@ 2014-05-25 18:39 ` Sebastian Moeller
  2014-05-25 19:33 ` Dave Taht
  2 siblings, 1 reply; 30+ messages in thread
From: Sebastian Moeller @ 2014-05-25 18:39 UTC (permalink / raw)
To: Dane Medic; +Cc: cerowrt-devel

Hi Dane,

On May 25, 2014, at 08:17 , Dane Medic <dm70dm@gmail.com> wrote:

> Is it true that devices with less than 64 MB can't handle QOS? -> https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html

I think this means that the Commotion developers think that 64 MB are required. But it does not sound like they have first-hand experience, so this is either hearsay, or Commotion's mesh networking is memory intensive. On the OpenWrt side there seems to be no documentation of minimum RAM requirements. Doing a quick back-of-the-envelope calculation here:

OpenWrt QoS has 4 tiers which run fq_codel in both directions, so we have 8 fq_codel instances, with each fq_codel having a limit of 10240 packets, so worst case we expect:

4 * 2 * 10240 = 81920 packets

at 1500 bytes this equals

4 * 2 * 10240 * 1500 / (1024 * 1024) = 117.1875 MB

This indeed is a bit heavy on a 32 MB router, but honestly 64 MB will not really help you. Then again, current OpenWrt has a limit of 800 instead of 10240, so we end up at a worst case of:

4 * 2 * 800 * 1500 / (1024 * 1024) = 9.1552734375 MB

which should still be possible with 32 MB. (Note that typically fq_codel does not fill its queues up to the limit, but it still would be bad if a router could easily be DOSed into OOM and rebooting…)

For current CeroWrt with simple.qos the worst case is:

(1001 * 4 + 1000 * 13 + 800 * 12) * 1500 / (1024 * 1024) = 38.0573272705 MB

yet this still works quite well on a 64 MB device (only 4 of these queues are connected to the WAN interface though).

One of the bigger issues with devices with small RAM is that often they have relatively weak CPUs, and I seem to recall that CeroWrt tops out around 60 to 70 Mbit/sec (total for ingress and egress) due to its shaping performance.

So unless you want to run Commotion you might want to ask on the OpenWrt list…

Best Regards
	Sebastian

> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel

^ permalink raw reply [flat|nested] 30+ messages in thread
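Sebastian's worst case reduces to a single product - tiers * directions * limit * MTU - and the small C program below (purely illustrative, not part of OpenWrt) just replays his two numbers so the same formula can be reused for other configurations. The CeroWrt simple.qos case does not reduce to this one product because its per-queue limits differ (1001/1000/800), but the same packets-times-MTU accounting applies.

#include <stdio.h>

/* worst-case queue memory in MB: every queue full of MTU-sized packets */
static double worst_case_mb(unsigned tiers, unsigned directions,
                            unsigned limit_pkts, unsigned mtu_bytes)
{
    return (double)tiers * directions * limit_pkts * mtu_bytes
           / (1024.0 * 1024.0);
}

int main(void)
{
    /* OpenWrt QoS, old per-instance limit of 10240 packets */
    printf("limit 10240: %8.4f MB\n", worst_case_mb(4, 2, 10240, 1500));
    /* same setup with the newer limit of 800 packets */
    printf("limit   800: %8.4f MB\n", worst_case_mb(4, 2, 800, 1500));
    return 0;
}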
* Re: [Cerowrt-devel] Ubiquiti QOS
  2014-05-25 18:39 ` Sebastian Moeller
@ 2014-05-25 19:33 ` Dave Taht
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Taht @ 2014-05-25 19:33 UTC (permalink / raw)
To: Sebastian Moeller; +Cc: cerowrt-devel

On Sun, May 25, 2014 at 11:39 AM, Sebastian Moeller <moeller0@gmx.de> wrote:
> Hi Dane,
>
> On May 25, 2014, at 08:17 , Dane Medic <dm70dm@gmail.com> wrote:
>
>> Is it true that devices with less than 64 MB can't handle QOS? -> https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html
>
> I think this means that the Commotion developers think that 64 MB are required.

A dev thinks so. It ain't true. (I have a deployed mesh network of mostly nano and picostations and they never crash due to being out of memory. Sadly, they do crash for other reasons.)

As you get below 64 MB some compromises are needed, and as you try to push more bits compromises are needed, and as you add more interfaces and queues compromises are needed.

If you are using up all your memory running some application or another, instead of having it free for packets, you are generally in more trouble than you want to be. So the starting factor is how much free RAM you have in normal operation with your applications running. Start with that as a baseline. (It isn't strictly true either; you can discard (not swap out) a great deal of program text and still have your applications run well.)

You typically want something like SQM on your gateway ethernet interface. If you have seriously limited free RAM, just run a single fq_codel instance as in simplest.qos.

As for the mesh backbone, well, there remains so much buffering underneath in the ath9k wifi drivers that fq_codel only takes the edge off a little. We just disabled 802.11e completely (getting rid of 3 out of 4 queues per interface), and my results have always been better for that, and I hope to categorize them this summer - and it also saves on memory usage.

> But it does not sound like they have first-hand experience, so this is either hearsay, or Commotion's mesh networking is memory intensive. On the OpenWrt side there seems to be no documentation of minimum RAM requirements. Doing a quick back-of-the-envelope calculation here:
> OpenWrt QoS has 4 tiers which run fq_codel in both directions, so we have 8 fq_codel instances, with each fq_codel having a limit of 10240 packets, so worst case we expect:
>
> 4 * 2 * 10240 = 81920 packets
>
> at 1500 bytes this equals
>
> 4 * 2 * 10240 * 1500 / (1024 * 1024) = 117.1875 MB

Try to watch out for this sort of equivalence. Acks in the other direction are 66 bytes. Arguably we should have specified fq_codel's outside limit in bytes, not packets, and made it autotune to some ratio of free memory. And NONE of this memory is pre-allocated... more on that in a paragraph.

> This indeed is a bit heavy on a 32 MB router, but honestly 64 MB will not really help you. Then again, current OpenWrt has a limit of 800 instead of 10240, so we end up at a worst case of:

It's the extra SSIDs that hurt most these days, followed by the queues "needed" for 802.11e, but wait a paragraph...

> 4 * 2 * 800 * 1500 / (1024 * 1024) = 9.1552734375 MB
>
> which should still be possible with 32 MB. (Note that typically fq_codel does not fill its queues up to the limit, but it still would be bad if a router could easily be DOSed into OOM and rebooting…)

The principal reason for the limit is to avoid a DoS on memory-limited routers. Otherwise I'd be perfectly happy if we could run it at the defaults. And the limit should go up some as we get closer to pushing GigE speeds (presently the router can only forward at about 330 Mbit).

I am really hating seeing people cut/paste the limit into newer code without realizing that it was just there to keep a 32 MB box from crashing under a DoS... (And I note that if you run out of memory to service packets, the odds are very good your box won't crash anyway - but this exercises code paths that are rarely touched.)

> For current CeroWrt with simple.qos the worst case is:
> (1001 * 4 + 1000 * 13 + 800 * 12) * 1500 / (1024 * 1024) = 38.0573272705 MB
>
> yet this still works quite well on a 64 MB device (only 4 of these queues are connected to the WAN interface though)

3 up, 3 down, actually, in simple.qos; 1 up, 1 down in simple. And each wifi SSID consumes 4 queues (although right now, 3 are unused).

I want to make really clear: the FIXED overhead of fq_codel is something like 100 + 64 bytes * flows (usually 1024) - so each fq_codel instance eats 64K (not M!) of data to run. The limit just keeps packet data (don't quote me, something like 200 bytes + packet size) under control, and that usually only builds up on the bottleneck device, not anywhere else. So you might have a bottleneck on your wifi, someone DoS-ing you maybe, but the ethernet interface will be running at only a few packets outstanding at any time.

> One of the bigger issues with devices with small RAM is that often they have relatively weak CPUs, and I seem to recall that CeroWrt tops out around 60 to 70 Mbit/sec (total for ingress and egress) due to its shaping performance.

Yes. Doing considerably

> So unless you want to run Commotion you might want to ask on the OpenWrt list…
>
> Best Regards
> 	Sebastian
>
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel

--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article

^ permalink raw reply [flat|nested] 30+ messages in thread
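Dave's aside about specifying the limit in bytes and autotuning it against free memory can be sketched as below. autotune_limit is a hypothetical helper, not an existing fq_codel or tc parameter, and the 1/8 ratio and clamp values are arbitrary placeholders.

#include <stdint.h>

/*
 * Convert a share of currently free RAM into a per-instance packet
 * limit, assuming worst-case MTU-sized packets.
 */
static uint32_t autotune_limit(uint64_t free_ram_bytes,
                               uint32_t mtu_bytes,
                               uint32_t num_instances)
{
    /* let all fq_codel instances together consume 1/8 of free RAM */
    uint64_t budget = free_ram_bytes / 8 / num_instances;
    uint64_t pkts   = budget / mtu_bytes;

    if (pkts < 64)        /* keep enough depth to absorb normal bursts */
        pkts = 64;
    if (pkts > 10240)     /* upstream default as an upper bound        */
        pkts = 10240;
    return (uint32_t)pkts;
}

A 32 MB router with roughly 12 MB free and 8 instances would get about 12 MiB / 8 / 8 / 1500 B, i.e. roughly 131 packets per instance - more conservative than the 800-packet limit discussed above, but derived from the memory actually available rather than hard-coded.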
end of thread, other threads:[~2014-05-30 0:36 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-25 6:17 [Cerowrt-devel] Ubiquiti QOS Dane Medic
2014-05-25 14:23 ` Valdis.Kletnieks
2014-05-25 15:42 ` Mikael Abrahamsson
2014-05-25 20:00 ` dpreed
2014-05-26 0:18 ` Mikael Abrahamsson
2014-05-26 4:49 ` dpreed
2014-05-26 13:02 ` Mikael Abrahamsson
2014-05-26 14:01 ` dpreed
2014-05-26 14:11 ` Mikael Abrahamsson
2014-05-26 15:31 ` David P. Reed
2014-05-27 21:19 ` David Lang
2014-05-27 22:00 ` Dave Taht
2014-05-27 23:27 ` David Lang
2014-05-28 2:12 ` Dave Taht
2014-05-28 3:21 ` David Lang
2014-05-28 15:52 ` dpreed
2014-05-28 16:34 ` David Lang
2014-05-27 15:23 ` Jim Gettys
2014-05-27 17:31 ` Dave Taht
2014-05-28 15:33 ` dpreed
2014-05-28 15:20 ` dpreed
2014-05-28 18:33 ` David Lang
2014-05-29 12:11 ` David P. Reed
2014-05-29 15:29 ` dpreed
2014-05-29 19:30 ` David Lang
2014-05-29 23:40 ` Michael Richardson
2014-05-30 0:32 ` David P. Reed
2014-05-30 0:36 ` Dave Taht
2014-05-25 18:39 ` Sebastian Moeller
2014-05-25 19:33 ` Dave Taht