* [Make-wifi-fast] Diagram of the ath9k TX path @ 2016-05-09 11:00 Toke Høiland-Jørgensen 2016-05-09 15:35 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2016-05-09 11:00 UTC (permalink / raw) To: make-wifi-fast I finally finished my flow diagram of the ath9k TX path (corresponding to the previous one I did for the mac80211 stack). In case anyone else is interested, it's available here: https://blog.tohojo.dk/2016/05/the-ath9k-tx-path.html -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-09 11:00 [Make-wifi-fast] Diagram of the ath9k TX path Toke Høiland-Jørgensen @ 2016-05-09 15:35 ` Dave Taht 2016-05-10 2:25 ` Jonathan Morton 0 siblings, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-09 15:35 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: make-wifi-fast, ath9k-devel On Mon, May 9, 2016 at 4:00 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote: > I finally finished my flow diagram of the ath9k TX path (corresponding > to the previous one I did for the mac80211 stack). In case anyone else > is interested, it's available here: > > https://blog.tohojo.dk/2016/05/the-ath9k-tx-path.html Looks quite helpful. I do not understand why there is a "fast path" at all in this driver; should we always wait a little bit to see if we can form an aggregate? It would be awesome to be able to adapt michal's work on fq_codeling things, leveraging rate control information as he did here: http://blog.cerowrt.org/post/fq_codel_on_ath10k/ rather than dql as he did here: http://blog.cerowrt.org/post/dql_on_wifi_2/ to the ath9k. > -Toke > _______________________________________________ > Make-wifi-fast mailing list > Make-wifi-fast@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/make-wifi-fast -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-09 15:35 ` Dave Taht @ 2016-05-10 2:25 ` Jonathan Morton 2016-05-10 2:59 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Jonathan Morton @ 2016-05-10 2:25 UTC (permalink / raw) To: Dave Taht; +Cc: Toke Høiland-Jørgensen, make-wifi-fast, ath9k-devel > On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: > > should we always wait a little bit to see if we can form an aggregate? I thought the consensus on this front was “no”, as long as we’re making the decision when we have an immediate transmit opportunity. If we *don’t* have an immediate transmit opportunity, then we *must* wait regardless, and maybe some other packets will arrive which can then be aggregated. - Jonathan Morton ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 2:25 ` Jonathan Morton @ 2016-05-10 2:59 ` Dave Taht 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 0 siblings, 2 replies; 25+ messages in thread From: Dave Taht @ 2016-05-10 2:59 UTC (permalink / raw) To: Jonathan Morton Cc: Toke Høiland-Jørgensen, make-wifi-fast, ath9k-devel On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> wrote: > >> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >> >> should we always wait a little bit to see if we can form an aggregate? > > I thought the consensus on this front was “no”, as long as we’re making the decision when we have an immediate transmit opportunity. I think it is more nuanced than how david lang has presented it. We haven't argued the finer points just yet - merely seeing 12-20ms latency across the entire 6-300mbit range I've tested thus far has been a joy, and I'd like to at least think about ways to cut another order of magnitude off of that while making better use of packing the medium. http://blog.cerowrt.org/post/anomolies_thus_far/ So... I don't think we "achieved consensus", I just faded... I thought at the time that merely getting down from 2+ seconds to 20ms of induced latency was vastly more important :), and I didn't want to belabor the point until we had some solid results. I'll still settle for "1 agg in the hardware, 1 agg in the driver"... but smaller, and better formed, aggs under contention - which might sometimes involve a pause for a hundred usec to gather up more, when empty, or more, when the driver is known to be busy. ... Over the weekend I did some experiments setting the beacon-advertised txop size for best effort traffic to 94 (same size as the vi queue that was so busted in earlier tests ( http://blog.cerowrt.org/post/cs5_lockout/ ) ) to try to see if the station (or AP) paid attention to it... the bandwidth symmetry I got compared to the defaults was remarkable. This chart also shows the size of the win against the stock ath10k firmware and driver in terms of latency, and not having flows collapse... http://blog.cerowrt.org/flent/txop_94/rtt_fairbe_compared.svg Now, given that most people use wifi asymmetrically, perhaps there are fewer use cases for when the AP and station work more symmetrically, but this was a pretty result. http://blog.cerowrt.org/flent/dual-txop-94/up_down_vastly_better.svg Haven't finished writing up the result, other than to note that tweaking this parameter had no seeming effect on the baseline 10-15ms of driver latency left in it, under load. > > If we *don’t* have an immediate transmit opportunity, then we *must* wait regardless, and maybe some other packets will arrive which can then be aggregated. > > - Jonathan Morton > -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 2:59 ` Dave Taht @ 2016-05-10 3:30 ` Adrian Chadd 2016-05-10 4:04 ` Dave Taht 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 1 sibling, 1 reply; 25+ messages in thread From: Adrian Chadd @ 2016-05-10 3:30 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen Hi, So: * the hardware can give us a per-AC transmit opportunity; * software queuing needs to handle the per-STA transmit opportunity; * they (and I followed convention after testing) found the "best" compromise was to hardware queue up to two frames, which we could probably do slightly more of at higher MCS rates for "reasons", but if we're getting enough packets come in then if the hardware queues get drained slower than we can fill them, we naturally aggregate traffic. So it actually works pretty well in practice. The general aim is to keep up to ~8ms of aggregates queued, and that's typically two aggregate frames so we don't bust the block-ack window. -adrian ^ permalink raw reply [flat|nested] 25+ messages in thread
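As a rough illustration of what Adrian's "keep up to ~8ms of aggregates queued" budget means at different PHY rates, here is a back-of-the-envelope sketch. The MPDU size, the rate list, and the neglect of preamble/MAC overhead are illustrative assumptions, not numbers from his setup:

```
# Roughly how many full-size MPDUs fit in an 8 ms airtime budget at a few
# PHY rates (ignoring preamble, interframe spacing and MAC overhead).
MPDU_BYTES = 1500      # illustrative full-size MPDU
BAW = 64               # block-ack window: max MPDUs per A-MPDU

for phy_mbit in (24, 72, 150, 300):
    bytes_in_8ms = phy_mbit * 1e6 / 8 * 0.008
    mpdus = bytes_in_8ms / MPDU_BYTES
    print(f"{phy_mbit:3d} Mbit/s: ~{bytes_in_8ms / 1024:5.0f} KB in 8 ms, "
          f"~{mpdus:5.0f} MPDUs (BAW caps one aggregate at {BAW})")
```

At low rates the 8 ms budget is only a handful of MPDUs, well under one full aggregate, while at the highest rates two full 64-MPDU aggregates already fit inside it, which lines up with the "two aggregate frames queued to hardware" compromise described above.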
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd @ 2016-05-10 4:04 ` Dave Taht 2016-05-10 4:22 ` Aaron Wood 2016-05-10 7:15 ` Adrian Chadd 0 siblings, 2 replies; 25+ messages in thread From: Dave Taht @ 2016-05-10 4:04 UTC (permalink / raw) To: Adrian Chadd Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen On Mon, May 9, 2016 at 8:30 PM, Adrian Chadd <adrian@freebsd.org> wrote: > Hi, > > So: > > * the hardware can give us a per-AC transmit opportunity; > * software queuing needs to handle the per-STA transmit opportunity; > * they (and I followed convention after testing) "They" had probably not made proper sacrifices to the layer 3 congestion control gods, nor envisioned a world with 10s of stations on a given ap and dozens of competing APs.... > found the "best" > compromise was to hardware queue up to two frames, which we could > probably do slightly more of at higher MCS rates for "reasons", but if > we're getting enough packets come in then if the hardware queues get > drained slower than we can fill them, we naturally aggregate traffic. > > So it actually works pretty well in practice. Aside from seconds of queuing on top. :) Is everybody here on board with reducing that by 2 orders of magnitude? I'm not posting all these results and all the flent data just to amuse myself... The size of the potential patch set for softmac devices has declined considerably - codel.h and the fq code are now generalized in some tree or another, and what's left is in two competing patches under test... one that leverages rate control stats and wins like crazy, the other, dql, and takes longer to win like crazy. http://blog.cerowrt.org/post/ has the writeups https://github.com/dtaht/blog-cerowrt has all the data and the writeups still in draft form, in git, for your own bemusement and data comparisons with the stock drivers. > The general aim is to > keep up to ~8ms of aggregates queued, and that's typically two > aggregate frames so we don't bust the block-ack window. My understanding was that hardware retry exists for the ath9k at least, and that block-acks responded in under 10us. Also that ath9k allowed you to describe and send up to 4 chains at different rates. Yes? As for "busting the window", what I wanted to try was adding in QoS Noack on frames that you feel could be safely dropped, so the first part of the txop would have an AMPDU of stuff you felt strongly about keeping, the second, one not block acknowledged, and a third consisting of the last bits of the flows you didn't care about as much, but want to provide packets for to inform the other side that drops happened, block acked.... As for software retry, we could be smarter about it than we currently are. A fixed number of retries (15 in the ath10k driver, 10 in the ath9k driver) is just nuts... As for 8ms, well, I'd much rather hand out 1ms each to 8 stations than 8ms each to 8 stations. > > > -adrian -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 4:04 ` Dave Taht @ 2016-05-10 4:22 ` Aaron Wood 2016-05-10 7:15 ` Adrian Chadd 1 sibling, 0 replies; 25+ messages in thread From: Aaron Wood @ 2016-05-10 4:22 UTC (permalink / raw) To: Dave Taht; +Cc: Adrian Chadd, ath9k-devel, make-wifi-fast [-- Attachment #1: Type: text/plain, Size: 1047 bytes --] On Mon, May 9, 2016 at 9:04 PM, Dave Taht <dave.taht@gmail.com> wrote: > > Is everybody here on board with reducing that by 2 orders of > magnitude? Yes! > I'm not posting all these results and all the flent data > just to amuse myself... The size of the potential patch set for > softmac devices has declined considerably - codel.h and the fq code > are now generalized in some tree or another, and what's left is in two > competing patches under test... one that leverages rate control stats > and wins like crazy, the other, dql, and takes longer to win like > crazy. > I'm really hoping to be able to use this. Wifi is the last buffer-bloated part of my home network (the latency differences between wired and wireless are an order of magnitude apart). And as for multiple STAs per AP: That's the norm, these days. My house might be a bit extreme, but we have 8 STAs in 5GHz, and one lowly printer in 2.4GHz (which is too far for 5GHz to get to, and I haven't run a cable to that end of the house yet as I hate crawlspaces). -Aaron [-- Attachment #2: Type: text/html, Size: 1609 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 4:04 ` Dave Taht 2016-05-10 4:22 ` Aaron Wood @ 2016-05-10 7:15 ` Adrian Chadd 2016-05-10 7:17 ` Adrian Chadd 1 sibling, 1 reply; 25+ messages in thread From: Adrian Chadd @ 2016-05-10 7:15 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen Well, there shouldn't /also/ be a software queue behind each TXQ at that point. Eg, in FreeBSD, I queue up to 64 frames per station and then default to round-robining between stations when it's time to form another aggregate. It's done that way so I or someone else can implement a wifi queue discipline in between the per-station / per-TID queues and the hardware queue that knows about time of flight, etc. The variations on the internal driver tended to slide some more complicated queue management and rate control between the bit that dequeued from the per-TID/per-STA packet queue and formed aggregates. Ie, the aggregate was only formed at hardware queue time, and only two were pushed into the hardware at once. There were only deep hardware queues in very old, pre-11n drivers, to minimise CPU overhead. So yeah, do reduce that a bit. The hardware queue should be two frames; there shouldn't be anything that needs to be queued up behind it in the software queue backing it if you're aggregating, and if you /are/ queueing there, that queue should be bounded based on flight time (eg lots of little frames, or one big aggregate, but not lots of big aggregates.) Yeah, I also have to add NOACK support for A-MPDU in the FreeBSD driver for various reasons (voice, yes, but also multicast A-MPDU.) You just need to ensure that you slide along the BAW in a way that flushes the sender, or you may drop some frames that are still inside the BAW, and then the receiver buffers for up to its timeout value (typically tens of milliseconds) waiting for more frames in the BAW (hopefully retries). I dunno if you're really allowed to be sending NOACK data frames if you've negotiated immediate blockack though! And yeah, for time? Totally depends on what you're doing. Yes, if you have lots of stations actively doing traffic, then you should just form smaller aggregates. It's especially good for dealing with the full frame retries (ie, RTS/CTS worked, but the data frame didn't, and you didn't get a block-ack at all.) On longer aggregates, that can /really/ hurt - ie, you're likely better off doing one full retry only and then failing it so you can requeue it in software and break it up into a smaller set of aggregates after the rate control code gets the update. (God, I should really do this all to FreeBSD now that I'm kinda allowed to again..) -adrian ^ permalink raw reply [flat|nested] 25+ messages in thread
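To make the queuing structure Adrian describes concrete, here is a toy sketch of per-station queues with a 64-frame cap and round-robin selection when it's time to form the next aggregate. It only illustrates the shape of the thing; the names, the per-aggregate cap, and the drop-on-overflow behaviour are assumptions, not FreeBSD's actual code:

```
from collections import deque

PER_STA_LIMIT = 64      # frames queued per station, per the description above
AGG_LIMIT = 32          # illustrative cap on MPDUs pulled into one aggregate

stations = {}           # station id -> deque of frames
active = deque()        # round-robin order of stations with queued frames

def enqueue(sta, frame):
    q = stations.setdefault(sta, deque())
    if len(q) >= PER_STA_LIMIT:
        return False            # drop (or push back to the stack above)
    if not q:
        active.append(sta)      # station becomes schedulable
    q.append(frame)
    return True

def form_aggregate():
    """Pick the next station round-robin and build one aggregate for it."""
    while active:
        sta = active.popleft()
        q = stations[sta]
        agg = [q.popleft() for _ in range(min(AGG_LIMIT, len(q)))]
        if q:
            active.append(sta)  # still has traffic, goes to the back
        if agg:
            return sta, agg
    return None, []
```

A real driver would size the per-aggregate limit from airtime (rate and BAW) rather than a fixed frame count, which is exactly the "queue discipline that knows about time of flight" hook Adrian mentions wanting to slot in between the per-station queues and the hardware queue.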
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 7:15 ` Adrian Chadd @ 2016-05-10 7:17 ` Adrian Chadd 0 siblings, 0 replies; 25+ messages in thread From: Adrian Chadd @ 2016-05-10 7:17 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen The other hack that I've seen ath9k do is that it actually assigns sequence numbers and CCMP IVs at the point of enqueue-to-hardware, rather than enqueue-to-driver. I tried this in FreeBSD and dropped it for other reasons, mostly in favour of shallower (configurable!) per-station queue depths. That way you /could/ drop data frames in the driver/wifi stack per-STA/TID queues as part of queue discipline, way before they get sent out on the air. If you're modeling airtime on the fly then you could see that the queue for a given station is way too deep now that the rate to it has dropped, so don't bother keeping those frames around. Once you have assigned the seqno/IV then you're "committed", so to speak. -adrian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 2:59 ` Dave Taht 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd @ 2016-05-10 3:41 ` David Lang 2016-05-10 4:59 ` Dave Taht 2016-05-13 17:46 ` Bob McMahon 1 sibling, 2 replies; 25+ messages in thread From: David Lang @ 2016-05-10 3:41 UTC (permalink / raw) To: Dave Taht; +Cc: Jonathan Morton, make-wifi-fast, ath9k-devel@lists.ath9k.org [-- Attachment #1: Type: TEXT/PLAIN, Size: 2776 bytes --] On Mon, 9 May 2016, Dave Taht wrote: > On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> wrote: >> >>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>> >>> should we always wait a little bit to see if we can form an aggregate? >> >> I thought the consensus on this front was “no”, as long as we’re making the decision when we have an immediate transmit opportunity. > > I think it is more nuanced than how david lang has presented it. I have four reasons for arguing for no speculative delays. 1. airtime that isn't used can't be saved. 2. lower best-case latency 3. simpler code 4. clean, and gradual service degradation under load. the arguments against are: 5. throughput per ms of transmit time is better if aggregation happens than if it doesn't. 6. if you don't transmit, some other station may choose to before you would have finished. #2 is obvious, but with the caveat that anytime you transmit you may be delaying someone else. #1 and #6 are flip sides of each other. we want _someone_ to use the airtime, the question is who. #3 and #4 are closely related. If you follow my approach (transmit immediately if you can, aggregate when you have a queue), the code really has one mode (plus queuing). "If you have a Transmit Opportunity, transmit up to X packets from the queue", and it doesn't matter if it's only one packet. If you delay the first packet to give you a chance to aggregate it with others, you add in the complexity and overhead of timers (including cancelling timers, slippage in timers, etc) and you add a "first packet, start timers" mode to deal with. I grant you that the first approach will "saturate" the airtime at lower traffic levels, but at that point all the stations will start aggregating the minimum amount needed to keep the air saturated, while still minimizing latency. I then expect that application-related optimizations would then further complicate the second approach. there are just too many cases where small amounts of data have to be sent and other things serialize behind them. DNS lookup to find a domain, to then do a 3-way handshake, to then do a request to see if the <web something> library has been updated since last cached (repeat for several libraries), to then fetch the actual page content. All of these things up to the actual page content could be single packets that have to be sent (and responded to with a single packet), waiting for the prior one to complete. If you add a few ms to each of these, you can easily hit 100ms in added latency. Once you start to try and special-case these sorts of things, the code complexity multiplies. So I believe that the KISS approach ends up with a 'worse is better' situation. David Lang ^ permalink raw reply [flat|nested] 25+ messages in thread
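A minimal sketch of the single-mode behaviour David is arguing for: never delay speculatively, just queue, and whenever a transmit opportunity shows up drain up to some cap from the queue, so aggregation emerges only once a backlog already exists. The function names and the cap are illustrative, not from any driver:

```
from collections import deque

MAX_PER_TXOP = 32            # "up to X packets from the queue", illustrative

queue = deque()

def on_packet(pkt, have_txop):
    # No speculative delay: queue the packet, and if a transmit opportunity
    # is available right now, use it immediately (even for a single packet).
    queue.append(pkt)
    if have_txop:
        return drain_txop()
    return []

def drain_txop():
    # The one and only mode: when an opportunity arrives, send whatever has
    # accumulated, up to the cap.  Aggregation falls out naturally when a
    # backlog has built up; a lone packet just goes out alone.
    burst = []
    while queue and len(burst) < MAX_PER_TXOP:
        burst.append(queue.popleft())
    return burst
```

There is no timer, no "first packet, start timer" state, and no cancellation path, which is the code-simplicity point being made above.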
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang @ 2016-05-10 4:59 ` Dave Taht 2016-05-10 5:22 ` David Lang 2016-05-13 17:46 ` Bob McMahon 1 sibling, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-10 4:59 UTC (permalink / raw) To: David Lang Cc: Jonathan Morton, make-wifi-fast, ath9k-devel@lists.ath9k.org, Randell Jesup This is a very good overview, thank you. I'd like to take apart station behavior on wifi with a web application... as a straw man. On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > On Mon, 9 May 2016, Dave Taht wrote: > >> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >> wrote: >>> >>> >>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>> >>>> should we always wait a little bit to see if we can form an aggregate? >>> >>> >>> I thought the consensus on this front was “no”, as long as we’re making >>> the decision when we have an immediate transmit opportunity. >> >> >> I think it is more nuanced than how david lang has presented it. > > > I have four reasons for arguing for no speculative delays. > > 1. airtime that isn't used can't be saved. > > 2. lower best-case latency > > 3. simpler code > > 4. clean, and gradual service degredation under load. > > the arguments against are: > > 5. throughput per ms of transmit time is better if aggregation happens than > if it doesn't. > > 6. if you don't transmit, some other station may choose to before you would > have finished. > > #2 is obvious, but with the caviot that anytime you transmit you may be > delaying someone else. > > #1 and #6 are flip sides of each other. we want _someone_ to use the > airtime, the question is who. > > #3 and #4 are closely related. > > If you follow my approach (transmit immediately if you can, aggregate when > you have a queue), the code really has one mode (plus queuing). "If you have > a Transmit Oppertunity, transmit up to X packets from the queue", and it > doesn't matter if it's only one packet. > > If you delay the first packet to give you a chance to aggregate it with > others, you add in the complexity and overhead of timers (including > cancelling timers, slippage in timers, etc) and you add "first packet, start > timers" mode to deal with. > > I grant you that the first approach will "saturate" the airtime at lower > traffic levels, but at that point all the stations will start aggregating > the minimum amount needed to keep the air saturated, while still minimizing > latency. > > I then expect that application related optimizations would then further > complicate the second approach. there are just too many cases where small > amounts of data have to be sent and other things serialize behind them. > > DNS lookup to find a domain to then to a 3-way handshake to then do a > request to see if the <web something> library has been updated since last > cached (repeat for several libraries) to then fetch the actual page content. > All of these thing up to the actual page content could be single packets > that have to be sent (and responded to with a single packet), waiting for > the prior one to complete. If you add a few ms to each of these, you can > easily hit 100ms in added latency. Once you start to try and special cases > these sorts of things, the code complexity multiplies. Take web page parsing as an example. The first request is a dns lookup. 
The second request is an http get (which can include a few more round trips for negotiating SSL), the next is a flurry of page parsing that results in the internal web browser attempting to schedule its requests best and then sending out the relevant dns and tcp flows as best it can figure out, and then, typically, several seconds of data transfer across each set of flows. Page paint is bound by getting the critical portions of the resulting data parsed and laid out properly. Now, I'd really like that early phase to be optimized by APs by something more like SQF, where when a station appears and does a few packet exchanges that it gets priority over stations taking big flows on a more regular basis, so it more rapidly gets into flow balance with the other stations. (and then, for most use cases, like web, exits) the second phase, of actual transfer, is also bound by RTT. I have no idea how much thought wifi folk actually put into typical web transfer delays (20-80ms), but they are there... ... The idea of the wifi driver waiting a bit to form a better aggregate to fit into a txop ties into two slightly different timings and flow behaviors. If it is taking 10ms to get a txop in the first place, taking more time to assemble a good batch of packets to fit into "your" txop would be good. If it is taking 4ms to transfer your last txop, well, more packets may arrive for you in that interval, and feed into your existing flows to keep them going, if you defer feeding the hardware with them. Also, classic tcp acking goes out the window with competing acks at layer 2. I don't know if quic can do the equivalent of stretch acks... but one layer 3 ack, block acked by layer 2 in wifi, suffices... if you have a ton of tcp acks outstanding, block acking them all is expensive... > So I believe that the KISS approach ends up with a 'worse is better' > situation. Code is going to get more complex anyway, and there are other optimizations that could be made. One item I realized recently is that part of codel need not run on every packet in every flow for stuff destined to fit into a single txop. It is sufficient to see if it declared a drop on the first packet in a flow destined for a given txop. You can then mark that entire flow (in a txop) as droppable (QoSNoAck) within that txop (as it is within an RTT, and even losing all the packets there will only cause the rate to halve). > > David Lang -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
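A sketch of that last idea, purely to make it concrete; the dict-based packets, the codel_should_drop hook and the noack flag are stand-ins, not anything from an existing driver. The point is to run the codel decision once on the head packet of each flow going into a txop, and if it fires, mark that flow's packets in the txop as QoSNoAck instead of dropping them one by one:

```
def pack_txop(flows, codel_should_drop):
    """Build one txop's worth of packets.

    flows: list of per-flow packet lists (dicts) destined for this txop.
    codel_should_drop: AQM hook, consulted only on each flow's head packet.
    """
    txop = []
    for pkts in flows:
        if not pkts:
            continue
        # One codel decision per flow per txop; if it would have dropped,
        # send that flow's share unacknowledged instead.  Losing it all only
        # halves the flow's rate, since the whole txop fits within one RTT.
        noack = codel_should_drop(pkts[0])
        for p in pkts:
            p["noack"] = noack
            txop.append(p)
    return txop
```

(David's reply further down suggests not leaving the whole flow unprotected, to avoid one txop's worth of data turning into several if the rate drops before a retransmit.)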
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 4:59 ` Dave Taht @ 2016-05-10 5:22 ` David Lang 2016-05-10 9:04 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: David Lang @ 2016-05-10 5:22 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel@lists.ath9k.org, Randell Jesup [-- Attachment #1: Type: TEXT/PLAIN, Size: 8546 bytes --] On Mon, 9 May 2016, Dave Taht wrote: > This is a very good overview, thank you. I'd like to take apart > station behavior on wifi with a web application... as a straw man. > > On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >> On Mon, 9 May 2016, Dave Taht wrote: >> >>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>> wrote: >>>> >>>> >>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>> >>>>> should we always wait a little bit to see if we can form an aggregate? >>>> >>>> >>>> I thought the consensus on this front was “no”, as long as we’re making >>>> the decision when we have an immediate transmit opportunity. >>> >>> >>> I think it is more nuanced than how david lang has presented it. >> >> >> I have four reasons for arguing for no speculative delays. >> >> 1. airtime that isn't used can't be saved. >> >> 2. lower best-case latency >> >> 3. simpler code >> >> 4. clean, and gradual service degredation under load. >> >> the arguments against are: >> >> 5. throughput per ms of transmit time is better if aggregation happens than >> if it doesn't. >> >> 6. if you don't transmit, some other station may choose to before you would >> have finished. >> >> #2 is obvious, but with the caviot that anytime you transmit you may be >> delaying someone else. >> >> #1 and #6 are flip sides of each other. we want _someone_ to use the >> airtime, the question is who. >> >> #3 and #4 are closely related. >> >> If you follow my approach (transmit immediately if you can, aggregate when >> you have a queue), the code really has one mode (plus queuing). "If you have >> a Transmit Oppertunity, transmit up to X packets from the queue", and it >> doesn't matter if it's only one packet. >> >> If you delay the first packet to give you a chance to aggregate it with >> others, you add in the complexity and overhead of timers (including >> cancelling timers, slippage in timers, etc) and you add "first packet, start >> timers" mode to deal with. >> >> I grant you that the first approach will "saturate" the airtime at lower >> traffic levels, but at that point all the stations will start aggregating >> the minimum amount needed to keep the air saturated, while still minimizing >> latency. >> >> I then expect that application related optimizations would then further >> complicate the second approach. there are just too many cases where small >> amounts of data have to be sent and other things serialize behind them. >> >> DNS lookup to find a domain to then to a 3-way handshake to then do a >> request to see if the <web something> library has been updated since last >> cached (repeat for several libraries) to then fetch the actual page content. >> All of these thing up to the actual page content could be single packets >> that have to be sent (and responded to with a single packet), waiting for >> the prior one to complete. If you add a few ms to each of these, you can >> easily hit 100ms in added latency. Once you start to try and special cases >> these sorts of things, the code complexity multiplies. > > Take web page parsing as an example. 
The first request is a dns > lookup. The second request is an http get (which can include a few more > round trips for > negotiating SSL), the next is a flurry of page parsing that results in > the internal web browser attempting to schedule its requests best and > then sending out the relevant dns and tcp flows as best it can figure > out, and then, typically several seconds of data transfer across each > set of flows. Actually, I think that a lot (if not the majority) of these flows are actually short, because the libraries/stylesheets/images/etc are cached recently enough that they don't need to be fetched again. The browser just needs to check if the copy they have is still good or if it's been changed. > Page paint is bound by getting the critical portions of the resulting > data parsed and laid out properly. > > Now, I'd really like that early phase to be optimized by APs by > something more like SQF, where when a station appears and does a few > packet exchanges that it gets priority over stations taking big flows > on a more regular basis, so it more rapidly gets into flow balance > with the other stations. There are two parts to this process 1. the tactical (do you send the pending packet immediately, or do you delay it to see if you can save airtime with aggregation) 2. the strategic (once a queue of pending packets has built up, how do you pick which one to send) What you are talking about is the strategic part of it, where you assume that there is a queue of data to be sent, and picking which stuff to send first affects the performance. What I'm talking about is the tactical: before the queue has built, don't add time to the flow by delaying packets. Especially because in this case the odds are good that there is not going to be anything to aggregate with it. DNS udp packets aren't going to have anything else to aggregate with. 3-way handshake packets aren't going to have anything else to aggregate with (until and unless you are doing them while you have other stuff being transmitted, even parallel connections to different servers are likely to be spread out due to differences in network distance) http checks for cache validation are unlikely to have anything to aggregate with. The SSL handshake is a bit more complex, but there's not a lot of data moving in either direction at any step, and there are a lot of exchanges. With 'modern' AJAX sites, even after the entire page is rendered and the javascript starts running and fetching data you may have a page retrieve a lot of stuff, but with lazy coding, there are a lot of requests that retrieve very small amounts of data. Find some nasty sites (complexity wise) and do some sniffs on a nice, low-latency wired network and check the number of connections, and the sizes of all the packets (and their timing). Artificially add some horrid latency to the connection to exaggerate the influence of serialized steps and watch what happens. David Lang > (and then, for most use cases, like web, exits) > > the second phase, of actual transfer, is also bound by RTT. I have no > idea how much thought wifi folk actually put into typical web transfer > delays (20-80ms), > but they are there... > > ... > > The idea of the wifi driver waiting a bit to form a better aggregate > to fit into a txop ties into two slightly different timings and flow > behaviors. > > If it is taking 10ms to get a txop in the first place, taking more > time to assemble a good batch of packets to fit into "your" txop would > be good.
If you are not at a txop, all you can do is queue, so you queue. And when you get a txop, you send as much as you can (up to the configured max); no disagreement there. If the txop is a predictable minimum distance away (because you know that another station just started transmitting and will take 10ms), then you can spend more time being fancy about what you send and how you pack it. > > If it is taking 4ms to transfer your last txop, well, more packets may arrive for you in that interval, and feed into your existing flows to keep them going, > if you defer feeding the hardware with them. Yes, this strategy ideally is happening as close to the hardware as possible. > Also, classic tcp acking goes out the window with competing acks at layer 2. > > I don't know if quic can do the equivalent of stretch acks... > > but one layer 3 ack, block acked by layer 2 in wifi, suffices... if > you have a ton of tcp acks outstanding, block acking them all is > expensive... yes. >> So I believe that the KISS approach ends up with a 'worse is better' >> situation. > > Code is going to get more complex anyway, and there are other > optimizations that could be made. all the more reason to have that complexity on top of a simpler core :-) > One item I realized recently is that part of codel need not run on > every packet in every flow for stuff destined to fit into a single > txop. It is sufficient to see if it declared a drop on the first > packet in a flow destined for a given txop. > > You can then mark that entire flow (in a txop) as droppable (QoSNoAck) > within that txop (as it is within an RTT, and even losing all the > packets there will only cause the rate to halve). I would try to not drop all of them, in case the bitrate drops before you re-send (try to avoid having one txop worth of data become several). David Lang ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 5:22 ` David Lang @ 2016-05-10 9:04 ` Toke Høiland-Jørgensen 2016-05-11 14:12 ` Dave Täht 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2016-05-10 9:04 UTC (permalink / raw) To: David Lang Cc: Dave Taht, ath9k-devel@lists.ath9k.org, Randell Jesup, make-wifi-fast David Lang <david@lang.hm> writes: > There are two parts to this process > > 1. the tactical (do you send the pending packet immediately, or do you > delay it to see if you can save airtime with aggregation) A colleague of mine looked into this some time ago as part of his PhD thesis. This was pre-802.11n, so they were doing experiments on adding aggregation to 802.11g by messing with the MTU. What they found was that they actually got better results from just sending data when they had it rather than waiting to see if more showed up. -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 9:04 ` Toke Høiland-Jørgensen @ 2016-05-11 14:12 ` Dave Täht 2016-05-11 15:09 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Dave Täht @ 2016-05-11 14:12 UTC (permalink / raw) To: make-wifi-fast On 5/10/16 2:04 AM, Toke Høiland-Jørgensen wrote: > David Lang <david@lang.hm> writes: > >> There are two parts to this process >> >> 1. the tactical (do you send the pending packet immediately, or do you >> delay it to see if you can save airtime with aggregation) > > A colleague of mine looked into this some time ago as part of his PhD > thesis. This was pre-801.11n, so they were doing experiments on adding > aggregation to 802.11g by messing with the MTU. What they found was that > they actually got better results from just sending data when they had it > rather than waiting to see if more showed up. cat 802.11g_research/* > /dev/null > > -Toke > _______________________________________________ > Make-wifi-fast mailing list > Make-wifi-fast@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/make-wifi-fast > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-11 14:12 ` Dave Täht @ 2016-05-11 15:09 ` Dave Taht 2016-05-11 15:20 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-11 15:09 UTC (permalink / raw) To: Dave Täht; +Cc: make-wifi-fast On Wed, May 11, 2016 at 7:12 AM, Dave Täht <dave@taht.net> wrote: > > > On 5/10/16 2:04 AM, Toke Høiland-Jørgensen wrote: >> David Lang <david@lang.hm> writes: >> >>> There are two parts to this process >>> >>> 1. the tactical (do you send the pending packet immediately, or do you >>> delay it to see if you can save airtime with aggregation) >> >> A colleague of mine looked into this some time ago as part of his PhD >> thesis. This was pre-801.11n, so they were doing experiments on adding >> aggregation to 802.11g by messing with the MTU. What they found was that >> they actually got better results from just sending data when they had it >> rather than waiting to see if more showed up. > > cat 802.11g_research/* > /dev/null Sorry, that was overly pithy (precoffee here). What I'd meant was that 802.11e's assumptions as to how to do scheduling, particularly for QoS, and how 802.11g behaved in comparison to n and later, do not give you much of a starting point on how to address things now. Successive standards and implementations have made certain things much better and other things much worse. Adopting 802.11e style QoS framing - good idea. Adopting 802.11e style hw queue scheduling (EDCA) by mapping diffserv queues blindly to those hw queues - horrific, without also attempting to meet the service time requirements in the software queue management. I have been perpetually demonstrating 200+ms VO queues since day one, starving the other queues, where what you want is a very short queue (under 10ms), served sparsely (say, no more than 2 or 4/10ths of the overall airtime) - and only "The right stuff" to map into it. CS6/CS7 do not belong in VO. 802.11n - It became saner to just aggregate in most cases where you might have used 802.11e qos, steering flows into the next packed txop. And as noted elsewhere[1], per-station queuing so you can, indeed, aggregate sanely. Packing on all the new management frames and crypto without sanely managing those, not so good an idea. Holding multicast to its 1998 rate... grump. 802.11ac - adopting the better framing universally for all hw queues - good. Still blindly exposing those queues to userspace - horrible. Hiding most of the rate control and retry information (as all firmware seem to do thus far), tying our hands behind our backs. Using up all the channels in a world with an ever increasing density of aps and stations and trying to manage the allocations in hardware, scary. Four color theorem.... Cramming up to 4MBytes into a single TXOP - what a great lab result! I have no idea how to do that, and am pretty sure even trying is undesirable. ... I'd like to (re)start with 802.11ac assumptions and work backwards, rather than 802.11g assumptions and work forwards. The most desirable thing I'd love to see is hardware capable of turning the tail end of a txop around and sending some real data back, and to know if QosNoack can be selectively used..... And on my bad days, I really would like to go back to playing with 5mhz channels (which the ath9k still supports), and getting channel selection to work better. I'd rather have 5mbits of reliable low latency bandwidth in the real world, than 500Mbits in a faraday cage. /me goes in search of some more coffee.
[1] I really wanted people to argue with me about this talk one day... http://blog.cerowrt.org/post/talks/make-wifi-fast/ -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
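For the "served sparsely" point about the VO queue above, here is a toy sketch of gating a priority queue on its share of recent airtime; the fraction, the accounting window, and the names are illustrative assumptions only, not from any existing queue manager:

```
# Toy sketch: only let the priority (VO-like) queue win the next transmit
# opportunity while its share of recently accounted airtime stays under a
# configured fraction.
VO_MAX_SHARE = 0.2          # e.g. "no more than 2/10ths of the airtime"
WINDOW_US = 100_000         # crude accounting window

class AirtimeGate:
    def __init__(self):
        self.vo_us = 0
        self.total_us = 0

    def vo_may_send(self):
        if self.total_us == 0:
            return True
        return self.vo_us / self.total_us < VO_MAX_SHARE

    def account(self, is_vo, airtime_us):
        if is_vo:
            self.vo_us += airtime_us
        self.total_us += airtime_us
        if self.total_us >= WINDOW_US:   # reset so old history ages out
            self.vo_us = 0
            self.total_us = 0
```

A real implementation would also want the short (sub-10ms) queue depth and the DSCP-to-VO mapping restrictions mentioned above, not just the airtime share.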
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-11 15:09 ` Dave Taht @ 2016-05-11 15:20 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2016-05-11 15:20 UTC (permalink / raw) To: Dave Taht; +Cc: make-wifi-fast Dave Taht <dave.taht@gmail.com> writes: > On Wed, May 11, 2016 at 7:12 AM, Dave Täht <dave@taht.net> wrote: >> >> >> On 5/10/16 2:04 AM, Toke Høiland-Jørgensen wrote: >>> David Lang <david@lang.hm> writes: >>> >>>> There are two parts to this process >>>> >>>> 1. the tactical (do you send the pending packet immediately, or do you >>>> delay it to see if you can save airtime with aggregation) >>> >>> A colleague of mine looked into this some time ago as part of his PhD >>> thesis. This was pre-801.11n, so they were doing experiments on adding >>> aggregation to 802.11g by messing with the MTU. What they found was that >>> they actually got better results from just sending data when they had it >>> rather than waiting to see if more showed up. >> >> cat 802.11g_research/* > /dev/null > > Sorry, that was overly pithy (precoffee here). What I'd meant was that > 802.11e's assumptions as to how to do scheduling, particularly for > QoS, and how 802.11g behaved in comparison to n and later, does not > give you much of a starting point on how to address things now. > Successive standards and implementations have made certain things much > better and other things much worse. 802.11e != 802.11g. I am completely ignoring 802.11e for now, and the rest of the standard is not that different between g/n/ac; it's basically different rates, encodings and interframe gaps. Which from a scheduling point of view simply means different constants. Once we have a working baseline scheduler we can think about how to behave sanely in an 802.11e world; as far as I'm concerned the right answer might just be "shut it off entirely"... :P -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 2016-05-10 4:59 ` Dave Taht @ 2016-05-13 17:46 ` Bob McMahon 2016-05-13 17:49 ` Dave Taht 2016-05-13 20:49 ` David Lang 1 sibling, 2 replies; 25+ messages in thread From: Bob McMahon @ 2016-05-13 17:46 UTC (permalink / raw) To: David Lang; +Cc: Dave Taht, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 3952 bytes --] On driver delays, from a driver development perspective the problem isn't to add delay or not (it shouldn't) it's that the TCP stack isn't presenting sufficient data to fully utilize aggregation. Below is a histogram comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest latency stack is in purple and it's also the worst performance with respect to average throughput. From a driver perspective, one would like TCP to present sufficient bytes into the pipe that the histogram leans toward the blue. [image: Inline image 1] I'm not an expert on TCP near congestion avoidance but maybe the algorithm could benefit from RTT as weighted by CWND (or bytes in flight) and hunt that maximum? Bob On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > On Mon, 9 May 2016, Dave Taht wrote: > > On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >> wrote: >> >>> >>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>> >>>> should we always wait a little bit to see if we can form an aggregate? >>>> >>> >>> I thought the consensus on this front was “no”, as long as we’re making >>> the decision when we have an immediate transmit opportunity. >>> >> >> I think it is more nuanced than how david lang has presented it. >> > > I have four reasons for arguing for no speculative delays. > > 1. airtime that isn't used can't be saved. > > 2. lower best-case latency > > 3. simpler code > > 4. clean, and gradual service degredation under load. > > the arguments against are: > > 5. throughput per ms of transmit time is better if aggregation happens > than if it doesn't. > > 6. if you don't transmit, some other station may choose to before you > would have finished. > > #2 is obvious, but with the caviot that anytime you transmit you may be > delaying someone else. > > #1 and #6 are flip sides of each other. we want _someone_ to use the > airtime, the question is who. > > #3 and #4 are closely related. > > If you follow my approach (transmit immediately if you can, aggregate when > you have a queue), the code really has one mode (plus queuing). "If you > have a Transmit Oppertunity, transmit up to X packets from the queue", and > it doesn't matter if it's only one packet. > > If you delay the first packet to give you a chance to aggregate it with > others, you add in the complexity and overhead of timers (including > cancelling timers, slippage in timers, etc) and you add "first packet, > start timers" mode to deal with. > > I grant you that the first approach will "saturate" the airtime at lower > traffic levels, but at that point all the stations will start aggregating > the minimum amount needed to keep the air saturated, while still minimizing > latency. > > I then expect that application related optimizations would then further > complicate the second approach. there are just too many cases where small > amounts of data have to be sent and other things serialize behind them. 
> > DNS lookup to find a domain to then to a 3-way handshake to then do a > request to see if the <web something> library has been updated since last > cached (repeat for several libraries) to then fetch the actual page > content. All of these thing up to the actual page content could be single > packets that have to be sent (and responded to with a single packet), > waiting for the prior one to complete. If you add a few ms to each of > these, you can easily hit 100ms in added latency. Once you start to try and > special cases these sorts of things, the code complexity multiplies. > > So I believe that the KISS approach ends up with a 'worse is better' > situation. > > David Lang > _______________________________________________ > Make-wifi-fast mailing list > Make-wifi-fast@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/make-wifi-fast > > [-- Attachment #1.2: Type: text/html, Size: 5177 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 17:46 ` Bob McMahon @ 2016-05-13 17:49 ` Dave Taht 2016-05-13 18:05 ` Bob McMahon 2016-05-13 20:49 ` David Lang 1 sibling, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-13 17:49 UTC (permalink / raw) To: Bob McMahon; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 4497 bytes --] I try to stress that single tcp flows should never use all the bandwidth for the sawtooth to function properly. What happens when you hit it with 4 flows? or 12? nice graph, but I don't understand the single blue spikes? On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> wrote: > On driver delays, from a driver development perspective the problem isn't > to add delay or not (it shouldn't) it's that the TCP stack isn't presenting > sufficient data to fully utilize aggregation. Below is a histogram > comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest > latency stack is in purple and it's also the worst performance with respect > to average throughput. From a driver perspective, one would like TCP to > present sufficient bytes into the pipe that the histogram leans toward the > blue. > > [image: Inline image 1] > I'm not an expert on TCP near congestion avoidance but maybe the algorithm > could benefit from RTT as weighted by CWND (or bytes in flight) and hunt > that maximum? > > Bob > > On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > >> On Mon, 9 May 2016, Dave Taht wrote: >> >> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>> wrote: >>> >>>> >>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>> >>>>> should we always wait a little bit to see if we can form an aggregate? >>>>> >>>> >>>> I thought the consensus on this front was “no”, as long as we’re making >>>> the decision when we have an immediate transmit opportunity. >>>> >>> >>> I think it is more nuanced than how david lang has presented it. >>> >> >> I have four reasons for arguing for no speculative delays. >> >> 1. airtime that isn't used can't be saved. >> >> 2. lower best-case latency >> >> 3. simpler code >> >> 4. clean, and gradual service degredation under load. >> >> the arguments against are: >> >> 5. throughput per ms of transmit time is better if aggregation happens >> than if it doesn't. >> >> 6. if you don't transmit, some other station may choose to before you >> would have finished. >> >> #2 is obvious, but with the caviot that anytime you transmit you may be >> delaying someone else. >> >> #1 and #6 are flip sides of each other. we want _someone_ to use the >> airtime, the question is who. >> >> #3 and #4 are closely related. >> >> If you follow my approach (transmit immediately if you can, aggregate >> when you have a queue), the code really has one mode (plus queuing). "If >> you have a Transmit Oppertunity, transmit up to X packets from the queue", >> and it doesn't matter if it's only one packet. >> >> If you delay the first packet to give you a chance to aggregate it with >> others, you add in the complexity and overhead of timers (including >> cancelling timers, slippage in timers, etc) and you add "first packet, >> start timers" mode to deal with. >> >> I grant you that the first approach will "saturate" the airtime at lower >> traffic levels, but at that point all the stations will start aggregating >> the minimum amount needed to keep the air saturated, while still minimizing >> latency. 
>> >> I then expect that application related optimizations would then further >> complicate the second approach. there are just too many cases where small >> amounts of data have to be sent and other things serialize behind them. >> >> DNS lookup to find a domain to then to a 3-way handshake to then do a >> request to see if the <web something> library has been updated since last >> cached (repeat for several libraries) to then fetch the actual page >> content. All of these thing up to the actual page content could be single >> packets that have to be sent (and responded to with a single packet), >> waiting for the prior one to complete. If you add a few ms to each of >> these, you can easily hit 100ms in added latency. Once you start to try and >> special cases these sorts of things, the code complexity multiplies. >> >> So I believe that the KISS approach ends up with a 'worse is better' >> situation. >> >> David Lang >> _______________________________________________ >> Make-wifi-fast mailing list >> Make-wifi-fast@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/make-wifi-fast >> >> > -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org [-- Attachment #1.2: Type: text/html, Size: 6249 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 17:49 ` Dave Taht @ 2016-05-13 18:05 ` Bob McMahon 2016-05-13 18:11 ` Bob McMahon 2016-05-13 18:57 ` Dave Taht 0 siblings, 2 replies; 25+ messages in thread From: Bob McMahon @ 2016-05-13 18:05 UTC (permalink / raw) To: Dave Taht; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 5279 bytes --] The graphs are histograms of mpdu/ampdu, from 1 to 64. The blue spikes show that the vast majority of traffic is filling an ampdu with 64 mpdus. The fill stop reason is ampdu full. The purple fill stop reasons are that the sw fifo (above the driver) went empty indicating a too small CWND for maximum aggregation. A driver wants to aggregate to the fullest extent possible. A work around is to set initcwnd in the router table. I don't have the data available for multiple flows at the moment. Note: That will depend on what exactly defines a flow. Bob On Fri, May 13, 2016 at 10:49 AM, Dave Taht <dave.taht@gmail.com> wrote: > I try to stress that single tcp flows should never use all the bandwidth > for the sawtooth to function properly. > > What happens when you hit it with 4 flows? or 12? > > nice graph, but I don't understand the single blue spikes? > > On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> > wrote: > >> On driver delays, from a driver development perspective the problem isn't >> to add delay or not (it shouldn't) it's that the TCP stack isn't presenting >> sufficient data to fully utilize aggregation. Below is a histogram >> comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest >> latency stack is in purple and it's also the worst performance with respect >> to average throughput. From a driver perspective, one would like TCP to >> present sufficient bytes into the pipe that the histogram leans toward the >> blue. >> >> [image: Inline image 1] >> I'm not an expert on TCP near congestion avoidance but maybe the >> algorithm could benefit from RTT as weighted by CWND (or bytes in flight) >> and hunt that maximum? >> >> Bob >> >> On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >> >>> On Mon, 9 May 2016, Dave Taht wrote: >>> >>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>>> wrote: >>>> >>>>> >>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>>> >>>>>> should we always wait a little bit to see if we can form an aggregate? >>>>>> >>>>> >>>>> I thought the consensus on this front was “no”, as long as we’re >>>>> making the decision when we have an immediate transmit opportunity. >>>>> >>>> >>>> I think it is more nuanced than how david lang has presented it. >>>> >>> >>> I have four reasons for arguing for no speculative delays. >>> >>> 1. airtime that isn't used can't be saved. >>> >>> 2. lower best-case latency >>> >>> 3. simpler code >>> >>> 4. clean, and gradual service degredation under load. >>> >>> the arguments against are: >>> >>> 5. throughput per ms of transmit time is better if aggregation happens >>> than if it doesn't. >>> >>> 6. if you don't transmit, some other station may choose to before you >>> would have finished. >>> >>> #2 is obvious, but with the caviot that anytime you transmit you may be >>> delaying someone else. >>> >>> #1 and #6 are flip sides of each other. we want _someone_ to use the >>> airtime, the question is who. >>> >>> #3 and #4 are closely related. 
>>> >>> If you follow my approach (transmit immediately if you can, aggregate >>> when you have a queue), the code really has one mode (plus queuing). "If >>> you have a Transmit Oppertunity, transmit up to X packets from the queue", >>> and it doesn't matter if it's only one packet. >>> >>> If you delay the first packet to give you a chance to aggregate it with >>> others, you add in the complexity and overhead of timers (including >>> cancelling timers, slippage in timers, etc) and you add "first packet, >>> start timers" mode to deal with. >>> >>> I grant you that the first approach will "saturate" the airtime at lower >>> traffic levels, but at that point all the stations will start aggregating >>> the minimum amount needed to keep the air saturated, while still minimizing >>> latency. >>> >>> I then expect that application related optimizations would then further >>> complicate the second approach. there are just too many cases where small >>> amounts of data have to be sent and other things serialize behind them. >>> >>> DNS lookup to find a domain to then to a 3-way handshake to then do a >>> request to see if the <web something> library has been updated since last >>> cached (repeat for several libraries) to then fetch the actual page >>> content. All of these thing up to the actual page content could be single >>> packets that have to be sent (and responded to with a single packet), >>> waiting for the prior one to complete. If you add a few ms to each of >>> these, you can easily hit 100ms in added latency. Once you start to try and >>> special cases these sorts of things, the code complexity multiplies. >>> >>> So I believe that the KISS approach ends up with a 'worse is better' >>> situation. >>> >>> David Lang >>> _______________________________________________ >>> Make-wifi-fast mailing list >>> Make-wifi-fast@lists.bufferbloat.net >>> https://lists.bufferbloat.net/listinfo/make-wifi-fast >>> >>> >> > > > -- > Dave Täht > Let's go make home routers and wifi faster! With better software! > http://blog.cerowrt.org > [-- Attachment #1.2: Type: text/html, Size: 7231 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
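Rough arithmetic behind Bob's point that CWND, not the driver, is what is starving the aggregator: keeping a single 64-MPDU A-MPDU full needs on the order of 96 KB offered to the driver at once, far more than a freshly started TCP flow with a default initial window hands down. The MPDU size and MSS below are illustrative assumptions:

```
# How much data must be available to fill one 64-MPDU A-MPDU, versus what a
# new TCP flow offers.  All constants are illustrative.
MPDU_BYTES = 1500
BAW = 64
MSS = 1448
INITCWND = 10

ampdu_bytes = MPDU_BYTES * BAW                 # ~94 KB per full aggregate
initcwnd_bytes = MSS * INITCWND                # ~14 KB offered by a new flow

print(f"full A-MPDU: {ampdu_bytes / 1024:.0f} KB, "
      f"initcwnd {INITCWND}: {initcwnd_bytes / 1024:.0f} KB, "
      f"ratio ~{ampdu_bytes / initcwnd_bytes:.0f}x")
```

That roughly 7x gap is why a low-RTT stack that keeps CWND small shows up as the purple "sw fifo went empty" case in the histogram, and why raising initcwnd papers over it (and why Dave objects to doing so below).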
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 18:05 ` Bob McMahon @ 2016-05-13 18:11 ` Bob McMahon 2016-05-13 18:57 ` Dave Taht 1 sibling, 0 replies; 25+ messages in thread From: Bob McMahon @ 2016-05-13 18:11 UTC (permalink / raw) To: Dave Taht; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 5753 bytes --] Also, I haven't done it but I don't think rate limiting TCP will solve this aggregation "problem." The faster RTT is driving CWND much below the maximum aggregation, i.e. CWND is too small relative to wi-fi aggregation. Bob On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> wrote: > The graphs are histograms of mpdu/ampdu, from 1 to 64. The blue spikes > show that the vast majority of traffic is filling an ampdu with 64 mpdus. > The fill stop reason is ampdu full. The purple fill stop reasons are that > the sw fifo (above the driver) went empty indicating a too small CWND for > maximum aggregation. A driver wants to aggregate to the fullest extent > possible. A work around is to set initcwnd in the router table. > > I don't have the data available for multiple flows at the moment. Note: > That will depend on what exactly defines a flow. > > Bob > > On Fri, May 13, 2016 at 10:49 AM, Dave Taht <dave.taht@gmail.com> wrote: > >> I try to stress that single tcp flows should never use all the bandwidth >> for the sawtooth to function properly. >> >> What happens when you hit it with 4 flows? or 12? >> >> nice graph, but I don't understand the single blue spikes? >> >> On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> >> wrote: >> >>> On driver delays, from a driver development perspective the problem >>> isn't to add delay or not (it shouldn't) it's that the TCP stack isn't >>> presenting sufficient data to fully utilize aggregation. Below is a >>> histogram comparing aggregations of 3 systems (units are mpdu per ampdu.) >>> The lowest latency stack is in purple and it's also the worst performance >>> with respect to average throughput. From a driver perspective, one would >>> like TCP to present sufficient bytes into the pipe that the histogram leans >>> toward the blue. >>> >>> [image: Inline image 1] >>> I'm not an expert on TCP near congestion avoidance but maybe the >>> algorithm could benefit from RTT as weighted by CWND (or bytes in flight) >>> and hunt that maximum? >>> >>> Bob >>> >>> On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >>> >>>> On Mon, 9 May 2016, Dave Taht wrote: >>>> >>>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>>>> wrote: >>>>> >>>>>> >>>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>>>> >>>>>>> should we always wait a little bit to see if we can form an >>>>>>> aggregate? >>>>>>> >>>>>> >>>>>> I thought the consensus on this front was “no”, as long as we’re >>>>>> making the decision when we have an immediate transmit opportunity. >>>>>> >>>>> >>>>> I think it is more nuanced than how david lang has presented it. >>>>> >>>> >>>> I have four reasons for arguing for no speculative delays. >>>> >>>> 1. airtime that isn't used can't be saved. >>>> >>>> 2. lower best-case latency >>>> >>>> 3. simpler code >>>> >>>> 4. clean, and gradual service degredation under load. >>>> >>>> the arguments against are: >>>> >>>> 5. throughput per ms of transmit time is better if aggregation happens >>>> than if it doesn't. >>>> >>>> 6. 
if you don't transmit, some other station may choose to before you >>>> would have finished. >>>> >>>> #2 is obvious, but with the caviot that anytime you transmit you may be >>>> delaying someone else. >>>> >>>> #1 and #6 are flip sides of each other. we want _someone_ to use the >>>> airtime, the question is who. >>>> >>>> #3 and #4 are closely related. >>>> >>>> If you follow my approach (transmit immediately if you can, aggregate >>>> when you have a queue), the code really has one mode (plus queuing). "If >>>> you have a Transmit Oppertunity, transmit up to X packets from the queue", >>>> and it doesn't matter if it's only one packet. >>>> >>>> If you delay the first packet to give you a chance to aggregate it with >>>> others, you add in the complexity and overhead of timers (including >>>> cancelling timers, slippage in timers, etc) and you add "first packet, >>>> start timers" mode to deal with. >>>> >>>> I grant you that the first approach will "saturate" the airtime at >>>> lower traffic levels, but at that point all the stations will start >>>> aggregating the minimum amount needed to keep the air saturated, while >>>> still minimizing latency. >>>> >>>> I then expect that application related optimizations would then further >>>> complicate the second approach. there are just too many cases where small >>>> amounts of data have to be sent and other things serialize behind them. >>>> >>>> DNS lookup to find a domain to then to a 3-way handshake to then do a >>>> request to see if the <web something> library has been updated since last >>>> cached (repeat for several libraries) to then fetch the actual page >>>> content. All of these thing up to the actual page content could be single >>>> packets that have to be sent (and responded to with a single packet), >>>> waiting for the prior one to complete. If you add a few ms to each of >>>> these, you can easily hit 100ms in added latency. Once you start to try and >>>> special cases these sorts of things, the code complexity multiplies. >>>> >>>> So I believe that the KISS approach ends up with a 'worse is better' >>>> situation. >>>> >>>> David Lang >>>> _______________________________________________ >>>> Make-wifi-fast mailing list >>>> Make-wifi-fast@lists.bufferbloat.net >>>> https://lists.bufferbloat.net/listinfo/make-wifi-fast >>>> >>>> >>> >> >> >> -- >> Dave Täht >> Let's go make home routers and wifi faster! With better software! >> http://blog.cerowrt.org >> > > [-- Attachment #1.2: Type: text/html, Size: 7966 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
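Bob's observation that CWND sits well below what a full A-MPDU needs reduces to simple arithmetic. A back-of-the-envelope sketch, with the segment size, aggregate depth and window values all assumed for illustration (this is not measured data from the plots above):

```python
# Back-of-the-envelope numbers behind "CWND is too small relative to wi-fi
# aggregation".  All values are illustrative assumptions, not measurements.
MSS       = 1448        # TCP payload bytes per MPDU (1500-byte MTU)
MAX_MPDUS = 64          # fill target for one A-MPDU, as in the histograms

full_aggregate = MSS * MAX_MPDUS        # ~92 KB to fill one A-MPDU
print(f"one full A-MPDU needs ~{full_aggregate // 1024} KB queued for the station")

for cwnd in (10, 20, 64, 128):          # 10 is the Linux default initcwnd
    fill = min(1.0, cwnd / MAX_MPDUS)   # best case: every in-flight segment
    print(f"cwnd={cwnd:4d} segments -> at most {fill:4.0%} of an A-MPDU")
```

With the default initial window of 10, a single flow cannot cover even a sixth of a 64-MPDU aggregate until the window has grown, which is why raising initcwnd looks attractive from the driver's side, and why the next message objects to what such a burst does to slower links upstream.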
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 18:05 ` Bob McMahon 2016-05-13 18:11 ` Bob McMahon @ 2016-05-13 18:57 ` Dave Taht 2016-05-13 19:20 ` Aaron Wood 1 sibling, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-13 18:57 UTC (permalink / raw) To: Bob McMahon; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 8802 bytes --] On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> wrote: > The graphs are histograms of mpdu/ampdu, from 1 to 64. The blue spikes > show that the vast majority of traffic is filling an ampdu with 64 mpdus. > The fill stop reason is ampdu full. The purple fill stop reasons are that > the sw fifo (above the driver) went empty indicating a too small CWND for > maximum aggregation. > Can I get you to drive this plot, wifi rate limited to 6,20,and 300mbits and "native", with flent's tcp_upload and rtt_fair_up tests? My goal is far different than getting a single tcp flow to max speed, it is to get it to close to full throughput with multiple flows while not accumulating 2 sec of buffering... http://blog.cerowrt.org/post/rtt_fair_on_wifi/ Or even 100ms of it: http://blog.cerowrt.org/post/ath10_ath9k_1/ Early experiments with getting a good rate estimate "to fill the queue" from rate control info was basically successful, but lacking rate control, using dql only, is currently taking much longer at higher rates, but works well at lower ones. http://blog.cerowrt.org/post/dql_on_wifi_2/ http://blog.cerowrt.org/post/dql_on_wifi > A driver wants to aggregate to the fullest extent possible. > While still retaining tcp congestion control. There are other nuances, like being nice about total airtime to others sharing the media, minimizing retries due to an overlarge ampdu for the current BER, etc. I don't remember what section of the 802.11-2012 standard this is from, but: ``` Another unresolved issue is how large a concatenation threshold the devices should set. Ideally, the maximum value is preferable but in a noisy environment, short frame lengths are preferred because of potential retransmissions. The A-MPDU concatenation scheme operates only over the packets that are already buffered in the transmission queue, and thus, if the CPR data rate is low, then efficiency also will be small. There are many ongoing studies on alternative queuing mechanisms different from the standard FIFO. *A combination of frame aggregation and an enhanced queuing algorithm could increase channel efficiency further*. ``` A work around is to set initcwnd in the router table. > Ugh. Um... no... initcwnd 10 is already too large for many networks. If you set your wifi initcwnd to something like 64, what happens to the 5mbit cable uplink just upstream from that? There are a couple other parameters that might be of use - tcp_send_lowat and tcp_limit_output_bytes. These were set off, and originally too low for wifi. A good setting for the latter, for ethernet, was about 4096. Then the wifi folk complained, and it got bumped to 64k, and I think now, it's at 256k to make the xen folk happier. These are all work arounds against the real problem which was not tuning driver queueing to the actual achievable ampdu, and doing fq+aqm to spread the load (essentially "pace" bursts) which is what is happening in michal's patches. > I don't have the data available for multiple flows at the moment. > The world is full of folk trying to make single tcp flows go at maximum speed, with multiple alternatives to cubic. 
This quest has resulted in the near elimination of the sawtooth along the edge and horrific overbuffering, to a net loss in speed, and a huge perception of "slowness". Note: I have long figured that a different tcp should be used on wifi uplinks, after we fixed a ton of basic mis-assumptions. As well as tcp's should become more wifi/wireless aware, but tweaking initcwnd, tcp_limit_output_bytes, etc, is not the right thing. There has been some good tcp research published of late, look into "BBR", and "CDG". > Note: That will depend on what exactly defines a flow. > > Bob > > On Fri, May 13, 2016 at 10:49 AM, Dave Taht <dave.taht@gmail.com> wrote: > >> I try to stress that single tcp flows should never use all the bandwidth >> for the sawtooth to function properly. >> >> What happens when you hit it with 4 flows? or 12? >> >> nice graph, but I don't understand the single blue spikes? >> >> On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> >> wrote: >> >>> On driver delays, from a driver development perspective the problem >>> isn't to add delay or not (it shouldn't) it's that the TCP stack isn't >>> presenting sufficient data to fully utilize aggregation. Below is a >>> histogram comparing aggregations of 3 systems (units are mpdu per ampdu.) >>> The lowest latency stack is in purple and it's also the worst performance >>> with respect to average throughput. From a driver perspective, one would >>> like TCP to present sufficient bytes into the pipe that the histogram leans >>> toward the blue. >>> >>> [image: Inline image 1] >>> I'm not an expert on TCP near congestion avoidance but maybe the >>> algorithm could benefit from RTT as weighted by CWND (or bytes in flight) >>> and hunt that maximum? >>> >>> Bob >>> >>> On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >>> >>>> On Mon, 9 May 2016, Dave Taht wrote: >>>> >>>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>>>> wrote: >>>>> >>>>>> >>>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>>>> >>>>>>> should we always wait a little bit to see if we can form an >>>>>>> aggregate? >>>>>>> >>>>>> >>>>>> I thought the consensus on this front was “no”, as long as we’re >>>>>> making the decision when we have an immediate transmit opportunity. >>>>>> >>>>> >>>>> I think it is more nuanced than how david lang has presented it. >>>>> >>>> >>>> I have four reasons for arguing for no speculative delays. >>>> >>>> 1. airtime that isn't used can't be saved. >>>> >>>> 2. lower best-case latency >>>> >>>> 3. simpler code >>>> >>>> 4. clean, and gradual service degredation under load. >>>> >>>> the arguments against are: >>>> >>>> 5. throughput per ms of transmit time is better if aggregation happens >>>> than if it doesn't. >>>> >>>> 6. if you don't transmit, some other station may choose to before you >>>> would have finished. >>>> >>>> #2 is obvious, but with the caviot that anytime you transmit you may be >>>> delaying someone else. >>>> >>>> #1 and #6 are flip sides of each other. we want _someone_ to use the >>>> airtime, the question is who. >>>> >>>> #3 and #4 are closely related. >>>> >>>> If you follow my approach (transmit immediately if you can, aggregate >>>> when you have a queue), the code really has one mode (plus queuing). "If >>>> you have a Transmit Oppertunity, transmit up to X packets from the queue", >>>> and it doesn't matter if it's only one packet. 
>>>> >>>> If you delay the first packet to give you a chance to aggregate it with >>>> others, you add in the complexity and overhead of timers (including >>>> cancelling timers, slippage in timers, etc) and you add "first packet, >>>> start timers" mode to deal with. >>>> >>>> I grant you that the first approach will "saturate" the airtime at >>>> lower traffic levels, but at that point all the stations will start >>>> aggregating the minimum amount needed to keep the air saturated, while >>>> still minimizing latency. >>>> >>>> I then expect that application related optimizations would then further >>>> complicate the second approach. there are just too many cases where small >>>> amounts of data have to be sent and other things serialize behind them. >>>> >>>> DNS lookup to find a domain to then to a 3-way handshake to then do a >>>> request to see if the <web something> library has been updated since last >>>> cached (repeat for several libraries) to then fetch the actual page >>>> content. All of these thing up to the actual page content could be single >>>> packets that have to be sent (and responded to with a single packet), >>>> waiting for the prior one to complete. If you add a few ms to each of >>>> these, you can easily hit 100ms in added latency. Once you start to try and >>>> special cases these sorts of things, the code complexity multiplies. >>>> >>>> So I believe that the KISS approach ends up with a 'worse is better' >>>> situation. >>>> >>>> David Lang >>>> _______________________________________________ >>>> Make-wifi-fast mailing list >>>> Make-wifi-fast@lists.bufferbloat.net >>>> https://lists.bufferbloat.net/listinfo/make-wifi-fast >>>> >>>> >>> >> >> >> -- >> Dave Täht >> Let's go make home routers and wifi faster! With better software! >> http://blog.cerowrt.org >> > > -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org [-- Attachment #1.2: Type: text/html, Size: 13096 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
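The "what happens to the 5mbit cable uplink" objection is also just arithmetic. A rough sketch, with the uplink rate, RTT and window sizes assumed for illustration:

```python
# Rough cost of a large initial-window burst landing on a slow uplink just
# upstream of the wifi hop.  Rates, RTT and window sizes are assumptions.
MSS        = 1448                    # bytes per segment
UPLINK_BPS = 5_000_000               # 5 Mbit/s cable uplink
RTT        = 0.05                    # 50 ms path RTT

bdp = UPLINK_BPS / 8 * RTT           # what the path itself can hold
print(f"path BDP ~= {bdp/1024:.0f} KB ~= {bdp/MSS:.0f} segments")

for initcwnd in (10, 64):
    burst_bytes = initcwnd * MSS
    drain_ms = burst_bytes * 8 / UPLINK_BPS * 1000
    print(f"initcwnd={initcwnd:3d}: {burst_bytes/1024:5.1f} KB burst, "
          f"~{drain_ms:5.1f} ms of queue at 5 Mbit/s")
```

An initial window sized to fill a wifi aggregate is already several times this path's entire BDP, which is the argument for fixing the queueing at the wifi hop (fq+aqm, pacing the bursts) rather than inflating windows end to end.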
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 18:57 ` Dave Taht @ 2016-05-13 19:20 ` Aaron Wood 2016-05-13 20:21 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Aaron Wood @ 2016-05-13 19:20 UTC (permalink / raw) To: Dave Taht; +Cc: Bob McMahon, make-wifi-fast, ath9k-devel@lists.ath9k.org [-- Attachment #1: Type: text/plain, Size: 698 bytes --] On Fri, May 13, 2016 at 11:57 AM, Dave Taht <dave.taht@gmail.com> wrote: > On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> > wrote: > >> don't have the data available for multiple flows at the moment. >> > > The world is full of folk trying to make single tcp flows go at maximum > speed, with multiple alternatives to cubic. > And most web traffic is multiple-flow, even with HTTP/2 and SPDY, due to domain/host sharding. Many bursts of multi-flow traffic. About the only thing that's single-flow is streaming-video (which isn't latency sensitive). The only local services that I know of that could use maximal-rate wifi are NAS systems using SMB, AFP, etc. -Aaron [-- Attachment #2: Type: text/html, Size: 1435 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 19:20 ` Aaron Wood @ 2016-05-13 20:21 ` Dave Taht 2016-05-13 20:51 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-13 20:21 UTC (permalink / raw) To: Aaron Wood; +Cc: Bob McMahon, make-wifi-fast, ath9k-devel@lists.ath9k.org On Fri, May 13, 2016 at 12:20 PM, Aaron Wood <woody77@gmail.com> wrote: > On Fri, May 13, 2016 at 11:57 AM, Dave Taht <dave.taht@gmail.com> wrote: >> >> On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> >> wrote: >>> >>> don't have the data available for multiple flows at the moment. >> >> >> The world is full of folk trying to make single tcp flows go at maximum >> speed, with multiple alternatives to cubic. > > > And most web traffic is multiple-flow, even with HTTP/2 and SPDY, due to > domain/host sharding. Many bursts of multi-flow traffic. About the only > thing that's single-flow is streaming-video (which isn't latency sensitive). And usually, rate limited. It would be nice if that streaming video actually fit into a single txop in many cases. > The only local services that I know of that could use maximal-rate wifi are > NAS systems using SMB, AFP, etc. And many of these, until recently, were actually bound by the speed of their hard disks and by inefficiencies in the protocol. > > -Aaron Useful flent tests for seeing the impact of good congestion control are tcp_2up_square and tcp_2up_delay. There are also other related tests like "reno_cubic_westwood_cdg" which try one form of tcp against another. I really should sit down and write a piece about these, to try to show that one flow grabbing all the link hurts all successor flows. Both could be better. I like what the teacup people are doing here, using 3 staggered flows to show their results. ... and I misspoke a bit earlier, meant to say txop where instead I'd said ampdu. Multiple ampdus can fit into a txop, and so far as I know, be block acked differently. https://books.google.com/books?id=XsF5CgAAQBAJ&pg=PA32&lpg=PA32&dq=multiple+ampdus+in+a+txop&source=bl&ots=dRCYcD9rBc&sig=tVocMORuEXBOsfUlcmuSLTdM0Lw&hl=en&sa=X&ved=0ahUKEwiLxurP69fMAhVU5WMKHVejAlUQ6AEIHzAA#v=onepage&q=multiple%20ampdus%20in%20a%20txop&f=false One thing I don't have a grip on is the airtime cost of packing multiple ampdus into a txop, in terms of the block ack, also in relation to using A-MSDUs as per the 2015 paper referenced off the "thoughts about airtime fairness thread" that the ath9k list was not cc'd on. https://lists.bufferbloat.net/pipermail/make-wifi-fast/2016-May/000661.html I note that some of my comments on that thread were due to the overly EE- and math-oriented analysis of the "perfect" solution, but I'm over that now. :) It was otherwise one of the best recent papers on wifi I've read, and more should read: http://www.hindawi.com/journals/misy/2015/548109/ (and all the other cites in that thread were good, too. MIT had the basics right back in 2003!) One of my longer-term dreams for better congestion control in wifi is to pack one aggregate in a txop with stuff you care about deeply, and a second, with stuff you don't (or vice versa). As also per here, filling in my personal memory gap from 2004 or so (when I thought block acks would only be used on critical traffic) and where I started going back and reviewing the standard. http://blog.cerowrt.org/post/selective_unprotect/ -- Dave Täht Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
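On the open question of the airtime cost of packing multiple A-MPDUs (and their block acks) into one TXOP, a very rough sketch is possible even without real 802.11 timing. The PHY rates, the per-block-ack cost and the TXOP budget below are all assumptions for illustration, ignoring preambles, MPDU delimiters, protection and the rate the BA itself is sent at:

```python
# Crude airtime model: how many full 64-MPDU A-MPDUs (plus one block ack
# exchange each) fit in an assumed TXOP budget at various PHY rates.
# All constants are illustrative assumptions, not 802.11 timing math.
AMPDU_BYTES  = 64 * 1538      # 64 max-size MPDUs including MAC framing
BLOCK_ACK_US = 68.0           # assumed cost of one BA exchange incl. SIFS
TXOP_US      = 3000.0         # assumed ~3 ms TXOP budget

for phy_mbps in (6, 20, 65, 150, 300, 866):
    ampdu_us = AMPDU_BYTES * 8 / phy_mbps      # bytes*8 bits / Mbit/s -> us
    per_agg  = ampdu_us + BLOCK_ACK_US
    fits     = int(TXOP_US // per_agg)
    print(f"{phy_mbps:3d} Mbit/s: full A-MPDU ~= {ampdu_us/1000:6.2f} ms, "
          f"{fits} fit in a {TXOP_US/1000:.0f} ms TXOP")
```

At low and mid rates a full 64-MPDU aggregate does not fit at all, so the driver has to cap the aggregate to the TXOP anyway; only at the higher VHT-class rates do several full aggregates share one TXOP, and that is where the cost of the extra block acks starts to matter.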
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 20:21 ` Dave Taht @ 2016-05-13 20:51 ` Dave Taht 0 siblings, 0 replies; 25+ messages in thread From: Dave Taht @ 2016-05-13 20:51 UTC (permalink / raw) To: Aaron Wood; +Cc: Bob McMahon, make-wifi-fast, ath9k-devel@lists.ath9k.org The reason I originally cc'd ath9k-devel on this thread was merely that I wanted to know if the diagram of the ath9k driver's current structure was actually accurate. Is it? http://blog.cerowrt.org/post/wifi_software_paths/ The "make-wifi-fast" mailing list exists merely because keeping up on patch traffic on linux-wireless and elsewhere is too difficult for many of the people on it (including me!), and we do tend to wax and wane on issues of more theoretical interest. That said, I have a habit of cross-posting that comes from being an ancient decrepit former usenet junkie where topics did engage multiple people, and it's also a way to draw attention to the project here. It is *very* annoying how many posts get bounced off the mailing lists... still.... Arguing in an echo chamber is futile. I am curious, in terms of further outreach, if there are any other good wifi/5g/6lowpan/wireless mailing lists to be on - I'd like in particular to find one for iwl and the mediatek chips, and if anyone is working on fixing up the rpi3's wifi, that would be good too. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 17:46 ` Bob McMahon 2016-05-13 17:49 ` Dave Taht @ 2016-05-13 20:49 ` David Lang 1 sibling, 0 replies; 25+ messages in thread From: David Lang @ 2016-05-13 20:49 UTC (permalink / raw) To: Bob McMahon; +Cc: Dave Taht, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1: Type: TEXT/Plain, Size: 5126 bytes --] On Fri, 13 May 2016, Bob McMahon wrote: > On driver delays, from a driver development perspective the problem isn't > to add delay or not (it shouldn't) it's that the TCP stack isn't presenting > sufficient data to fully utilize aggregation. Below is a histogram > comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest > latency stack is in purple and it's also the worst performance with respect > to average throughput. From a driver perspective, one would like TCP to > present sufficient bytes into the pipe that the histogram leans toward the > blue. The problem isn't providing sufficient bytes into the pipe, it's that to do aggregations sanely you need lots of bytes for a particular station in a separate queue from the bytes for each of the other stations. Current qdisc settings don't provide this to the driver, they provide a packet to host A, a packet to host B, a packet to host C, etc. What's needed is for the higher level (fq_codel, etc) to provide a queue for host A, a queue for host B, a queue for host C and throttle what goes into these queues, and then the driver grabs an aggregate's worth from the first queue and sends it to host A, an aggregate's worth for host B (which is a different number of bytes because B has a different data rate than A), etc. The current model of the tcp stack having one set of queues and the driver having a separate set of queues just doesn't work for this. One of the sets of queues needs to go away and the queues that remain need to be able to be filled with fairness between flows in mind (fq_codel or similar) and drained with fairness of airtime in mind (and efficient aggregation). David Lang > [image: Inline image 1] > I'm not an expert on TCP near congestion avoidance but maybe the algorithm > could benefit from RTT as weighted by CWND (or bytes in flight) and hunt > that maximum? > > Bob > > On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > >> On Mon, 9 May 2016, Dave Taht wrote: >> >> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>> wrote: >>> >>>> >>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>> >>>>> should we always wait a little bit to see if we can form an aggregate? >>>>> >>>> >>>> I thought the consensus on this front was “no”, as long as we’re making >>>> the decision when we have an immediate transmit opportunity. >>>> >>> >>> I think it is more nuanced than how david lang has presented it. >>> >> >> I have four reasons for arguing for no speculative delays. >> >> 1. airtime that isn't used can't be saved. >> >> 2. lower best-case latency >> >> 3. simpler code >> >> 4. clean, and gradual service degredation under load. >> >> the arguments against are: >> >> 5. throughput per ms of transmit time is better if aggregation happens >> than if it doesn't. >> >> 6. if you don't transmit, some other station may choose to before you >> would have finished. >> >> #2 is obvious, but with the caviot that anytime you transmit you may be >> delaying someone else. >> >> #1 and #6 are flip sides of each other. we want _someone_ to use the >> airtime, the question is who.
>> >> #3 and #4 are closely related. >> >> If you follow my approach (transmit immediately if you can, aggregate when >> you have a queue), the code really has one mode (plus queuing). "If you >> have a Transmit Oppertunity, transmit up to X packets from the queue", and >> it doesn't matter if it's only one packet. >> >> If you delay the first packet to give you a chance to aggregate it with >> others, you add in the complexity and overhead of timers (including >> cancelling timers, slippage in timers, etc) and you add "first packet, >> start timers" mode to deal with. >> >> I grant you that the first approach will "saturate" the airtime at lower >> traffic levels, but at that point all the stations will start aggregating >> the minimum amount needed to keep the air saturated, while still minimizing >> latency. >> >> I then expect that application related optimizations would then further >> complicate the second approach. there are just too many cases where small >> amounts of data have to be sent and other things serialize behind them. >> >> DNS lookup to find a domain to then to a 3-way handshake to then do a >> request to see if the <web something> library has been updated since last >> cached (repeat for several libraries) to then fetch the actual page >> content. All of these thing up to the actual page content could be single >> packets that have to be sent (and responded to with a single packet), >> waiting for the prior one to complete. If you add a few ms to each of >> these, you can easily hit 100ms in added latency. Once you start to try and >> special cases these sorts of things, the code complexity multiplies. >> >> So I believe that the KISS approach ends up with a 'worse is better' >> situation. >> >> David Lang >> _______________________________________________ >> Make-wifi-fast mailing list >> Make-wifi-fast@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/make-wifi-fast >> >> > [-- Attachment #2: Type: IMAGE/PNG, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
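What David Lang is describing (per-station queues, filled with per-flow fairness above, drained by airtime rather than packet count below) has a fairly compact shape. A toy sketch of that structure, in illustrative Python only and not mac80211/ath9k code, with the rate estimate, quantum and aggregate cap as assumed parameters:

```python
# Toy model of "one queue per station, drain by airtime": deficit round
# robin where the deficit is counted in microseconds of estimated airtime,
# not in packets or bytes.  Illustrative only; not mac80211/ath9k code.
from collections import defaultdict, deque

class PerStationTx:
    QUANTUM_US = 1000                       # airtime credit added per visit

    def __init__(self, rate_bps_for):
        self.queues  = defaultdict(deque)   # station -> FIFO of packets (bytes)
        self.deficit = defaultdict(float)   # station -> unused airtime credit
        self.rate_bps_for = rate_bps_for    # per-station rate-control estimate

    def enqueue(self, station, pkt):
        # in a real stack, per-flow fairness and AQM (fq_codel-style) would
        # run here, before the packet lands in the station's queue
        self.queues[station].append(pkt)

    def airtime_us(self, station, nbytes):
        return nbytes * 8 * 1e6 / self.rate_bps_for(station)

    def next_aggregate(self, max_agg_us=2000.0):
        # simplified: stations are scanned in dict order on every call
        for station, q in self.queues.items():
            if not q:
                continue
            self.deficit[station] += self.QUANTUM_US
            budget = min(self.deficit[station], max_agg_us)
            agg, used = [], 0.0
            while q and used + self.airtime_us(station, len(q[0])) <= budget:
                pkt = q.popleft()
                used += self.airtime_us(station, len(pkt))
                agg.append(pkt)
            if agg:
                self.deficit[station] -= used
                return station, agg
        return None, []
```

Counting the deficit in microseconds rather than bytes is what makes a slow station's short aggregate and a fast station's 64-MPDU aggregate cost the same share of air, which is the "drained with fairness of airtime in mind" half of the argument.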
end of thread, other threads:[~2016-05-13 20:51 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-05-09 11:00 [Make-wifi-fast] Diagram of the ath9k TX path Toke Høiland-Jørgensen 2016-05-09 15:35 ` Dave Taht 2016-05-10 2:25 ` Jonathan Morton 2016-05-10 2:59 ` Dave Taht 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd 2016-05-10 4:04 ` Dave Taht 2016-05-10 4:22 ` Aaron Wood 2016-05-10 7:15 ` Adrian Chadd 2016-05-10 7:17 ` Adrian Chadd 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 2016-05-10 4:59 ` Dave Taht 2016-05-10 5:22 ` David Lang 2016-05-10 9:04 ` Toke Høiland-Jørgensen 2016-05-11 14:12 ` Dave Täht 2016-05-11 15:09 ` Dave Taht 2016-05-11 15:20 ` Toke Høiland-Jørgensen 2016-05-13 17:46 ` Bob McMahon 2016-05-13 17:49 ` Dave Taht 2016-05-13 18:05 ` Bob McMahon 2016-05-13 18:11 ` Bob McMahon 2016-05-13 18:57 ` Dave Taht 2016-05-13 19:20 ` Aaron Wood 2016-05-13 20:21 ` Dave Taht 2016-05-13 20:51 ` Dave Taht 2016-05-13 20:49 ` David Lang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox