* [Make-wifi-fast] Diagram of the ath9k TX path @ 2016-05-09 11:00 Toke Høiland-Jørgensen 2016-05-09 15:35 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2016-05-09 11:00 UTC (permalink / raw) To: make-wifi-fast I finally finished my flow diagram of the ath9k TX path (corresponding to the previous one I did for the mac80211 stack). In case anyone else is interested, it's available here: https://blog.tohojo.dk/2016/05/the-ath9k-tx-path.html -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-09 11:00 [Make-wifi-fast] Diagram of the ath9k TX path Toke Høiland-Jørgensen @ 2016-05-09 15:35 ` Dave Taht 2016-05-10 2:25 ` Jonathan Morton 0 siblings, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-09 15:35 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: make-wifi-fast, ath9k-devel On Mon, May 9, 2016 at 4:00 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote: > I finally finished my flow diagram of the ath9k TX path (corresponding > to the previous one I did for the mac80211 stack). In case anyone else > is interested, it's available here: > > https://blog.tohojo.dk/2016/05/the-ath9k-tx-path.html Looks quite helpful. I do not understand why there is a "fast path" at all in this driver; should we always wait a little bit to see if we can form an aggregate? It would be awesome to be able to adapt michal's work on fq_codeling things, leveraging rate control information as he did here: http://blog.cerowrt.org/post/fq_codel_on_ath10k/ rather than dql as he did here: http://blog.cerowrt.org/post/dql_on_wifi_2/ to the ath9k. > -Toke > _______________________________________________ > Make-wifi-fast mailing list > Make-wifi-fast@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/make-wifi-fast -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-09 15:35 ` Dave Taht @ 2016-05-10 2:25 ` Jonathan Morton 2016-05-10 2:59 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Jonathan Morton @ 2016-05-10 2:25 UTC (permalink / raw) To: Dave Taht; +Cc: Toke Høiland-Jørgensen, make-wifi-fast, ath9k-devel > On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: > > should we always wait a little bit to see if we can form an aggregate? I thought the consensus on this front was “no”, as long as we’re making the decision when we have an immediate transmit opportunity. If we *don’t* have an immediate transmit opportunity, then we *must* wait regardless, and maybe some other packets will arrive which can then be aggregated. - Jonathan Morton ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 2:25 ` Jonathan Morton @ 2016-05-10 2:59 ` Dave Taht 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 0 siblings, 2 replies; 25+ messages in thread From: Dave Taht @ 2016-05-10 2:59 UTC (permalink / raw) To: Jonathan Morton Cc: Toke Høiland-Jørgensen, make-wifi-fast, ath9k-devel On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> wrote: > >> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >> >> should we always wait a little bit to see if we can form an aggregate? > > I thought the consensus on this front was “no”, as long as we’re making the decision when we have an immediate transmit opportunity. I think it is more nuanced than how david lang has presented it. We haven't argued the finer points just yet - merely seeing 12-20ms latency across the entire 6-300mbit range I've tested thus far has been a joy, and I'd like to at least think about ways to cut another order of magnitude off of that while making better use of packing the medium. http://blog.cerowrt.org/post/anomolies_thus_far/ So... I don't think we "achieved consensus", I just faded... I thought at the time that merely getting down from 2+ seconds to 20ms of induced latency was vastly more important :), and I didn't want to belabor the point until we had some solid results. I'll still settle for "1 agg in the hardware, 1 agg in the driver"... but smaller, and better formed, aggs under contention - which might sometimes involve a pause for a hundred usec to gather up more, when empty, or more, when the driver is known to be busy. ... Over the weekend I did some experiments setting the beacon-advertised txop size for best effort traffic to 94 (same size as the vi queue that was so busted in earlier tests ( http://blog.cerowrt.org/post/cs5_lockout/ ) ) to try to see if the station (or AP) paid attention to it... the bandwidth symmetry I got compared to the defaults was remarkable. This chart also shows the size of the win against the stock ath10k firmware and driver in terms of latency, and not having flows collapse... http://blog.cerowrt.org/flent/txop_94/rtt_fairbe_compared.svg Now, given that most people use wifi asymmetrically, perhaps there are fewer use cases for when the AP and station work more symmetrically, but this was a pretty result. http://blog.cerowrt.org/flent/dual-txop-94/up_down_vastly_better.svg Haven't finished writing up the result, other than to note that tweaking this parameter had no seeming effect on the baseline 10-15ms of driver latency left in it, under load. > > If we *don’t* have an immediate transmit opportunity, then we *must* wait regardless, and maybe some other packets will arrive which can then be aggregated. > > - Jonathan Morton > -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 2:59 ` Dave Taht @ 2016-05-10 3:30 ` Adrian Chadd 2016-05-10 4:04 ` Dave Taht 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 1 sibling, 1 reply; 25+ messages in thread From: Adrian Chadd @ 2016-05-10 3:30 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen Hi, So: * the hardware can give us a per-AC transmit opportunity; * software queuing needs to handle the per-STA transmit opportunity; * they (and I followed convention after testing) found the "best" compromise was to hardware queue up to two frames, which we could probably do slightly more of at higher MCS rates for "reasons", but if we're getting enough packets come in then if the hardware queues get drained slower than we can fill them, we naturally aggregate traffic. So it actually works pretty well in practice. The general aim is to keep up to ~8ms of aggregates queued, and that's typically two aggregate frames so we don't bust the block-ack window. -adrian ^ permalink raw reply [flat|nested] 25+ messages in thread
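As a rough illustration of what Adrian's "keep up to ~8ms of aggregates queued" budget means at different PHY rates, here is a back-of-the-envelope sketch. The MPDU size, the rate list, and the neglect of preamble/MAC overhead are illustrative assumptions, not numbers from his setup:

```
# Roughly how many full-size MPDUs fit in an 8 ms airtime budget at a few
# PHY rates (ignoring preamble, interframe spacing and MAC overhead).
MPDU_BYTES = 1500      # illustrative full-size MPDU
BAW = 64               # block-ack window: max MPDUs per A-MPDU

for phy_mbit in (24, 72, 150, 300):
    bytes_in_8ms = phy_mbit * 1e6 / 8 * 0.008
    mpdus = bytes_in_8ms / MPDU_BYTES
    print(f"{phy_mbit:3d} Mbit/s: ~{bytes_in_8ms / 1024:5.0f} KB in 8 ms, "
          f"~{mpdus:5.0f} MPDUs (BAW caps one aggregate at {BAW})")
```

At low rates the 8 ms budget is only a handful of MPDUs, well under one full aggregate, while at the highest rates two full 64-MPDU aggregates already fit inside it, which lines up with the "two aggregate frames queued to hardware" compromise described above.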
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd @ 2016-05-10 4:04 ` Dave Taht 2016-05-10 4:22 ` Aaron Wood 2016-05-10 7:15 ` Adrian Chadd 0 siblings, 2 replies; 25+ messages in thread From: Dave Taht @ 2016-05-10 4:04 UTC (permalink / raw) To: Adrian Chadd Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen On Mon, May 9, 2016 at 8:30 PM, Adrian Chadd <adrian@freebsd.org> wrote: > Hi, > > So: > > * the hardware can give us a per-AC transmit opportunity; > * software queuing needs to handle the per-STA transmit opportunity; > * they (and I followed convention after testing) "They" had probably not made proper sacrifices to the layer 3 congestion control gods, nor envisioned a world with 10s of stations on a given ap and dozens of competing APs.... > found the "best" > compromise was to hardware queue up to two frames, which we could > probably do slightly more of at higher MCS rates for "reasons", but if > we're getting enough packets come in then if the hardware queues get > drained slower than we can fill them, we naturally aggregate traffic. > > So it actually works pretty well in practice. Aside from seconds of queuing on top. :) Is everybody here on board with reducing that by 2 orders of magnitude? I'm not posting all these results and all the flent data just to amuse myself... The size of the potential patch set for softmac devices has declined considerably - codel.h and the fq code are now generalized in some tree or another, and what's left is in two competing patches under test... one that leverages rate control stats and wins like crazy, the other, dql, and takes longer to win like crazy. http://blog.cerowrt.org/post/ has the writeups https://github.com/dtaht/blog-cerowrt has all the data and the writeups still in draft form, in git, for your own bemusement and data comparisons with the stock drivers. > The general aim is to > keep up to ~8ms of aggregates queued, and that's typically two > aggregate frames so we don't bust the block-ack window. My understanding was that hardware retry exists for the ath9k at least, and that block-acks responded in under 10us. Also that ath9k allowed you to describe and send up to 4 chains at different rates. Yes? As for "busting the window", what I wanted to try was adding in QoS Noack on frames that you feel could be safely dropped, so the first part of the txop would have an AMPDU of stuff you felt strongly about keeping, the second, one not block acknowledged, and a third consisting of the last bits of the flows you didn't care about as much, but want to provide packets for to inform the other side that drops happened, block acked.... As for software retry, we could be smarter about it than we currently are. A fixed number of retries (15 in the ath10k driver, 10 in the ath9k driver) is just nuts... As for 8ms, well, I'd much rather hand out 1ms each to 8 stations than 8ms each to 8 stations. > > > -adrian -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 4:04 ` Dave Taht @ 2016-05-10 4:22 ` Aaron Wood 2016-05-10 7:15 ` Adrian Chadd 1 sibling, 0 replies; 25+ messages in thread From: Aaron Wood @ 2016-05-10 4:22 UTC (permalink / raw) To: Dave Taht; +Cc: Adrian Chadd, ath9k-devel, make-wifi-fast [-- Attachment #1: Type: text/plain, Size: 1047 bytes --] On Mon, May 9, 2016 at 9:04 PM, Dave Taht <dave.taht@gmail.com> wrote: > > Is everybody here on board with reducing that by 2 orders of > magnitude? Yes! > I'm not posting all these results and all the flent data > just to amuse myself... The size of the potential patch set for > softmac devices has declined considerably - codel.h and the fq code > are now generalized in some tree or another, and what's left is in two > competing patches under test... one that leverages rate control stats > and wins like crazy, the other, dql, and takes longer to win like > crazy. > I'm really hoping to be able to use this. Wifi is the last buffer-bloated part of my home network (the latency differences between wired and wireless are an order of magnitude apart). And as for multiple STAs per AP: That's the norm, these days. My house might be a bit extreme, but we have 8 STAs in 5GHz, and one lowly printer in 2.4GHz (which is too far for 5GHz to get to, and I haven't run a cable to that end of the house yet as I hate crawlspaces). -Aaron [-- Attachment #2: Type: text/html, Size: 1609 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 4:04 ` Dave Taht 2016-05-10 4:22 ` Aaron Wood @ 2016-05-10 7:15 ` Adrian Chadd 2016-05-10 7:17 ` Adrian Chadd 1 sibling, 1 reply; 25+ messages in thread From: Adrian Chadd @ 2016-05-10 7:15 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen Well, there shouldn't /also/ be a software queue behind each TXQ at that point. Eg, in FreeBSD, I queue up to 64 frames per station and then default to round-robining between stations when it's time to form another aggregate. It's done that way so I or someone else can implement a wifi queue discipline in between the per-station / per-TID queues and the hardware queue that knows about time of flight, etc. The variations on the internal driver tended to slide some more complicated queue management and rate control between the bit that dequeued from the per-TID/per-STA packet queue and formed aggregates. Ie, the aggregate was only formed at hardware queue time, and only two were pushed into the hardware at once. There were only deep hardware queues in very old, pre-11n drivers, to minimise CPU overhead. So yeah, do reduce that a bit. The hardware queue should be two frames; there shouldn't be anything that needs to be queued up behind it in the software queue backing it if you're aggregating, and if you /are/ queueing there, that queue should be bounded based on flight time (eg lots of little frames, or one big aggregate, but not lots of big aggregates.) Yeah, I also have to add NOACK support for A-MPDU in the FreeBSD driver for various reasons (voice, yes, but also multicast A-MPDU.) You just need to ensure that you slide along the BAW in a way that flushes the sender, or you may drop some frames that are still inside the BAW, and then the receiver buffers for up to its timeout value (typically tens of milliseconds) waiting for more frames in the BAW (hopefully retries). I dunno if you're really allowed to be sending NOACK data frames if you've negotiated immediate blockack though! And yeah, for time? Totally depends on what you're doing. Yes, if you have lots of stations actively doing traffic, then you should just form smaller aggregates. It's especially good for dealing with the full frame retries (ie, RTS/CTS worked, but the data frame didn't, and you didn't get a block-ack at all.) On longer aggregates, that can /really/ hurt - ie, you're likely better off doing one full retry only and then failing it so you can requeue it in software and break it up into a smaller set of aggregates after the rate control code gets the update. (God, I should really do this all to FreeBSD now that I'm kinda allowed to again..) -adrian ^ permalink raw reply [flat|nested] 25+ messages in thread
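To make the queuing structure Adrian describes concrete, here is a toy sketch of per-station queues with a 64-frame cap and round-robin selection when it's time to form the next aggregate. It only illustrates the shape of the thing; the names, the per-aggregate cap, and the drop-on-overflow behaviour are assumptions, not FreeBSD's actual code:

```
from collections import deque

PER_STA_LIMIT = 64      # frames queued per station, per the description above
AGG_LIMIT = 32          # illustrative cap on MPDUs pulled into one aggregate

stations = {}           # station id -> deque of frames
active = deque()        # round-robin order of stations with queued frames

def enqueue(sta, frame):
    q = stations.setdefault(sta, deque())
    if len(q) >= PER_STA_LIMIT:
        return False            # drop (or push back to the stack above)
    if not q:
        active.append(sta)      # station becomes schedulable
    q.append(frame)
    return True

def form_aggregate():
    """Pick the next station round-robin and build one aggregate for it."""
    while active:
        sta = active.popleft()
        q = stations[sta]
        agg = [q.popleft() for _ in range(min(AGG_LIMIT, len(q)))]
        if q:
            active.append(sta)  # still has traffic, goes to the back
        if agg:
            return sta, agg
    return None, []
```

A real driver would size the per-aggregate limit from airtime (rate and BAW) rather than a fixed frame count, which is exactly the "queue discipline that knows about time of flight" hook Adrian mentions wanting to slot in between the per-station queues and the hardware queue.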
* Re: [Make-wifi-fast] [ath9k-devel] Diagram of the ath9k TX path 2016-05-10 7:15 ` Adrian Chadd @ 2016-05-10 7:17 ` Adrian Chadd 0 siblings, 0 replies; 25+ messages in thread From: Adrian Chadd @ 2016-05-10 7:17 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel, Toke Høiland-Jørgensen The other hack that I've seen ath9k do is that it actually assigns sequence numbers and CCMP IVs at the point of enqueue-to-hardware, rather than enqueue-to-driver. I tried this in FreeBSD and dropped it for other reasons, mostly in favour of shallower (configurable!) per-station queue depths. That way you /could/ drop data frames in the driver/wifi stack per-STA/TID queues as part of queue discipline, way before they get sent out on the air. If you're modeling airtime on the fly then you could see that the queue for a given station is way too deep now that the rate to it has dropped, so don't bother keeping those frames around. Once you have assigned the seqno/IV then you're "committed", so to speak. -adrian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 2:59 ` Dave Taht 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd @ 2016-05-10 3:41 ` David Lang 2016-05-10 4:59 ` Dave Taht 2016-05-13 17:46 ` Bob McMahon 1 sibling, 2 replies; 25+ messages in thread From: David Lang @ 2016-05-10 3:41 UTC (permalink / raw) To: Dave Taht; +Cc: Jonathan Morton, make-wifi-fast, ath9k-devel@lists.ath9k.org [-- Attachment #1: Type: TEXT/PLAIN, Size: 2776 bytes --] On Mon, 9 May 2016, Dave Taht wrote: > On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> wrote: >> >>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>> >>> should we always wait a little bit to see if we can form an aggregate? >> >> I thought the consensus on this front was “no”, as long as we’re making the decision when we have an immediate transmit opportunity. > > I think it is more nuanced than how david lang has presented it. I have four reasons for arguing for no speculative delays. 1. airtime that isn't used can't be saved. 2. lower best-case latency 3. simpler code 4. clean, and gradual service degradation under load. the arguments against are: 5. throughput per ms of transmit time is better if aggregation happens than if it doesn't. 6. if you don't transmit, some other station may choose to before you would have finished. #2 is obvious, but with the caveat that anytime you transmit you may be delaying someone else. #1 and #6 are flip sides of each other. we want _someone_ to use the airtime, the question is who. #3 and #4 are closely related. If you follow my approach (transmit immediately if you can, aggregate when you have a queue), the code really has one mode (plus queuing). "If you have a Transmit Opportunity, transmit up to X packets from the queue", and it doesn't matter if it's only one packet. If you delay the first packet to give you a chance to aggregate it with others, you add in the complexity and overhead of timers (including cancelling timers, slippage in timers, etc) and you add a "first packet, start timers" mode to deal with. I grant you that the first approach will "saturate" the airtime at lower traffic levels, but at that point all the stations will start aggregating the minimum amount needed to keep the air saturated, while still minimizing latency. I then expect that application-related optimizations would then further complicate the second approach. there are just too many cases where small amounts of data have to be sent and other things serialize behind them. DNS lookup to find a domain, to then do a 3-way handshake, to then do a request to see if the <web something> library has been updated since last cached (repeat for several libraries), to then fetch the actual page content. All of these things up to the actual page content could be single packets that have to be sent (and responded to with a single packet), waiting for the prior one to complete. If you add a few ms to each of these, you can easily hit 100ms in added latency. Once you start to try and special-case these sorts of things, the code complexity multiplies. So I believe that the KISS approach ends up with a 'worse is better' situation. David Lang ^ permalink raw reply [flat|nested] 25+ messages in thread
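A minimal sketch of the single-mode behaviour David is arguing for: never delay speculatively, just queue, and whenever a transmit opportunity shows up drain up to some cap from the queue, so aggregation emerges only once a backlog already exists. The function names and the cap are illustrative, not from any driver:

```
from collections import deque

MAX_PER_TXOP = 32            # "up to X packets from the queue", illustrative

queue = deque()

def on_packet(pkt, have_txop):
    # No speculative delay: queue the packet, and if a transmit opportunity
    # is available right now, use it immediately (even for a single packet).
    queue.append(pkt)
    if have_txop:
        return drain_txop()
    return []

def drain_txop():
    # The one and only mode: when an opportunity arrives, send whatever has
    # accumulated, up to the cap.  Aggregation falls out naturally when a
    # backlog has built up; a lone packet just goes out alone.
    burst = []
    while queue and len(burst) < MAX_PER_TXOP:
        burst.append(queue.popleft())
    return burst
```

There is no timer, no "first packet, start timer" state, and no cancellation path, which is the code-simplicity point being made above.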
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang @ 2016-05-10 4:59 ` Dave Taht 2016-05-10 5:22 ` David Lang 2016-05-13 17:46 ` Bob McMahon 1 sibling, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-10 4:59 UTC (permalink / raw) To: David Lang Cc: Jonathan Morton, make-wifi-fast, ath9k-devel@lists.ath9k.org, Randell Jesup This is a very good overview, thank you. I'd like to take apart station behavior on wifi with a web application... as a straw man. On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > On Mon, 9 May 2016, Dave Taht wrote: > >> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >> wrote: >>> >>> >>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>> >>>> should we always wait a little bit to see if we can form an aggregate? >>> >>> >>> I thought the consensus on this front was “no”, as long as we’re making >>> the decision when we have an immediate transmit opportunity. >> >> >> I think it is more nuanced than how david lang has presented it. > > > I have four reasons for arguing for no speculative delays. > > 1. airtime that isn't used can't be saved. > > 2. lower best-case latency > > 3. simpler code > > 4. clean, and gradual service degredation under load. > > the arguments against are: > > 5. throughput per ms of transmit time is better if aggregation happens than > if it doesn't. > > 6. if you don't transmit, some other station may choose to before you would > have finished. > > #2 is obvious, but with the caviot that anytime you transmit you may be > delaying someone else. > > #1 and #6 are flip sides of each other. we want _someone_ to use the > airtime, the question is who. > > #3 and #4 are closely related. > > If you follow my approach (transmit immediately if you can, aggregate when > you have a queue), the code really has one mode (plus queuing). "If you have > a Transmit Oppertunity, transmit up to X packets from the queue", and it > doesn't matter if it's only one packet. > > If you delay the first packet to give you a chance to aggregate it with > others, you add in the complexity and overhead of timers (including > cancelling timers, slippage in timers, etc) and you add "first packet, start > timers" mode to deal with. > > I grant you that the first approach will "saturate" the airtime at lower > traffic levels, but at that point all the stations will start aggregating > the minimum amount needed to keep the air saturated, while still minimizing > latency. > > I then expect that application related optimizations would then further > complicate the second approach. there are just too many cases where small > amounts of data have to be sent and other things serialize behind them. > > DNS lookup to find a domain to then to a 3-way handshake to then do a > request to see if the <web something> library has been updated since last > cached (repeat for several libraries) to then fetch the actual page content. > All of these thing up to the actual page content could be single packets > that have to be sent (and responded to with a single packet), waiting for > the prior one to complete. If you add a few ms to each of these, you can > easily hit 100ms in added latency. Once you start to try and special cases > these sorts of things, the code complexity multiplies. Take web page parsing as an example. The first request is a dns lookup. 
The second request is an http get (which can include a few more round trips for negotiating SSL), the next is a flurry of page parsing that results in the internal web browser attempting to schedule its requests best and then sending out the relevant dns and tcp flows as best it can figure out, and then, typically, several seconds of data transfer across each set of flows. Page paint is bound by getting the critical portions of the resulting data parsed and laid out properly. Now, I'd really like that early phase to be optimized by APs by something more like SQF, where when a station appears and does a few packet exchanges that it gets priority over stations taking big flows on a more regular basis, so it more rapidly gets into flow balance with the other stations. (and then, for most use cases, like web, exits) the second phase, of actual transfer, is also bound by RTT. I have no idea how much thought wifi folk actually put into typical web transfer delays (20-80ms), but they are there... ... The idea of the wifi driver waiting a bit to form a better aggregate to fit into a txop ties into two slightly different timings and flow behaviors. If it is taking 10ms to get a txop in the first place, taking more time to assemble a good batch of packets to fit into "your" txop would be good. If it is taking 4ms to transfer your last txop, well, more packets may arrive for you in that interval, and feed into your existing flows to keep them going, if you defer feeding the hardware with them. Also, classic tcp acking goes out the window with competing acks at layer 2. I don't know if quic can do the equivalent of stretch acks... but one layer 3 ack, block acked by layer 2 in wifi, suffices... if you have a ton of tcp acks outstanding, block acking them all is expensive... > So I believe that the KISS approach ends up with a 'worse is better' > situation. Code is going to get more complex anyway, and there are other optimizations that could be made. One item I realized recently is that part of codel need not run on every packet in every flow for stuff destined to fit into a single txop. It is sufficient to see if it declared a drop on the first packet in a flow destined for a given txop. You can then mark that entire flow (in a txop) as droppable (QoSNoAck) within that txop (as it is within an RTT, and even losing all the packets there will only cause the rate to halve). > > David Lang -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
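A sketch of that last idea, purely to make it concrete; the dict-based packets, the codel_should_drop hook and the noack flag are stand-ins, not anything from an existing driver. The point is to run the codel decision once on the head packet of each flow going into a txop, and if it fires, mark that flow's packets in the txop as QoSNoAck instead of dropping them one by one:

```
def pack_txop(flows, codel_should_drop):
    """Build one txop's worth of packets.

    flows: list of per-flow packet lists (dicts) destined for this txop.
    codel_should_drop: AQM hook, consulted only on each flow's head packet.
    """
    txop = []
    for pkts in flows:
        if not pkts:
            continue
        # One codel decision per flow per txop; if it would have dropped,
        # send that flow's share unacknowledged instead.  Losing it all only
        # halves the flow's rate, since the whole txop fits within one RTT.
        noack = codel_should_drop(pkts[0])
        for p in pkts:
            p["noack"] = noack
            txop.append(p)
    return txop
```

(David's reply further down suggests not leaving the whole flow unprotected, to avoid one txop's worth of data turning into several if the rate drops before a retransmit.)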
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 4:59 ` Dave Taht @ 2016-05-10 5:22 ` David Lang 2016-05-10 9:04 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: David Lang @ 2016-05-10 5:22 UTC (permalink / raw) To: Dave Taht Cc: Jonathan Morton, make-wifi-fast, ath9k-devel@lists.ath9k.org, Randell Jesup [-- Attachment #1: Type: TEXT/PLAIN, Size: 8546 bytes --] On Mon, 9 May 2016, Dave Taht wrote: > This is a very good overview, thank you. I'd like to take apart > station behavior on wifi with a web application... as a straw man. > > On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >> On Mon, 9 May 2016, Dave Taht wrote: >> >>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>> wrote: >>>> >>>> >>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>> >>>>> should we always wait a little bit to see if we can form an aggregate? >>>> >>>> >>>> I thought the consensus on this front was “no”, as long as we’re making >>>> the decision when we have an immediate transmit opportunity. >>> >>> >>> I think it is more nuanced than how david lang has presented it. >> >> >> I have four reasons for arguing for no speculative delays. >> >> 1. airtime that isn't used can't be saved. >> >> 2. lower best-case latency >> >> 3. simpler code >> >> 4. clean, and gradual service degredation under load. >> >> the arguments against are: >> >> 5. throughput per ms of transmit time is better if aggregation happens than >> if it doesn't. >> >> 6. if you don't transmit, some other station may choose to before you would >> have finished. >> >> #2 is obvious, but with the caviot that anytime you transmit you may be >> delaying someone else. >> >> #1 and #6 are flip sides of each other. we want _someone_ to use the >> airtime, the question is who. >> >> #3 and #4 are closely related. >> >> If you follow my approach (transmit immediately if you can, aggregate when >> you have a queue), the code really has one mode (plus queuing). "If you have >> a Transmit Oppertunity, transmit up to X packets from the queue", and it >> doesn't matter if it's only one packet. >> >> If you delay the first packet to give you a chance to aggregate it with >> others, you add in the complexity and overhead of timers (including >> cancelling timers, slippage in timers, etc) and you add "first packet, start >> timers" mode to deal with. >> >> I grant you that the first approach will "saturate" the airtime at lower >> traffic levels, but at that point all the stations will start aggregating >> the minimum amount needed to keep the air saturated, while still minimizing >> latency. >> >> I then expect that application related optimizations would then further >> complicate the second approach. there are just too many cases where small >> amounts of data have to be sent and other things serialize behind them. >> >> DNS lookup to find a domain to then to a 3-way handshake to then do a >> request to see if the <web something> library has been updated since last >> cached (repeat for several libraries) to then fetch the actual page content. >> All of these thing up to the actual page content could be single packets >> that have to be sent (and responded to with a single packet), waiting for >> the prior one to complete. If you add a few ms to each of these, you can >> easily hit 100ms in added latency. Once you start to try and special cases >> these sorts of things, the code complexity multiplies. > > Take web page parsing as an example. 
The first request is a dns > lookup. The second request is an http get (which can include a few more > round trips for > negotiating SSL), the next is a flurry of page parsing that results in > the internal web browser attempting to schedule its requests best and > then sending out the relevant dns and tcp flows as best it can figure > out, and then, typically several seconds of data transfer across each > set of flows. Actually, I think that a lot (if not the majority) of these flows are actually short, because the libraries/stylesheets/images/etc are cached recently enough that they don't need to be fetched again. The browser just needs to check if the copy they have is still good or if it's been changed. > Page paint is bound by getting the critical portions of the resulting > data parsed and laid out properly. > > Now, I'd really like that early phase to be optimized by APs by > something more like SQF, where when a station appears and does a few > packet exchanges that it gets priority over stations taking big flows > on a more regular basis, so it more rapidly gets into flow balance > with the other stations. There are two parts to this process 1. the tactical (do you send the pending packet immediately, or do you delay it to see if you can save airtime with aggregation) 2. the strategic (once a queue of pending packets has built up, how do you pick which one to send) What you are talking about is the strategic part of it, where you assume that there is a queue of data to be sent, and picking which stuff to send first affects the performance. What I'm talking about is the tactical: before the queue has built, don't add time to the flow by delaying packets. Especially because in this case the odds are good that there is not going to be anything to aggregate with it. DNS udp packets aren't going to have anything else to aggregate with. 3-way handshake packets aren't going to have anything else to aggregate with (until and unless you are doing them while you have other stuff being transmitted, even parallel connections to different servers are likely to be spread out due to differences in network distance) http checks for cache validation are unlikely to have anything to aggregate with. The SSL handshake is a bit more complex, but there's not a lot of data moving in either direction at any step, and there are a lot of exchanges. With 'modern' AJAX sites, even after the entire page is rendered and the javascript starts running and fetching data you may have a page retrieve a lot of stuff, but with lazy coding, there are a lot of requests that retrieve very small amounts of data. Find some nasty sites (complexity wise) and do some sniffs on a nice, low-latency wired network and check the number of connections, and the sizes of all the packets (and their timing). Artificially add some horrid latency to the connection to exaggerate the influence of serialized steps and watch what happens. David Lang > (and then, for most use cases, like web, exits) > > the second phase, of actual transfer, is also bound by RTT. I have no > idea how much thought wifi folk actually put into typical web transfer > delays (20-80ms), > but they are there... > > ... > > The idea of the wifi driver waiting a bit to form a better aggregate > to fit into a txop ties into two slightly different timings and flow > behaviors. > > If it is taking 10ms to get a txop in the first place, taking more > time to assemble a good batch of packets to fit into "your" txop would > be good.
If you are not at a txop, all you can do is queue, so you queue. And when you get a txop, you send as much as you can (up to the configured max); no disagreement there. If the txop is a predictable minimum distance away (because you know that another station just started transmitting and will take 10ms), then you can spend more time being fancy about what you send and how you pack it. > > If it is taking 4ms to transfer your last txop, well, more packets may arrive for you in that interval, and feed into your existing flows to keep them going, > if you defer feeding the hardware with them. Yes, this strategy ideally is happening as close to the hardware as possible. > Also, classic tcp acking goes out the window with competing acks at layer 2. > > I don't know if quic can do the equivalent of stretch acks... > > but one layer 3 ack, block acked by layer 2 in wifi, suffices... if > you have a ton of tcp acks outstanding, block acking them all is > expensive... yes. >> So I believe that the KISS approach ends up with a 'worse is better' >> situation. > > Code is going to get more complex anyway, and there are other > optimizations that could be made. all the more reason to have that complexity on top of a simpler core :-) > One item I realized recently is that part of codel need not run on > every packet in every flow for stuff destined to fit into a single > txop. It is sufficient to see if it declared a drop on the first > packet in a flow destined for a given txop. > > You can then mark that entire flow (in a txop) as droppable (QoSNoAck) > within that txop (as it is within an RTT, and even losing all the > packets there will only cause the rate to halve). I would try to not drop all of them, in case the bitrate drops before you re-send (try to avoid having one txop worth of data become several). David Lang ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 5:22 ` David Lang @ 2016-05-10 9:04 ` Toke Høiland-Jørgensen 2016-05-11 14:12 ` Dave Täht 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2016-05-10 9:04 UTC (permalink / raw) To: David Lang Cc: Dave Taht, ath9k-devel@lists.ath9k.org, Randell Jesup, make-wifi-fast David Lang <david@lang.hm> writes: > There are two parts to this process > > 1. the tactical (do you send the pending packet immediately, or do you > delay it to see if you can save airtime with aggregation) A colleague of mine looked into this some time ago as part of his PhD thesis. This was pre-802.11n, so they were doing experiments on adding aggregation to 802.11g by messing with the MTU. What they found was that they actually got better results from just sending data when they had it rather than waiting to see if more showed up. -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 9:04 ` Toke Høiland-Jørgensen @ 2016-05-11 14:12 ` Dave Täht 2016-05-11 15:09 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Dave Täht @ 2016-05-11 14:12 UTC (permalink / raw) To: make-wifi-fast On 5/10/16 2:04 AM, Toke Høiland-Jørgensen wrote: > David Lang <david@lang.hm> writes: > >> There are two parts to this process >> >> 1. the tactical (do you send the pending packet immediately, or do you >> delay it to see if you can save airtime with aggregation) > > A colleague of mine looked into this some time ago as part of his PhD > thesis. This was pre-801.11n, so they were doing experiments on adding > aggregation to 802.11g by messing with the MTU. What they found was that > they actually got better results from just sending data when they had it > rather than waiting to see if more showed up. cat 802.11g_research/* > /dev/null > > -Toke > _______________________________________________ > Make-wifi-fast mailing list > Make-wifi-fast@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/make-wifi-fast > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-11 14:12 ` Dave Täht @ 2016-05-11 15:09 ` Dave Taht 2016-05-11 15:20 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-11 15:09 UTC (permalink / raw) To: Dave Täht; +Cc: make-wifi-fast On Wed, May 11, 2016 at 7:12 AM, Dave Täht <dave@taht.net> wrote: > > > On 5/10/16 2:04 AM, Toke Høiland-Jørgensen wrote: >> David Lang <david@lang.hm> writes: >> >>> There are two parts to this process >>> >>> 1. the tactical (do you send the pending packet immediately, or do you >>> delay it to see if you can save airtime with aggregation) >> >> A colleague of mine looked into this some time ago as part of his PhD >> thesis. This was pre-801.11n, so they were doing experiments on adding >> aggregation to 802.11g by messing with the MTU. What they found was that >> they actually got better results from just sending data when they had it >> rather than waiting to see if more showed up. > > cat 802.11g_research/* > /dev/null Sorry, that was overly pithy (precoffee here). What I'd meant was that 802.11e's assumptions as to how to do scheduling, particularly for QoS, and how 802.11g behaved in comparison to n and later, do not give you much of a starting point on how to address things now. Successive standards and implementations have made certain things much better and other things much worse. Adopting 802.11e style QoS framing - good idea. Adopting 802.11e style hw queue scheduling (EDCA) by mapping diffserv queues blindly to those hw queues - horrific, without also attempting to meet the service time requirements in the software queue management. I have been perpetually demonstrating 200+ms VO queues since day one, starving the other queues, where what you want is a very short queue (under 10ms), served sparsely (say, no more than 2 or 4/10ths of the overall airtime) - and only "The right stuff" to map into it. CS6/CS7 do not belong in VO. 802.11n - It became saner to just aggregate in most cases where you might have used 802.11e qos, steering flows into the next packed txop. And as noted elsewhere[1], per-station queuing so you can, indeed, aggregate sanely. Packing on all the new management frames and crypto without sanely managing those, not so good an idea. Holding multicast to its 1998 rate... grump. 802.11ac - adopting the better framing universally for all hw queues - good. Still blindly exposing those queues to userspace - horrible. Hiding most of the rate control and retry information (as all firmware seem to do thus far), tying our hands behind our backs. Using up all the channels in a world with an ever increasing density of aps and stations and trying to manage the allocations in hardware, scary. Four color theorem.... Cramming up to 4MBytes into a single TXOP - what a great lab result! I have no idea how to do that, and am pretty sure even trying is undesirable. ... I'd like to (re)start with 802.11ac assumptions and work backwards, rather than 802.11g assumptions and work forwards. The most desirable thing I'd love to see is hardware capable of turning the tail end of a txop around and sending some real data back, and to know if QosNoack can be selectively used..... And on my bad days, I really would like to go back to playing with 5mhz channels (which the ath9k still supports), and getting channel selection to work better. I'd rather have 5mbits of reliable low latency bandwidth in the real world, than 500Mbits in a faraday cage. /me goes in search of some more coffee.
[1] I really wanted people to argue with me about this talk one day... http://blog.cerowrt.org/post/talks/make-wifi-fast/ -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
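For the "served sparsely" point about the VO queue above, here is a toy sketch of gating a priority queue on its share of recent airtime; the fraction, the accounting window, and the names are illustrative assumptions only, not from any existing queue manager:

```
# Toy sketch: only let the priority (VO-like) queue win the next transmit
# opportunity while its share of recently accounted airtime stays under a
# configured fraction.
VO_MAX_SHARE = 0.2          # e.g. "no more than 2/10ths of the airtime"
WINDOW_US = 100_000         # crude accounting window

class AirtimeGate:
    def __init__(self):
        self.vo_us = 0
        self.total_us = 0

    def vo_may_send(self):
        if self.total_us == 0:
            return True
        return self.vo_us / self.total_us < VO_MAX_SHARE

    def account(self, is_vo, airtime_us):
        if is_vo:
            self.vo_us += airtime_us
        self.total_us += airtime_us
        if self.total_us >= WINDOW_US:   # reset so old history ages out
            self.vo_us = 0
            self.total_us = 0
```

A real implementation would also want the short (sub-10ms) queue depth and the DSCP-to-VO mapping restrictions mentioned above, not just the airtime share.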
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-11 15:09 ` Dave Taht @ 2016-05-11 15:20 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2016-05-11 15:20 UTC (permalink / raw) To: Dave Taht; +Cc: make-wifi-fast Dave Taht <dave.taht@gmail.com> writes: > On Wed, May 11, 2016 at 7:12 AM, Dave Täht <dave@taht.net> wrote: >> >> >> On 5/10/16 2:04 AM, Toke Høiland-Jørgensen wrote: >>> David Lang <david@lang.hm> writes: >>> >>>> There are two parts to this process >>>> >>>> 1. the tactical (do you send the pending packet immediately, or do you >>>> delay it to see if you can save airtime with aggregation) >>> >>> A colleague of mine looked into this some time ago as part of his PhD >>> thesis. This was pre-801.11n, so they were doing experiments on adding >>> aggregation to 802.11g by messing with the MTU. What they found was that >>> they actually got better results from just sending data when they had it >>> rather than waiting to see if more showed up. >> >> cat 802.11g_research/* > /dev/null > > Sorry, that was overly pithy (precoffee here). What I'd meant was that > 802.11e's assumptions as to how to do scheduling, particularly for > QoS, and how 802.11g behaved in comparison to n and later, does not > give you much of a starting point on how to address things now. > Successive standards and implementations have made certain things much > better and other things much worse. 802.11e != 802.11g. I am completely ignoring 802.11e for now, and the rest of the standard is not that different between g/n/ac; it's basically different rates, encodings and interframe gaps. Which from a scheduling point of view simply means different constants. Once we have a working baseline scheduler we can think about how to behave sanely in an 802.11e world; as far as I'm concerned the right answer might just be "shut it off entirely"... :P -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 2016-05-10 4:59 ` Dave Taht @ 2016-05-13 17:46 ` Bob McMahon 2016-05-13 17:49 ` Dave Taht 2016-05-13 20:49 ` David Lang 1 sibling, 2 replies; 25+ messages in thread From: Bob McMahon @ 2016-05-13 17:46 UTC (permalink / raw) To: David Lang; +Cc: Dave Taht, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 3952 bytes --] On driver delays, from a driver development perspective the problem isn't to add delay or not (it shouldn't) it's that the TCP stack isn't presenting sufficient data to fully utilize aggregation. Below is a histogram comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest latency stack is in purple and it's also the worst performance with respect to average throughput. From a driver perspective, one would like TCP to present sufficient bytes into the pipe that the histogram leans toward the blue. [image: Inline image 1] I'm not an expert on TCP near congestion avoidance but maybe the algorithm could benefit from RTT as weighted by CWND (or bytes in flight) and hunt that maximum? Bob On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > On Mon, 9 May 2016, Dave Taht wrote: > > On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >> wrote: >> >>> >>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>> >>>> should we always wait a little bit to see if we can form an aggregate? >>>> >>> >>> I thought the consensus on this front was “no”, as long as we’re making >>> the decision when we have an immediate transmit opportunity. >>> >> >> I think it is more nuanced than how david lang has presented it. >> > > I have four reasons for arguing for no speculative delays. > > 1. airtime that isn't used can't be saved. > > 2. lower best-case latency > > 3. simpler code > > 4. clean, and gradual service degredation under load. > > the arguments against are: > > 5. throughput per ms of transmit time is better if aggregation happens > than if it doesn't. > > 6. if you don't transmit, some other station may choose to before you > would have finished. > > #2 is obvious, but with the caviot that anytime you transmit you may be > delaying someone else. > > #1 and #6 are flip sides of each other. we want _someone_ to use the > airtime, the question is who. > > #3 and #4 are closely related. > > If you follow my approach (transmit immediately if you can, aggregate when > you have a queue), the code really has one mode (plus queuing). "If you > have a Transmit Oppertunity, transmit up to X packets from the queue", and > it doesn't matter if it's only one packet. > > If you delay the first packet to give you a chance to aggregate it with > others, you add in the complexity and overhead of timers (including > cancelling timers, slippage in timers, etc) and you add "first packet, > start timers" mode to deal with. > > I grant you that the first approach will "saturate" the airtime at lower > traffic levels, but at that point all the stations will start aggregating > the minimum amount needed to keep the air saturated, while still minimizing > latency. > > I then expect that application related optimizations would then further > complicate the second approach. there are just too many cases where small > amounts of data have to be sent and other things serialize behind them. 
> > DNS lookup to find a domain to then to a 3-way handshake to then do a > request to see if the <web something> library has been updated since last > cached (repeat for several libraries) to then fetch the actual page > content. All of these thing up to the actual page content could be single > packets that have to be sent (and responded to with a single packet), > waiting for the prior one to complete. If you add a few ms to each of > these, you can easily hit 100ms in added latency. Once you start to try and > special cases these sorts of things, the code complexity multiplies. > > So I believe that the KISS approach ends up with a 'worse is better' > situation. > > David Lang > _______________________________________________ > Make-wifi-fast mailing list > Make-wifi-fast@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/make-wifi-fast > > [-- Attachment #1.2: Type: text/html, Size: 5177 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 17:46 ` Bob McMahon @ 2016-05-13 17:49 ` Dave Taht 2016-05-13 18:05 ` Bob McMahon 2016-05-13 20:49 ` David Lang 1 sibling, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-13 17:49 UTC (permalink / raw) To: Bob McMahon; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 4497 bytes --] I try to stress that single tcp flows should never use all the bandwidth for the sawtooth to function properly. What happens when you hit it with 4 flows? or 12? nice graph, but I don't understand the single blue spikes? On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> wrote: > On driver delays, from a driver development perspective the problem isn't > to add delay or not (it shouldn't) it's that the TCP stack isn't presenting > sufficient data to fully utilize aggregation. Below is a histogram > comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest > latency stack is in purple and it's also the worst performance with respect > to average throughput. From a driver perspective, one would like TCP to > present sufficient bytes into the pipe that the histogram leans toward the > blue. > > [image: Inline image 1] > I'm not an expert on TCP near congestion avoidance but maybe the algorithm > could benefit from RTT as weighted by CWND (or bytes in flight) and hunt > that maximum? > > Bob > > On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > >> On Mon, 9 May 2016, Dave Taht wrote: >> >> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>> wrote: >>> >>>> >>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>> >>>>> should we always wait a little bit to see if we can form an aggregate? >>>>> >>>> >>>> I thought the consensus on this front was “no”, as long as we’re making >>>> the decision when we have an immediate transmit opportunity. >>>> >>> >>> I think it is more nuanced than how david lang has presented it. >>> >> >> I have four reasons for arguing for no speculative delays. >> >> 1. airtime that isn't used can't be saved. >> >> 2. lower best-case latency >> >> 3. simpler code >> >> 4. clean, and gradual service degredation under load. >> >> the arguments against are: >> >> 5. throughput per ms of transmit time is better if aggregation happens >> than if it doesn't. >> >> 6. if you don't transmit, some other station may choose to before you >> would have finished. >> >> #2 is obvious, but with the caviot that anytime you transmit you may be >> delaying someone else. >> >> #1 and #6 are flip sides of each other. we want _someone_ to use the >> airtime, the question is who. >> >> #3 and #4 are closely related. >> >> If you follow my approach (transmit immediately if you can, aggregate >> when you have a queue), the code really has one mode (plus queuing). "If >> you have a Transmit Oppertunity, transmit up to X packets from the queue", >> and it doesn't matter if it's only one packet. >> >> If you delay the first packet to give you a chance to aggregate it with >> others, you add in the complexity and overhead of timers (including >> cancelling timers, slippage in timers, etc) and you add "first packet, >> start timers" mode to deal with. >> >> I grant you that the first approach will "saturate" the airtime at lower >> traffic levels, but at that point all the stations will start aggregating >> the minimum amount needed to keep the air saturated, while still minimizing >> latency. 
>> >> I then expect that application related optimizations would then further >> complicate the second approach. there are just too many cases where small >> amounts of data have to be sent and other things serialize behind them. >> >> DNS lookup to find a domain to then to a 3-way handshake to then do a >> request to see if the <web something> library has been updated since last >> cached (repeat for several libraries) to then fetch the actual page >> content. All of these thing up to the actual page content could be single >> packets that have to be sent (and responded to with a single packet), >> waiting for the prior one to complete. If you add a few ms to each of >> these, you can easily hit 100ms in added latency. Once you start to try and >> special cases these sorts of things, the code complexity multiplies. >> >> So I believe that the KISS approach ends up with a 'worse is better' >> situation. >> >> David Lang >> _______________________________________________ >> Make-wifi-fast mailing list >> Make-wifi-fast@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/make-wifi-fast >> >> > -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org [-- Attachment #1.2: Type: text/html, Size: 6249 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 17:49 ` Dave Taht @ 2016-05-13 18:05 ` Bob McMahon 2016-05-13 18:11 ` Bob McMahon 2016-05-13 18:57 ` Dave Taht 0 siblings, 2 replies; 25+ messages in thread From: Bob McMahon @ 2016-05-13 18:05 UTC (permalink / raw) To: Dave Taht; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 5279 bytes --] The graphs are histograms of mpdu/ampdu, from 1 to 64. The blue spikes show that the vast majority of traffic is filling an ampdu with 64 mpdus. The fill stop reason is ampdu full. The purple fill stop reasons are that the sw fifo (above the driver) went empty indicating a too small CWND for maximum aggregation. A driver wants to aggregate to the fullest extent possible. A work around is to set initcwnd in the router table. I don't have the data available for multiple flows at the moment. Note: That will depend on what exactly defines a flow. Bob On Fri, May 13, 2016 at 10:49 AM, Dave Taht <dave.taht@gmail.com> wrote: > I try to stress that single tcp flows should never use all the bandwidth > for the sawtooth to function properly. > > What happens when you hit it with 4 flows? or 12? > > nice graph, but I don't understand the single blue spikes? > > On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> > wrote: > >> On driver delays, from a driver development perspective the problem isn't >> to add delay or not (it shouldn't) it's that the TCP stack isn't presenting >> sufficient data to fully utilize aggregation. Below is a histogram >> comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest >> latency stack is in purple and it's also the worst performance with respect >> to average throughput. From a driver perspective, one would like TCP to >> present sufficient bytes into the pipe that the histogram leans toward the >> blue. >> >> [image: Inline image 1] >> I'm not an expert on TCP near congestion avoidance but maybe the >> algorithm could benefit from RTT as weighted by CWND (or bytes in flight) >> and hunt that maximum? >> >> Bob >> >> On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >> >>> On Mon, 9 May 2016, Dave Taht wrote: >>> >>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>>> wrote: >>>> >>>>> >>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>>> >>>>>> should we always wait a little bit to see if we can form an aggregate? >>>>>> >>>>> >>>>> I thought the consensus on this front was “no”, as long as we’re >>>>> making the decision when we have an immediate transmit opportunity. >>>>> >>>> >>>> I think it is more nuanced than how david lang has presented it. >>>> >>> >>> I have four reasons for arguing for no speculative delays. >>> >>> 1. airtime that isn't used can't be saved. >>> >>> 2. lower best-case latency >>> >>> 3. simpler code >>> >>> 4. clean, and gradual service degredation under load. >>> >>> the arguments against are: >>> >>> 5. throughput per ms of transmit time is better if aggregation happens >>> than if it doesn't. >>> >>> 6. if you don't transmit, some other station may choose to before you >>> would have finished. >>> >>> #2 is obvious, but with the caviot that anytime you transmit you may be >>> delaying someone else. >>> >>> #1 and #6 are flip sides of each other. we want _someone_ to use the >>> airtime, the question is who. >>> >>> #3 and #4 are closely related. 
>>> >>> If you follow my approach (transmit immediately if you can, aggregate >>> when you have a queue), the code really has one mode (plus queuing). "If >>> you have a Transmit Oppertunity, transmit up to X packets from the queue", >>> and it doesn't matter if it's only one packet. >>> >>> If you delay the first packet to give you a chance to aggregate it with >>> others, you add in the complexity and overhead of timers (including >>> cancelling timers, slippage in timers, etc) and you add "first packet, >>> start timers" mode to deal with. >>> >>> I grant you that the first approach will "saturate" the airtime at lower >>> traffic levels, but at that point all the stations will start aggregating >>> the minimum amount needed to keep the air saturated, while still minimizing >>> latency. >>> >>> I then expect that application related optimizations would then further >>> complicate the second approach. there are just too many cases where small >>> amounts of data have to be sent and other things serialize behind them. >>> >>> DNS lookup to find a domain to then to a 3-way handshake to then do a >>> request to see if the <web something> library has been updated since last >>> cached (repeat for several libraries) to then fetch the actual page >>> content. All of these thing up to the actual page content could be single >>> packets that have to be sent (and responded to with a single packet), >>> waiting for the prior one to complete. If you add a few ms to each of >>> these, you can easily hit 100ms in added latency. Once you start to try and >>> special cases these sorts of things, the code complexity multiplies. >>> >>> So I believe that the KISS approach ends up with a 'worse is better' >>> situation. >>> >>> David Lang >>> _______________________________________________ >>> Make-wifi-fast mailing list >>> Make-wifi-fast@lists.bufferbloat.net >>> https://lists.bufferbloat.net/listinfo/make-wifi-fast >>> >>> >> > > > -- > Dave Täht > Let's go make home routers and wifi faster! With better software! > http://blog.cerowrt.org > [-- Attachment #1.2: Type: text/html, Size: 7231 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
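Rough arithmetic behind Bob's point that CWND, not the driver, is what is starving the aggregator: keeping a single 64-MPDU A-MPDU full needs on the order of 96 KB offered to the driver at once, far more than a freshly started TCP flow with a default initial window hands down. The MPDU size and MSS below are illustrative assumptions:

```
# How much data must be available to fill one 64-MPDU A-MPDU, versus what a
# new TCP flow offers.  All constants are illustrative.
MPDU_BYTES = 1500
BAW = 64
MSS = 1448
INITCWND = 10

ampdu_bytes = MPDU_BYTES * BAW                 # ~94 KB per full aggregate
initcwnd_bytes = MSS * INITCWND                # ~14 KB offered by a new flow

print(f"full A-MPDU: {ampdu_bytes / 1024:.0f} KB, "
      f"initcwnd {INITCWND}: {initcwnd_bytes / 1024:.0f} KB, "
      f"ratio ~{ampdu_bytes / initcwnd_bytes:.0f}x")
```

That roughly 7x gap is why a low-RTT stack that keeps CWND small shows up as the purple "sw fifo went empty" case in the histogram, and why raising initcwnd papers over it (and why Dave objects to doing so below).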
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 18:05 ` Bob McMahon @ 2016-05-13 18:11 ` Bob McMahon 2016-05-13 18:57 ` Dave Taht 1 sibling, 0 replies; 25+ messages in thread From: Bob McMahon @ 2016-05-13 18:11 UTC (permalink / raw) To: Dave Taht; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 5753 bytes --] Also, I haven't done it but I don't think rate limiting TCP will solve this aggregation "problem." The faster RTT is driving CWND much below the maximum aggregation, i.e. CWND is too small relative to wi-fi aggregation. Bob On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> wrote: > The graphs are histograms of mpdu/ampdu, from 1 to 64. The blue spikes > show that the vast majority of traffic is filling an ampdu with 64 mpdus. > The fill stop reason is ampdu full. The purple fill stop reasons are that > the sw fifo (above the driver) went empty indicating a too small CWND for > maximum aggregation. A driver wants to aggregate to the fullest extent > possible. A work around is to set initcwnd in the router table. > > I don't have the data available for multiple flows at the moment. Note: > That will depend on what exactly defines a flow. > > Bob > > On Fri, May 13, 2016 at 10:49 AM, Dave Taht <dave.taht@gmail.com> wrote: > >> I try to stress that single tcp flows should never use all the bandwidth >> for the sawtooth to function properly. >> >> What happens when you hit it with 4 flows? or 12? >> >> nice graph, but I don't understand the single blue spikes? >> >> On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> >> wrote: >> >>> On driver delays, from a driver development perspective the problem >>> isn't to add delay or not (it shouldn't) it's that the TCP stack isn't >>> presenting sufficient data to fully utilize aggregation. Below is a >>> histogram comparing aggregations of 3 systems (units are mpdu per ampdu.) >>> The lowest latency stack is in purple and it's also the worst performance >>> with respect to average throughput. From a driver perspective, one would >>> like TCP to present sufficient bytes into the pipe that the histogram leans >>> toward the blue. >>> >>> [image: Inline image 1] >>> I'm not an expert on TCP near congestion avoidance but maybe the >>> algorithm could benefit from RTT as weighted by CWND (or bytes in flight) >>> and hunt that maximum? >>> >>> Bob >>> >>> On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >>> >>>> On Mon, 9 May 2016, Dave Taht wrote: >>>> >>>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>>>> wrote: >>>>> >>>>>> >>>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>>>> >>>>>>> should we always wait a little bit to see if we can form an >>>>>>> aggregate? >>>>>>> >>>>>> >>>>>> I thought the consensus on this front was “no”, as long as we’re >>>>>> making the decision when we have an immediate transmit opportunity. >>>>>> >>>>> >>>>> I think it is more nuanced than how david lang has presented it. >>>>> >>>> >>>> I have four reasons for arguing for no speculative delays. >>>> >>>> 1. airtime that isn't used can't be saved. >>>> >>>> 2. lower best-case latency >>>> >>>> 3. simpler code >>>> >>>> 4. clean, and gradual service degredation under load. >>>> >>>> the arguments against are: >>>> >>>> 5. throughput per ms of transmit time is better if aggregation happens >>>> than if it doesn't. >>>> >>>> 6. 
if you don't transmit, some other station may choose to before you >>>> would have finished. >>>> >>>> #2 is obvious, but with the caviot that anytime you transmit you may be >>>> delaying someone else. >>>> >>>> #1 and #6 are flip sides of each other. we want _someone_ to use the >>>> airtime, the question is who. >>>> >>>> #3 and #4 are closely related. >>>> >>>> If you follow my approach (transmit immediately if you can, aggregate >>>> when you have a queue), the code really has one mode (plus queuing). "If >>>> you have a Transmit Oppertunity, transmit up to X packets from the queue", >>>> and it doesn't matter if it's only one packet. >>>> >>>> If you delay the first packet to give you a chance to aggregate it with >>>> others, you add in the complexity and overhead of timers (including >>>> cancelling timers, slippage in timers, etc) and you add "first packet, >>>> start timers" mode to deal with. >>>> >>>> I grant you that the first approach will "saturate" the airtime at >>>> lower traffic levels, but at that point all the stations will start >>>> aggregating the minimum amount needed to keep the air saturated, while >>>> still minimizing latency. >>>> >>>> I then expect that application related optimizations would then further >>>> complicate the second approach. there are just too many cases where small >>>> amounts of data have to be sent and other things serialize behind them. >>>> >>>> DNS lookup to find a domain to then to a 3-way handshake to then do a >>>> request to see if the <web something> library has been updated since last >>>> cached (repeat for several libraries) to then fetch the actual page >>>> content. All of these thing up to the actual page content could be single >>>> packets that have to be sent (and responded to with a single packet), >>>> waiting for the prior one to complete. If you add a few ms to each of >>>> these, you can easily hit 100ms in added latency. Once you start to try and >>>> special cases these sorts of things, the code complexity multiplies. >>>> >>>> So I believe that the KISS approach ends up with a 'worse is better' >>>> situation. >>>> >>>> David Lang >>>> _______________________________________________ >>>> Make-wifi-fast mailing list >>>> Make-wifi-fast@lists.bufferbloat.net >>>> https://lists.bufferbloat.net/listinfo/make-wifi-fast >>>> >>>> >>> >> >> >> -- >> Dave Täht >> Let's go make home routers and wifi faster! With better software! >> http://blog.cerowrt.org >> > > [-- Attachment #1.2: Type: text/html, Size: 7966 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
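Bob's observation that CWND sits well below what a full A-MPDU needs reduces to simple arithmetic. A back-of-the-envelope sketch, with the segment size, aggregate depth and window values all assumed for illustration (this is not measured data from the plots above):

```python
# Back-of-the-envelope numbers behind "CWND is too small relative to wi-fi
# aggregation".  All values are illustrative assumptions, not measurements.
MSS       = 1448        # TCP payload bytes per MPDU (1500-byte MTU)
MAX_MPDUS = 64          # fill target for one A-MPDU, as in the histograms

full_aggregate = MSS * MAX_MPDUS        # ~92 KB to fill one A-MPDU
print(f"one full A-MPDU needs ~{full_aggregate // 1024} KB queued for the station")

for cwnd in (10, 20, 64, 128):          # 10 is the Linux default initcwnd
    fill = min(1.0, cwnd / MAX_MPDUS)   # best case: every in-flight segment
    print(f"cwnd={cwnd:4d} segments -> at most {fill:4.0%} of an A-MPDU")
```

With the default initial window of 10, a single flow cannot cover even a sixth of a 64-MPDU aggregate until the window has grown, which is why raising initcwnd looks attractive from the driver's side, and why the next message objects to what such a burst does to slower links upstream.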
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 18:05 ` Bob McMahon 2016-05-13 18:11 ` Bob McMahon @ 2016-05-13 18:57 ` Dave Taht 2016-05-13 19:20 ` Aaron Wood 1 sibling, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-13 18:57 UTC (permalink / raw) To: Bob McMahon; +Cc: David Lang, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1.1: Type: text/plain, Size: 8802 bytes --] On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> wrote: > The graphs are histograms of mpdu/ampdu, from 1 to 64. The blue spikes > show that the vast majority of traffic is filling an ampdu with 64 mpdus. > The fill stop reason is ampdu full. The purple fill stop reasons are that > the sw fifo (above the driver) went empty indicating a too small CWND for > maximum aggregation. > Can I get you to drive this plot, wifi rate limited to 6,20,and 300mbits and "native", with flent's tcp_upload and rtt_fair_up tests? My goal is far different than getting a single tcp flow to max speed, it is to get it to close to full throughput with multiple flows while not accumulating 2 sec of buffering... http://blog.cerowrt.org/post/rtt_fair_on_wifi/ Or even 100ms of it: http://blog.cerowrt.org/post/ath10_ath9k_1/ Early experiments with getting a good rate estimate "to fill the queue" from rate control info was basically successful, but lacking rate control, using dql only, is currently taking much longer at higher rates, but works well at lower ones. http://blog.cerowrt.org/post/dql_on_wifi_2/ http://blog.cerowrt.org/post/dql_on_wifi > A driver wants to aggregate to the fullest extent possible. > While still retaining tcp congestion control. There are other nuances, like being nice about total airtime to others sharing the media, minimizing retries due to an overlarge ampdu for the current BER, etc. I don't remember what section of the 802.11-2012 standard this is from, but: ``` Another unresolved issue is how large a concatenation threshold the devices should set. Ideally, the maximum value is preferable but in a noisy environment, short frame lengths are preferred because of potential retransmissions. The A-MPDU concatenation scheme operates only over the packets that are already buffered in the transmission queue, and thus, if the CPR data rate is low, then efficiency also will be small. There are many ongoing studies on alternative queuing mechanisms different from the standard FIFO. *A combination of frame aggregation and an enhanced queuing algorithm could increase channel efficiency further*. ``` A work around is to set initcwnd in the router table. > Ugh. Um... no... initcwnd 10 is already too large for many networks. If you set your wifi initcwnd to something like 64, what happens to the 5mbit cable uplink just upstream from that? There are a couple other parameters that might be of use - tcp_send_lowat and tcp_limit_output_bytes. These were set off, and originally too low for wifi. A good setting for the latter, for ethernet, was about 4096. Then the wifi folk complained, and it got bumped to 64k, and I think now, it's at 256k to make the xen folk happier. These are all work arounds against the real problem which was not tuning driver queueing to the actual achievable ampdu, and doing fq+aqm to spread the load (essentially "pace" bursts) which is what is happening in michal's patches. > I don't have the data available for multiple flows at the moment. > The world is full of folk trying to make single tcp flows go at maximum speed, with multiple alternatives to cubic. 
This quest has resulted in the near elimination of the sawtooth along the edge and horrific overbuffering, to a net loss in speed, and a huge perception of "slowness". Note: I have long figured that a different tcp should be used on wifi uplinks, after we fixed a ton of basic mis-assumptions. As well as tcp's should become more wifi/wireless aware, but tweaking initcwnd, tcp_limit_output_bytes, etc, is not the right thing. There has been some good tcp research published of late, look into "BBR", and "CDG". > Note: That will depend on what exactly defines a flow. > > Bob > > On Fri, May 13, 2016 at 10:49 AM, Dave Taht <dave.taht@gmail.com> wrote: > >> I try to stress that single tcp flows should never use all the bandwidth >> for the sawtooth to function properly. >> >> What happens when you hit it with 4 flows? or 12? >> >> nice graph, but I don't understand the single blue spikes? >> >> On Fri, May 13, 2016 at 10:46 AM, Bob McMahon <bob.mcmahon@broadcom.com> >> wrote: >> >>> On driver delays, from a driver development perspective the problem >>> isn't to add delay or not (it shouldn't) it's that the TCP stack isn't >>> presenting sufficient data to fully utilize aggregation. Below is a >>> histogram comparing aggregations of 3 systems (units are mpdu per ampdu.) >>> The lowest latency stack is in purple and it's also the worst performance >>> with respect to average throughput. From a driver perspective, one would >>> like TCP to present sufficient bytes into the pipe that the histogram leans >>> toward the blue. >>> >>> [image: Inline image 1] >>> I'm not an expert on TCP near congestion avoidance but maybe the >>> algorithm could benefit from RTT as weighted by CWND (or bytes in flight) >>> and hunt that maximum? >>> >>> Bob >>> >>> On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: >>> >>>> On Mon, 9 May 2016, Dave Taht wrote: >>>> >>>> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>>>> wrote: >>>>> >>>>>> >>>>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>>>> >>>>>>> should we always wait a little bit to see if we can form an >>>>>>> aggregate? >>>>>>> >>>>>> >>>>>> I thought the consensus on this front was “no”, as long as we’re >>>>>> making the decision when we have an immediate transmit opportunity. >>>>>> >>>>> >>>>> I think it is more nuanced than how david lang has presented it. >>>>> >>>> >>>> I have four reasons for arguing for no speculative delays. >>>> >>>> 1. airtime that isn't used can't be saved. >>>> >>>> 2. lower best-case latency >>>> >>>> 3. simpler code >>>> >>>> 4. clean, and gradual service degredation under load. >>>> >>>> the arguments against are: >>>> >>>> 5. throughput per ms of transmit time is better if aggregation happens >>>> than if it doesn't. >>>> >>>> 6. if you don't transmit, some other station may choose to before you >>>> would have finished. >>>> >>>> #2 is obvious, but with the caviot that anytime you transmit you may be >>>> delaying someone else. >>>> >>>> #1 and #6 are flip sides of each other. we want _someone_ to use the >>>> airtime, the question is who. >>>> >>>> #3 and #4 are closely related. >>>> >>>> If you follow my approach (transmit immediately if you can, aggregate >>>> when you have a queue), the code really has one mode (plus queuing). "If >>>> you have a Transmit Oppertunity, transmit up to X packets from the queue", >>>> and it doesn't matter if it's only one packet. 
>>>> >>>> If you delay the first packet to give you a chance to aggregate it with >>>> others, you add in the complexity and overhead of timers (including >>>> cancelling timers, slippage in timers, etc) and you add "first packet, >>>> start timers" mode to deal with. >>>> >>>> I grant you that the first approach will "saturate" the airtime at >>>> lower traffic levels, but at that point all the stations will start >>>> aggregating the minimum amount needed to keep the air saturated, while >>>> still minimizing latency. >>>> >>>> I then expect that application related optimizations would then further >>>> complicate the second approach. there are just too many cases where small >>>> amounts of data have to be sent and other things serialize behind them. >>>> >>>> DNS lookup to find a domain to then to a 3-way handshake to then do a >>>> request to see if the <web something> library has been updated since last >>>> cached (repeat for several libraries) to then fetch the actual page >>>> content. All of these thing up to the actual page content could be single >>>> packets that have to be sent (and responded to with a single packet), >>>> waiting for the prior one to complete. If you add a few ms to each of >>>> these, you can easily hit 100ms in added latency. Once you start to try and >>>> special cases these sorts of things, the code complexity multiplies. >>>> >>>> So I believe that the KISS approach ends up with a 'worse is better' >>>> situation. >>>> >>>> David Lang >>>> _______________________________________________ >>>> Make-wifi-fast mailing list >>>> Make-wifi-fast@lists.bufferbloat.net >>>> https://lists.bufferbloat.net/listinfo/make-wifi-fast >>>> >>>> >>> >> >> >> -- >> Dave Täht >> Let's go make home routers and wifi faster! With better software! >> http://blog.cerowrt.org >> > > -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org [-- Attachment #1.2: Type: text/html, Size: 13096 bytes --] [-- Attachment #2: image.png --] [-- Type: image/png, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
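The "what happens to the 5mbit cable uplink" objection is also just arithmetic. A rough sketch, with the uplink rate, RTT and window sizes assumed for illustration:

```python
# Rough cost of a large initial-window burst landing on a slow uplink just
# upstream of the wifi hop.  Rates, RTT and window sizes are assumptions.
MSS        = 1448                    # bytes per segment
UPLINK_BPS = 5_000_000               # 5 Mbit/s cable uplink
RTT        = 0.05                    # 50 ms path RTT

bdp = UPLINK_BPS / 8 * RTT           # what the path itself can hold
print(f"path BDP ~= {bdp/1024:.0f} KB ~= {bdp/MSS:.0f} segments")

for initcwnd in (10, 64):
    burst_bytes = initcwnd * MSS
    drain_ms = burst_bytes * 8 / UPLINK_BPS * 1000
    print(f"initcwnd={initcwnd:3d}: {burst_bytes/1024:5.1f} KB burst, "
          f"~{drain_ms:5.1f} ms of queue at 5 Mbit/s")
```

An initial window sized to fill a wifi aggregate is already several times this path's entire BDP, which is the argument for fixing the queueing at the wifi hop (fq+aqm, pacing the bursts) rather than inflating windows end to end.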
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 18:57 ` Dave Taht @ 2016-05-13 19:20 ` Aaron Wood 2016-05-13 20:21 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Aaron Wood @ 2016-05-13 19:20 UTC (permalink / raw) To: Dave Taht; +Cc: Bob McMahon, make-wifi-fast, ath9k-devel@lists.ath9k.org [-- Attachment #1: Type: text/plain, Size: 698 bytes --] On Fri, May 13, 2016 at 11:57 AM, Dave Taht <dave.taht@gmail.com> wrote: > On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> > wrote: > >> don't have the data available for multiple flows at the moment. >> > > The world is full of folk trying to make single tcp flows go at maximum > speed, with multiple alternatives to cubic. > And most web traffic is multiple-flow, even with HTTP/2 and SPDY, due to domain/host sharding. Many bursts of multi-flow traffic. About the only thing that's single-flow is streaming-video (which isn't latency sensitive). The only local services that I know of that could use maximal-rate wifi are NAS systems using SMB, AFP, etc. -Aaron [-- Attachment #2: Type: text/html, Size: 1435 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 19:20 ` Aaron Wood @ 2016-05-13 20:21 ` Dave Taht 2016-05-13 20:51 ` Dave Taht 0 siblings, 1 reply; 25+ messages in thread From: Dave Taht @ 2016-05-13 20:21 UTC (permalink / raw) To: Aaron Wood; +Cc: Bob McMahon, make-wifi-fast, ath9k-devel@lists.ath9k.org On Fri, May 13, 2016 at 12:20 PM, Aaron Wood <woody77@gmail.com> wrote: > On Fri, May 13, 2016 at 11:57 AM, Dave Taht <dave.taht@gmail.com> wrote: >> >> On Fri, May 13, 2016 at 11:05 AM, Bob McMahon <bob.mcmahon@broadcom.com> >> wrote: >>> >>> don't have the data available for multiple flows at the moment. >> >> >> The world is full of folk trying to make single tcp flows go at maximum >> speed, with multiple alternatives to cubic. > > > And most web traffic is multiple-flow, even with HTTP/2 and SPDY, due to > domain/host sharding. Many bursts of multi-flow traffic. About the only > thing that's single-flow is streaming-video (which isn't latency sensitive). And usually, rate limited. It would be nice if that streaming video actually fit into a single txop in many cases. > The only local services that I know of that could use maximal-rate wifi are > NAS systems using SMB, AFP, etc. And many of these, until recently, were actually bound by the speed of their hard disks and by inefficiencies in the protocol. > > -Aaron Useful flent tests for seeing the impact of good congestion control are tcp_2up_square and tcp_2up_delay. There are also other related tests like "reno_cubic_westwood_cdg" which try one form of tcp against another. I really should sit down and write a piece about these, to try to show that one flow grabbing all the link hurts all successor flows. Both could be better. I like what the teacup people are doing here, using 3 staggered flows to show their results. ... and I misspoke a bit earlier, meant to say txop where instead I'd said ampdu. Multiple ampdus can fit into a txop, and so far as I know, be block acked differently. https://books.google.com/books?id=XsF5CgAAQBAJ&pg=PA32&lpg=PA32&dq=multiple+ampdus+in+a+txop&source=bl&ots=dRCYcD9rBc&sig=tVocMORuEXBOsfUlcmuSLTdM0Lw&hl=en&sa=X&ved=0ahUKEwiLxurP69fMAhVU5WMKHVejAlUQ6AEIHzAA#v=onepage&q=multiple%20ampdus%20in%20a%20txop&f=false One thing I don't have a grip on is the airtime cost of packing multiple ampdus into a txop, in terms of the block ack, also in relation to using A-MSDUs as per the 2015 paper referenced off the "thoughts about airtime fairness thread" that the ath9k list was not cc'd on. https://lists.bufferbloat.net/pipermail/make-wifi-fast/2016-May/000661.html I note that some of my comments on that thread were due to the overly EE- and math-oriented analysis of the "perfect" solution, but I'm over that now. :) It was otherwise one of the best recent papers on wifi I've read, and more should read: http://www.hindawi.com/journals/misy/2015/548109/ (and all the other cites in that thread were good, too. MIT had the basics right back in 2003!) One of my longer-term dreams for better congestion control in wifi is to pack one aggregate in a txop with stuff you care about deeply, and a second, with stuff you don't (or vice versa). As also per here, filling in my personal memory gap from 2004 or so (when I thought block acks would only be used on critical traffic) and where I started going back and reviewing the standard. http://blog.cerowrt.org/post/selective_unprotect/ -- Dave Täht Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 25+ messages in thread
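On the open question of the airtime cost of packing multiple A-MPDUs (and their block acks) into one TXOP, a very rough sketch is possible even without real 802.11 timing. The PHY rates, the per-block-ack cost and the TXOP budget below are all assumptions for illustration, ignoring preambles, MPDU delimiters, protection and the rate the BA itself is sent at:

```python
# Crude airtime model: how many full 64-MPDU A-MPDUs (plus one block ack
# exchange each) fit in an assumed TXOP budget at various PHY rates.
# All constants are illustrative assumptions, not 802.11 timing math.
AMPDU_BYTES  = 64 * 1538      # 64 max-size MPDUs including MAC framing
BLOCK_ACK_US = 68.0           # assumed cost of one BA exchange incl. SIFS
TXOP_US      = 3000.0         # assumed ~3 ms TXOP budget

for phy_mbps in (6, 20, 65, 150, 300, 866):
    ampdu_us = AMPDU_BYTES * 8 / phy_mbps      # bytes*8 bits / Mbit/s -> us
    per_agg  = ampdu_us + BLOCK_ACK_US
    fits     = int(TXOP_US // per_agg)
    print(f"{phy_mbps:3d} Mbit/s: full A-MPDU ~= {ampdu_us/1000:6.2f} ms, "
          f"{fits} fit in a {TXOP_US/1000:.0f} ms TXOP")
```

At low and mid rates a full 64-MPDU aggregate does not fit at all, so the driver has to cap the aggregate to the TXOP anyway; only at the higher VHT-class rates do several full aggregates share one TXOP, and that is where the cost of the extra block acks starts to matter.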
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 20:21 ` Dave Taht @ 2016-05-13 20:51 ` Dave Taht 0 siblings, 0 replies; 25+ messages in thread From: Dave Taht @ 2016-05-13 20:51 UTC (permalink / raw) To: Aaron Wood; +Cc: Bob McMahon, make-wifi-fast, ath9k-devel@lists.ath9k.org The reason I originally cc'd ath9k-devel on this thread was merely that I wanted to know if the diagram of the ath9k driver's current structure was actually accurate. Is it? http://blog.cerowrt.org/post/wifi_software_paths/ The "make-wifi-fast" mailing list exists merely because keeping up on patch traffic on linux-wireless and elsewhere is too difficult for many of the people on it (including me!), and we do tend to wax and wane on issues of more theoretical interest. That said, I have a habit of cross-posting that comes from being an ancient decrepit former usenet junkie where topics did engage multiple people, and it's also a way to draw attention to the project here. It is *very* annoying how many posts get bounced off the mailing lists... still.... Arguing in an echo chamber is futile. I am curious, in terms of further outreach, if there are any other good wifi/5g/6lowpan/wireless mailing lists to be on - I'd like in particular to find one for iwl and the mediatek chips, and if anyone is working on fixing up the rpi3's wifi, that would be good too. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Make-wifi-fast] Diagram of the ath9k TX path 2016-05-13 17:46 ` Bob McMahon 2016-05-13 17:49 ` Dave Taht @ 2016-05-13 20:49 ` David Lang 1 sibling, 0 replies; 25+ messages in thread From: David Lang @ 2016-05-13 20:49 UTC (permalink / raw) To: Bob McMahon; +Cc: Dave Taht, ath9k-devel@lists.ath9k.org, make-wifi-fast [-- Attachment #1: Type: TEXT/Plain, Size: 5126 bytes --] On Fri, 13 May 2016, Bob McMahon wrote: > On driver delays, from a driver development perspective the problem isn't > to add delay or not (it shouldn't) it's that the TCP stack isn't presenting > sufficient data to fully utilize aggregation. Below is a histogram > comparing aggregations of 3 systems (units are mpdu per ampdu.) The lowest > latency stack is in purple and it's also the worst performance with respect > to average throughput. From a driver perspective, one would like TCP to > present sufficient bytes into the pipe that the histogram leans toward the > blue. The problem isn't providing sufficient bytes into the pipe, it's that to do aggregations sanely you need lots of bytes for a particular station in a separate queue from the bytes for each of the other stations. Current qdisc settings don't provide this to the driver, they provide a packet to host A, a packet to host B, a packet to host C, etc. What's needed is for the higher level (fq_codel, etc) to provide a queue for host A, a queue for host B, a queue for host C and throttle what goes into these queues, and then the driver grabs an aggregate's worth from the first queue and sends it to host A, an aggregate's worth for host B (which is a different number of bytes because B has a different data rate than A), etc. The current model of the tcp stack having one set of queues and the driver having a separate set of queues just doesn't work for this. One of the sets of queues needs to go away and the queues that remain need to be able to be filled with fairness between flows in mind (fq_codel or similar) and drained with fairness of airtime in mind (and efficient aggregation). David Lang > [image: Inline image 1] > I'm not an expert on TCP near congestion avoidance but maybe the algorithm > could benefit from RTT as weighted by CWND (or bytes in flight) and hunt > that maximum? > > Bob > > On Mon, May 9, 2016 at 8:41 PM, David Lang <david@lang.hm> wrote: > >> On Mon, 9 May 2016, Dave Taht wrote: >> >> On Mon, May 9, 2016 at 7:25 PM, Jonathan Morton <chromatix99@gmail.com> >>> wrote: >>> >>>> >>>> On 9 May, 2016, at 18:35, Dave Taht <dave.taht@gmail.com> wrote: >>>>> >>>>> should we always wait a little bit to see if we can form an aggregate? >>>>> >>>> >>>> I thought the consensus on this front was “no”, as long as we’re making >>>> the decision when we have an immediate transmit opportunity. >>>> >>> >>> I think it is more nuanced than how david lang has presented it. >>> >> >> I have four reasons for arguing for no speculative delays. >> >> 1. airtime that isn't used can't be saved. >> >> 2. lower best-case latency >> >> 3. simpler code >> >> 4. clean, and gradual service degredation under load. >> >> the arguments against are: >> >> 5. throughput per ms of transmit time is better if aggregation happens >> than if it doesn't. >> >> 6. if you don't transmit, some other station may choose to before you >> would have finished. >> >> #2 is obvious, but with the caviot that anytime you transmit you may be >> delaying someone else. >> >> #1 and #6 are flip sides of each other. we want _someone_ to use the >> airtime, the question is who.
>> >> #3 and #4 are closely related. >> >> If you follow my approach (transmit immediately if you can, aggregate when >> you have a queue), the code really has one mode (plus queuing). "If you >> have a Transmit Oppertunity, transmit up to X packets from the queue", and >> it doesn't matter if it's only one packet. >> >> If you delay the first packet to give you a chance to aggregate it with >> others, you add in the complexity and overhead of timers (including >> cancelling timers, slippage in timers, etc) and you add "first packet, >> start timers" mode to deal with. >> >> I grant you that the first approach will "saturate" the airtime at lower >> traffic levels, but at that point all the stations will start aggregating >> the minimum amount needed to keep the air saturated, while still minimizing >> latency. >> >> I then expect that application related optimizations would then further >> complicate the second approach. there are just too many cases where small >> amounts of data have to be sent and other things serialize behind them. >> >> DNS lookup to find a domain to then to a 3-way handshake to then do a >> request to see if the <web something> library has been updated since last >> cached (repeat for several libraries) to then fetch the actual page >> content. All of these thing up to the actual page content could be single >> packets that have to be sent (and responded to with a single packet), >> waiting for the prior one to complete. If you add a few ms to each of >> these, you can easily hit 100ms in added latency. Once you start to try and >> special cases these sorts of things, the code complexity multiplies. >> >> So I believe that the KISS approach ends up with a 'worse is better' >> situation. >> >> David Lang >> _______________________________________________ >> Make-wifi-fast mailing list >> Make-wifi-fast@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/make-wifi-fast >> >> > [-- Attachment #2: Type: IMAGE/PNG, Size: 27837 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
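What David Lang is describing (per-station queues, filled with per-flow fairness above, drained by airtime rather than packet count below) has a fairly compact shape. A toy sketch of that structure, in illustrative Python only and not mac80211/ath9k code, with the rate estimate, quantum and aggregate cap as assumed parameters:

```python
# Toy model of "one queue per station, drain by airtime": deficit round
# robin where the deficit is counted in microseconds of estimated airtime,
# not in packets or bytes.  Illustrative only; not mac80211/ath9k code.
from collections import defaultdict, deque

class PerStationTx:
    QUANTUM_US = 1000                       # airtime credit added per visit

    def __init__(self, rate_bps_for):
        self.queues  = defaultdict(deque)   # station -> FIFO of packets (bytes)
        self.deficit = defaultdict(float)   # station -> unused airtime credit
        self.rate_bps_for = rate_bps_for    # per-station rate-control estimate

    def enqueue(self, station, pkt):
        # in a real stack, per-flow fairness and AQM (fq_codel-style) would
        # run here, before the packet lands in the station's queue
        self.queues[station].append(pkt)

    def airtime_us(self, station, nbytes):
        return nbytes * 8 * 1e6 / self.rate_bps_for(station)

    def next_aggregate(self, max_agg_us=2000.0):
        # simplified: stations are scanned in dict order on every call
        for station, q in self.queues.items():
            if not q:
                continue
            self.deficit[station] += self.QUANTUM_US
            budget = min(self.deficit[station], max_agg_us)
            agg, used = [], 0.0
            while q and used + self.airtime_us(station, len(q[0])) <= budget:
                pkt = q.popleft()
                used += self.airtime_us(station, len(pkt))
                agg.append(pkt)
            if agg:
                self.deficit[station] -= used
                return station, agg
        return None, []
```

Counting the deficit in microseconds rather than bytes is what makes a slow station's short aggregate and a fast station's 64-MPDU aggregate cost the same share of air, which is the "drained with fairness of airtime in mind" half of the argument.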
end of thread, other threads:[~2016-05-13 20:51 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-05-09 11:00 [Make-wifi-fast] Diagram of the ath9k TX path Toke Høiland-Jørgensen 2016-05-09 15:35 ` Dave Taht 2016-05-10 2:25 ` Jonathan Morton 2016-05-10 2:59 ` Dave Taht 2016-05-10 3:30 ` [Make-wifi-fast] [ath9k-devel] " Adrian Chadd 2016-05-10 4:04 ` Dave Taht 2016-05-10 4:22 ` Aaron Wood 2016-05-10 7:15 ` Adrian Chadd 2016-05-10 7:17 ` Adrian Chadd 2016-05-10 3:41 ` [Make-wifi-fast] " David Lang 2016-05-10 4:59 ` Dave Taht 2016-05-10 5:22 ` David Lang 2016-05-10 9:04 ` Toke Høiland-Jørgensen 2016-05-11 14:12 ` Dave Täht 2016-05-11 15:09 ` Dave Taht 2016-05-11 15:20 ` Toke Høiland-Jørgensen 2016-05-13 17:46 ` Bob McMahon 2016-05-13 17:49 ` Dave Taht 2016-05-13 18:05 ` Bob McMahon 2016-05-13 18:11 ` Bob McMahon 2016-05-13 18:57 ` Dave Taht 2016-05-13 19:20 ` Aaron Wood 2016-05-13 20:21 ` Dave Taht 2016-05-13 20:51 ` Dave Taht 2016-05-13 20:49 ` David Lang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox