* [Make-wifi-fast] On the ath9k performance regression with FQ and crypto
From: Toke Høiland-Jørgensen @ 2016-08-16 20:41 UTC
To: make-wifi-fast, linux-wireless; +Cc: Felix Fietkau, Michal Kazior, Dave Taht

So Dave and I have been spending the last couple of days trying to
narrow down why there's a performance regression in some cases on ath9k
with the softq-FQ patches. Felix first noticed this regression, and LEDE
currently carries a patch [1] to disable the FQ portion of the softq
patches to avoid it.

While we have been able to narrow it down a little bit, no solution has
been forthcoming, so this is an attempt to describe the bug in the hope
that someone else will have an idea about what could be causing it.

What we're seeing is the following (when the access point is running
ath9k with the softq patches):

When running two or more flows to a station, their combined throughput
will be roughly 20-30% lower than the throughput of a single flow to the
same station. This happens:

- for both TCP and UDP traffic.
- independent of the base rate (i.e. signal quality).
- but only with crypto enabled (WPA2 CCMP in this case).

However, the regression completely disappears if either of the
following is true:

- no crypto is enabled.
- the FQ part of mac80211 is disabled (as in [1]).

We have been able to reproduce this behaviour on two different ath9k
hardware chips and two different architectures.

The cause of the regression seems to be that the aggregates are smaller
when there are two flows than when there is only one. Adding debug
statements to the aggregate forming code indicates that this is because
no more packets are available when the aggregates are built (i.e.
ieee80211_tx_dequeue() returns NULL).

We have not been able to determine why the queues run empty when this
combination of circumstances arises. Since we easily get upwards of 120
Mbps of TCP throughput without crypto but with full FQ, it's clearly not
the hashing overhead in itself that does it (and the hashing also
happens with just one flow, so the overhead is still there). And the
crypto itself should be offloaded to hardware (shouldn't it? we do see a
marked drop in overall throughput from just enabling crypto), so how
would the queueing (say, mixing of packets from different flows)
influence that?

Does anyone have any ideas? We are stumped...

-Toke

[1] https://git.lede-project.org/?p=lede/nbd/staging.git;a=blob;f=package/kernel/mac80211/patches/220-fq_disable_hack.patch;h=7f420beea56335d5043de6fd71b5febae3e9bd79;hb=HEAD

* Re: [Make-wifi-fast] On the ath9k performance regression with FQ and crypto
From: Eric Dumazet @ 2016-08-16 20:47 UTC
To: Toke Høiland-Jørgensen
Cc: make-wifi-fast, linux-wireless, Felix Fietkau

Do you have tcpdumps of

1) sample with crypto

2) sample without crypto.

Looks like some TCP Small Queues interaction with skb->truesize, if GSO
is involved, or encapsulation adding overhead.

On Tue, 2016-08-16 at 22:41 +0200, Toke Høiland-Jørgensen wrote:
> When running two or more flows to a station, their combined throughput
> will be roughly 20-30% lower than the throughput of a single flow to the
> same station.

* Re: [Make-wifi-fast] On the ath9k performance regression with FQ and crypto
From: Kevin Hayes @ 2016-08-16 23:13 UTC
To: Eric Dumazet
Cc: Toke Høiland-Jørgensen, make-wifi-fast, linux-wireless, Felix Fietkau

> And the crypto itself should be offloaded to hardware (shouldn't it?
> we do see a marked drop in overall throughput from just enabling
> crypto)

Seems like you need to determine definitively whether the hw crypto is
enabled and actually happening. If not, then SW crypto could be
consuming, um, more CPU than you want. I assume you are working on this.

Beyond that, yes, the use of crypto will add about 16 B of overhead to
each MPDU, so maybe 1%. Maybe the reported MTU is less than 1500 B, so
something must fragment??

K++

--
Kevin Hayes

* Re: [Make-wifi-fast] On the ath9k performance regression with FQ and crypto
From: Dave Täht @ 2016-08-16 23:16 UTC
To: Eric Dumazet, Toke Høiland-Jørgensen
Cc: make-wifi-fast, linux-wireless, Felix Fietkau

On 8/16/16 10:47 PM, Eric Dumazet wrote:
> Do you have tcpdumps of
>
> 1) sample with crypto
>
> 2) sample without crypto.

Decrypted aircaps (ssid: borgen-public key: mysecret) for 1 flow and for
2 flows are at:

http://www.taht.net/~d/fqcryptbug/

There are also regular captures...

flent results for all test scenarios are compared in this graph:

http://www.taht.net/~d/fqcryptbug/cryptvsfqwndr3800.svg

Total throughput degrades as the number of flows increases in the
encrypted scenario: 80 mbits total with one flow, ~35 with 12.
(Elsewhere: 120mbit without encryption, with fq, any number of flows,
and you can see codel working at least somewhat.)

> Looks like some TCP Small queue interaction with skb->truesize, if GSO
> is involved, or encapsulation adding overhead.

My own suspicion has been around breaking the block ack window, or on
misunderstanding how complex aggregates are hw/sw retried.

* Re: [Make-wifi-fast] On the ath9k performance regression with FQ and crypto
From: Felix Fietkau @ 2016-08-17 4:18 UTC
To: Toke Høiland-Jørgensen, make-wifi-fast, linux-wireless

On 2016-08-16 22:41, Toke Høiland-Jørgensen wrote:
> The cause of the regression seems to be that the aggregates are smaller
> when there are two flows than when there is only one. Adding debug
> statements to the aggregate forming code indicates that this is because
> no more packets are available when the aggregates are built (i.e.
> ieee80211_tx_dequeue() returns NULL).
>
> Does anyone have any ideas? We are stumped...

I have not done any further tests, but based on your analysis, I think I
finally understand what's causing this issue:

The CCMP PN (crypto IV) is assigned in the tx path before a packet is
put into the txq. It is also used to protect against replay attacks, so
it is sensitive to reordering. The receiver is simply dropping any
packet where the PN value is lower than the highest PN received so far.

To fix this, we will have to move the IV/PN assignment to
ieee80211_tx_dequeue.

- Felix
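
The reordering effect Felix describes can be illustrated with a toy model. This is only a sketch under stated assumptions, not the mac80211 code: the function name `run`, the burst arrival pattern, the one-packet-per-flow round-robin scheduler, and the single shared PN counter are all hypothetical simplifications, and the receiver is modelled as dropping any packet whose PN is not strictly greater than the highest PN seen so far.

```python
from collections import deque
from itertools import count


def run(num_flows, packets_per_flow=8, burst=4):
    """Toy model: PN is assigned at enqueue, flows are dequeued round-robin.

    Returns (delivered, dropped) as counted by a receiver that applies a
    simple replay check. All parameters are illustrative assumptions.
    """
    pn_counter = count(1)  # one PN counter shared by all flows to the station
    queues = [deque() for _ in range(num_flows)]

    # Packets arrive in per-flow bursts, so runs of consecutive PNs end up
    # in the same per-flow queue.
    for _ in range(0, packets_per_flow, burst):
        for q in queues:
            for _ in range(burst):
                q.append(next(pn_counter))

    # Fair queueing, reduced to its simplest form: one packet per flow per round.
    sent = []
    while any(queues):
        for q in queues:
            if q:
                sent.append(q.popleft())

    # Receiver-side replay check: drop anything not newer than the highest PN.
    highest = 0
    delivered = dropped = 0
    for pn in sent:
        if pn <= highest:
            dropped += 1
        else:
            highest = pn
            delivered += 1
    return delivered, dropped


if __name__ == "__main__":
    print("1 flow :", run(1))
    print("2 flows:", run(2))
```

With a single flow the dequeue order matches the PN order and nothing is dropped; with two flows the scheduler interleaves the queues, PNs arrive out of order, and the replay check discards some packets. The drop fraction in this model depends entirely on the assumed burst size and scheduler, so it should not be read as a prediction of the 20-30% figure reported in the thread.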

* Re: [Make-wifi-fast] On the ath9k performance regression with FQ and crypto
From: Toke Høiland-Jørgensen @ 2016-08-17 12:04 UTC
To: Felix Fietkau; +Cc: make-wifi-fast, linux-wireless, Michal Kazior, Dave Taht

Felix Fietkau <nbd@nbd.name> writes:

> I have not done any further tests, but based on your analysis, I think I
> finally understand what's causing this issue:
> The CCMP PN (crypto IV) is assigned in the tx path before a packet is
> put into the txq. It is also used to protect against replay attacks, so
> it is sensitive to reordering. The receiver is simply dropping any
> packet where the PN value is lower than the highest PN received so far.
> To fix this, we will have to move the IV/PN assignment to
> ieee80211_tx_dequeue.

Indeed this seems to be the cause of the problem. Thanks! Will send a
patch once we've run it through another couple rounds of testing. Looks
promising so far :)

-Toke
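
The idea behind the proposed fix, assigning the PN only when a packet leaves the queue, can be sketched in the same toy style. This is not the patch referred to above; `run_pn_at_dequeue`, the round-robin scheduler, and the replay rule (drop unless the PN is strictly greater than the highest seen) are hypothetical stand-ins chosen for illustration.

```python
from collections import deque
from itertools import count


def run_pn_at_dequeue(num_flows, packets_per_flow=8):
    """Toy model: packets sit in per-flow queues without a PN; the PN is
    assigned at dequeue time, in the order packets are actually sent.

    Returns (delivered, dropped) as counted by a simple replay check.
    """
    # Queued packets carry only a payload id; no PN has been assigned yet.
    queues = [deque(range(packets_per_flow)) for _ in range(num_flows)]
    pn_counter = count(1)

    sent = []
    while any(queues):
        for q in queues:  # round-robin, one packet per flow per round
            if q:
                q.popleft()
                sent.append(next(pn_counter))  # PN assigned on the way out

    # Receiver-side replay check: drop anything not newer than the highest PN.
    highest = 0
    dropped = 0
    for pn in sent:
        if pn <= highest:
            dropped += 1
        else:
            highest = pn
    return len(sent) - dropped, dropped


if __name__ == "__main__":
    print("2 flows:", run_pn_at_dequeue(2))
```

Because the counter is advanced in transmit order, the PN sequence is monotonic regardless of how the scheduler interleaves flows, so the modelled receiver drops nothing.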