[Make-wifi-fast] TCP performance regression in mac80211 triggered by the fq code

Tue Jul 12 10:02:15 EDT 2016

On Tue, Jul 12, 2016 at 3:21 PM, Felix Fietkau <nbd at nbd.name> wrote:
> On 2016-07-12 14:13, Dave Taht wrote:
>> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <nbd at nbd.name> wrote:
>>> Hi,
>>>
>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>>> regression when running local iperf on an AP (running the txq stuff) to
>>> a wireless client.
>>
>> Your kernel? cpu architecture?
> QCA9558, 720 MHz, running Linux 4.4.14
>
>> What happens when going through the AP to a server from the wireless client?
> Will test that next.
>
>> Which direction?
> AP->STA, iperf running on the AP. Client is a regular MacBook Pro
> (Broadcom).

There are always 2 wifi chips in play. Like the Sith.

>>> Here's some things that I found:
>>> - when I use only one TCP stream I get around 90-110 Mbit/s
>>
>> with how much cpu left over?
> ~20%
>
>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>> with how much cpu left over?
> ~30%

Hmm.

Care to try netperf?

>
>> context switch difference between the two tests?
> What's the easiest way to track that?

If you have GNU "time" (not the shell builtin): time -v the_process

or:

perf record -e context-switches -ag

or: check /proc/$PID/status for the ctxt_switches lines
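
For the /proc route, a minimal sketch in Python, assuming a Linux
/proc/$PID/status with the usual voluntary_ctxt_switches and
nonvoluntary_ctxt_switches lines. The function names are mine, nothing
standard. Sample before and after each test and diff the counts:

```python
def parse_ctxt_switches(status_text):
    """Pull the *_ctxt_switches counters out of /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.endswith("ctxt_switches"):
            counts[key] = int(value.strip())
    return counts


def read_ctxt_switches(pid):
    """Read the counters for a live process (hypothetical helper)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_ctxt_switches(f.read())
```

Call read_ctxt_switches() with the iperf PID at the start and end of the
single-stream and multi-stream runs and compare the deltas.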

>> tcp_limit_output_bytes is?
> 262144

I keep hoping to be able to reduce this to something saner like 4096
one day. It got bumped to 64k based on bad wifi performance once, and
then to its current size to make the Xen folk happier.

The other param I'd like to see fiddled with is tcp_notsent_lowat.

In both cases reductions will increase your context switches but
reduce memory pressure and lead to a more reactive tcp.

That said, I don't think either one is the real cause of this problem.
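
If you want to log both knobs next to each test result, here is a rough
sketch, assuming the usual /proc/sys/net/ipv4 location. The helper name
and the "base" argument are my own invention; only the two sysctl names
come from the discussion above:

```python
import os


def read_tcp_sysctls(base="/proc/sys/net/ipv4"):
    """Return the current values of the two TCP knobs, or None if unreadable."""
    values = {}
    for name in ("tcp_limit_output_bytes", "tcp_notsent_lowat"):
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing file (older kernel) or unexpected content.
            values[name] = None
    return values
```

Print the result before each run so a changed setting can be matched to
the throughput and context switch numbers it produced.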

>> got perf?
> Need to make a new build for that.
>
>>> - fairness between TCP streams looks completely fine
>>
>> A codel will get to long term fairness pretty fast. Packet captures
>> from a fq will show much more regular interleaving of packets,
>> regardless.
>>
>>> - there's no big queue buildup, the code never actually drops any packets
>>
>> A "trick" I have been using to observe codel behavior has been to
>> enable ECN on server and client, then check in wireshark for
>> CE-marked packets (ECN field value 3).
> I verified this with printk. The same issue already appears if I have
> just the fq patch (with the codel patch reverted).

OK. A four flow test "should" trigger codel....

Running out of cpu (or hitting some other bottleneck), without
loss/marking "should" result in a tcptrace -G and xplot.org of the
packet capture showing the window continuing to increase....

>>> - if I put a hack in the fq code to force the hash to a constant value
>>
>> You could also set "flows" to 1 to keep the hash being generated, but
>> not actually use it.
>>
>>> (effectively disabling fq without disabling codel), the problem
>>> disappears and even multiple streams get proper performance.
>>
>> Meaning you get 90-110Mbits ?
> Right.
>
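
To illustrate why setting "flows" to 1 has the same effect as forcing
the hash to a constant: the queue index is the hash scaled into the
range [0, flows). A toy sketch follows. The crc32 hash is only a
stand-in I picked for illustration, not what the kernel uses, and
flow_index is a made-up name:

```python
import zlib


def reciprocal_scale(value, ep_ro):
    """Map a 32-bit value into [0, ep_ro) without a division."""
    return (value * ep_ro) >> 32


def flow_index(five_tuple, flows):
    """Pick a queue for a flow; crc32 is an illustrative stand-in hash."""
    h = zlib.crc32(repr(five_tuple).encode()) & 0xFFFFFFFF
    return reciprocal_scale(h, flows)


streams = [("10.0.0.1", 5001, "10.0.0.2", 40000 + i, "tcp") for i in range(4)]
print([flow_index(s, 1) for s in streams])     # every stream lands in queue 0
print([flow_index(s, 4096) for s in streams])  # streams spread over queues
```

With flows set to 1 the hash is still computed for every packet, but
all streams share a single queue, so only the cost of the per-flow
queueing itself is removed.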
>> Do you have a "before toke" figure for this platform?
> It's quite similar.
>
>>> Please let me know if you have any ideas.
>>
>> I am in berlin, packing hardware...
> Nice!
>
> - Felix
>

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org