[Cerowrt-devel] periodic hang of ath9k

Sun Jul 13 15:18:46 EDT 2014

cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442

this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...

some new details below

On Sun, Jul 13, 2014 at 10:44 AM, Sebastian Moeller <moeller0 at gmx.de> wrote:
> Hi List, hi Dave,
>
> I just had a case of devices on the 2.4GHz radio not connecting anymore (the 5GHz radio still worked).
>
> This output was stable while the devices failed to obtain IP addresses:
> root at nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues
> (VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:  13 stopped: 1

Yes, that is the one sure symptom of something like bug 442. As the
default in cerowrt is a queue depth of 12 for BE, this kind of implies
we have some sort of off-by-one or atomic update/race condition
problem somewhere.

Maybe setting it even lower (say, 4) and doing exhaustive tests will
trigger the bug more often?

In the next release, I've given up on setting the queue depth
entirely and left it at the default of 123 (I think). I HAVE seen it
hang with it set that high too, and more recent benchmarks show 48
offering more throughput than 12, with the usual terrible latency
regardless.

(the right thing here is to rework queue handling entirely to work on
a per-sta basis, incorporate fq_codel-like ideas down there, etc, but
we're a long way from doing that as yet)

> (BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
>
> Note one of the devices was connected to the same radio before and got somehow forced to reconnect and failed to actually do so…
>
> Here is what I saw from log read:
>
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: authenticated
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: associated (aid 2)
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 WPA: pairwise key handshake completed (RSN)
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
>
>
>         I reconnected the 2.4GHz radio from https://gw.home.lan:81/cgi-bin/luci/;stok=64f33ba722ed8a68b13ad5644e60629b/admin/network/network (by hitting the connect button for sw00)
>
> Now I see:
>
> root at nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues
> (VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
>
> So this seems to be “fixed” without having to reboot cerowrt. So we might consider working around this bug by checking /sys/kernel/debug/ieee80211/phy0/ath9k/queues repeatedly and reconnecting the 2.4GHz radio if the queue seems stopped?

I don't know what "reconnecting" actually triggers in the backend uci code.
(anyone? a network reload? what?)

Certainly if we could check for "stuckness" and then do the right
thing that might be a workaround.
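
A rough sketch of what a "stuckness" check could look like, written
against the queues output quoted above. The function names and the
definition of "stuck" (stopped, nothing queued in hardware, and the same
nonzero pending count on two consecutive samples) are assumptions drawn
from this one report, not anything the driver guarantees. The recovery
step is left to the caller, since it is still unclear what "reconnect"
actually does.

```python
import re

# One line of /sys/kernel/debug/ieee80211/phy0/ath9k/queues, in the
# format shown in the report above.
QUEUE_RE = re.compile(
    r"\((?P<name>\w+)\):\s*qnum:\s*(?P<qnum>\d+)\s+"
    r"qdepth:\s*(?P<qdepth>\d+)\s+"
    r"ampdu-depth:\s*(?P<ampdu>\d+)\s+"
    r"pending:\s*(?P<pending>\d+)\s+"
    r"stopped:\s*(?P<stopped>\d+)"
)

FIELDS = ("qnum", "qdepth", "ampdu", "pending", "stopped")


def parse_queues(text):
    """Return {queue name: {field: int}} for every queue line in text."""
    return {
        m.group("name"): {f: int(m.group(f)) for f in FIELDS}
        for m in QUEUE_RE.finditer(text)
    }


def stuck_queues(prev, curr):
    """Names of queues that look wedged across two consecutive samples.

    Assumed heuristic: the queue is stopped in both samples, the hardware
    queue is empty, and the pending count is nonzero and unchanged.
    """
    stuck = []
    for name, q in curr.items():
        p = prev.get(name)
        if p is None:
            continue
        if (
            p["stopped"] == 1
            and q["stopped"] == 1
            and q["qdepth"] == 0
            and q["pending"] > 0
            and q["pending"] == p["pending"]
        ):
            stuck.append(name)
    return stuck
```

A poller would read the file every few seconds, keep the previous parse
around, and only act once stuck_queues() returns the same queue for
several samples in a row, so that a briefly stopped but healthy queue
does not trigger a reset.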

> Now I would like to know whether this actually is bug 442 (I seem to recall that other afflicted users needed to reboot cerowrt to get the radio back, or did they simply not try to just get sw00 unstuck in a less drastic manner?

No, most people just hit it with a large hammer and start over,
without letting me get a crack at it.

> And/or maybe the re-connect only fixes some symptoms and the router will wedge later on for good?) I attached the output of cerostats.sh just in case someone has an idea what to try next...

Damned if I know, but this is progress of a sort.

> Best Regards
>         Sebastian

-- 
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article