* Re: [Cerowrt-devel] periodic hang of ath9k
@ 2014-07-14 23:10 Dave Taht
0 siblings, 0 replies; 5+ messages in thread
From: Dave Taht @ 2014-07-14 23:10 UTC (permalink / raw)
To: R., ath9k-devel; +Cc: cerowrt-devel
I have little doubt that we have been coping with more than one bug.
On Mon, Jul 14, 2014 at 4:02 PM, R. <redag2@gmail.com> wrote:
> Hello David & list,
>
> Do you think we could get a build with CONFIG_PACKAGE_ATH_DEBUG
> enabled? I've been following OpenWRT ticket #15320 and it looks like
> this would be the way to go in order to get field logs of WiFi issues
> from end-users.
>
> In the meantime, here are the latest WiFi failures on my end (running
> CeroWRT 3.10.44-6) -- recovered within two minutes:
>
> [14378.023437] ath: phy0: Failed to stop TX DMA, queues=0x004!
> [15130.140625] ath: phy0: Failed to stop TX DMA, queues=0x004!
> [15349.164062] ath: phy0: Failed to stop TX DMA, queues=0x004!
> [15349.179687] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024
> AR_DIAG_SW=0x42000020 DMADBG_7=0x000084c0
> [15349.191406] ath: phy0: Could not stop RX, we could be confusing the
> DMA engine when we start RX up
> [16886.886718] ath: phy0: Failed to stop TX DMA, queues=0x004!
> [19839.468750] ath: phy0: Failed to stop TX DMA, queues=0x005!
> [20286.019531] ath: phy0: Failed to stop TX DMA, queues=0x004!
> [20825.996093] ath: phy0: Failed to stop TX DMA, queues=0x005!
> [48749.316406] ath: phy0: Failed to stop TX DMA, queues=0x004!
> [48749.433593] ath: phy0: Failed to stop TX DMA, queues=0x004!
> -------------
> root@cerowrt:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
> Baseband Hang: 0
> Baseband Watchdog: 0
> Fatal HW Error: 0
> TX HW error: 0
> Transmit timeout: 0
> TX Path Hang: 1
> PLL RX Hang: 0
> MAC Hang: 13
> Stuck Beacon: 7
> MCI Reset: 0
> Calibration error: 1
> -------------
> Reply from 74.125.225.112: bytes=32 time=39ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=114ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=42ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=40ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=1323ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
> Request timed out.
> Request timed out.
> Request timed out.
> Request timed out.
> Request timed out.
> Reply from 74.125.225.112: bytes=32 time=258ms TTL=50
> Request timed out.
> Reply from 74.125.225.112: bytes=32 time=1043ms TTL=50
> Request timed out.
> Reply from 74.125.225.112: bytes=32 time=206ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=91ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
> Reply from 74.125.225.112: bytes=32 time=40ms TTL=50
>
> On Mon, Jul 14, 2014 at 2:51 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
>> I think the stock netgear firmware has similar issues; 2G wireless is
>> flaky when the Macs are using it.
>>
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
--
Dave Täht
NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
* Re: [Cerowrt-devel] periodic hang of ath9k
2014-07-14 18:51 ` Stephen Hemminger
@ 2014-07-14 23:02 ` R.
0 siblings, 0 replies; 5+ messages in thread
From: R. @ 2014-07-14 23:02 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: cerowrt-devel
Hello David & list,
Do you think we could get a build with CONFIG_PACKAGE_ATH_DEBUG
enabled? I've been following OpenWRT ticket #15320 and it looks like
this would be the way to go in order to get field logs of WiFi issues
from end-users.
In the meantime, here are the latest WiFi failures on my end (running
CeroWRT 3.10.44-6) -- recovered within two minutes:
[14378.023437] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15130.140625] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15349.164062] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15349.179687] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024
AR_DIAG_SW=0x42000020 DMADBG_7=0x000084c0
[15349.191406] ath: phy0: Could not stop RX, we could be confusing the
DMA engine when we start RX up
[16886.886718] ath: phy0: Failed to stop TX DMA, queues=0x004!
[19839.468750] ath: phy0: Failed to stop TX DMA, queues=0x005!
[20286.019531] ath: phy0: Failed to stop TX DMA, queues=0x004!
[20825.996093] ath: phy0: Failed to stop TX DMA, queues=0x005!
[48749.316406] ath: phy0: Failed to stop TX DMA, queues=0x004!
[48749.433593] ath: phy0: Failed to stop TX DMA, queues=0x004!
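
For reading the queues= values in the log above, here is a minimal Python sketch. It assumes that the value is a bitmask with one bit per hardware TX queue, and that bit N corresponds to the qnum N shown in the ath9k "queues" debugfs listing quoted elsewhere in this thread (VO=0, VI=1, BE=2, BK=3, CAB=8). Neither assumption has been checked against the driver source here, and the function names are made up for illustration.

```python
import re

# Queue names keyed by qnum, copied from the "queues" debugfs listing quoted
# in this thread. ASSUMPTION: bit N of the logged mask refers to qnum N.
QUEUE_NAMES = {0: "VO", 1: "VI", 2: "BE", 3: "BK", 8: "CAB"}

# Matches the kernel log message and captures the hex mask.
LOG_RE = re.compile(r"Failed to stop TX DMA, queues=0x([0-9a-fA-F]+)")


def decode_queue_mask(mask):
    """Return the names of the queues whose bit is set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Unknown bits are reported as "q<bit>" rather than dropped.
            names.append(QUEUE_NAMES.get(bit, "q%d" % bit))
        bit += 1
    return names


def queues_in_log_line(line):
    """Decode one dmesg line, or return None if it is a different message."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    return decode_queue_mask(int(match.group(1), 16))
```

Under those assumptions, 0x004 points at BE alone and 0x005 at VO plus BE, which would be consistent with BE being the queue that shows up stopped in the report at the start of this thread.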
-------------
root@cerowrt:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 1
PLL RX Hang: 0
MAC Hang: 13
Stuck Beacon: 7
MCI Reset: 0
Calibration error: 1
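
If people want to send these counters in from the field, a small parser makes them easier to compare between reports. This is only a sketch: it assumes the "name: count" layout shown above, and the function names are hypothetical.

```python
def parse_reset_counters(text):
    """Parse 'name: count' lines from the ath9k reset debugfs file.

    Lines that do not end in a non-negative integer (shell prompts,
    separators) are ignored.
    """
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        value = value.strip()
        if value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def nonzero_by_count(counters):
    """Return (name, count) pairs with count > 0, largest count first."""
    hits = [(name, count) for name, count in counters.items() if count > 0]
    return sorted(hits, key=lambda item: -item[1])
```

Fed the listing above, this puts MAC Hang (13) and Stuck Beacon (7) at the top, out of 22 resets in total.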
-------------
Reply from 74.125.225.112: bytes=32 time=39ms TTL=50
Reply from 74.125.225.112: bytes=32 time=114ms TTL=50
Reply from 74.125.225.112: bytes=32 time=42ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Reply from 74.125.225.112: bytes=32 time=40ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Reply from 74.125.225.112: bytes=32 time=1323ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 74.125.225.112: bytes=32 time=258ms TTL=50
Request timed out.
Reply from 74.125.225.112: bytes=32 time=1043ms TTL=50
Request timed out.
Reply from 74.125.225.112: bytes=32 time=206ms TTL=50
Reply from 74.125.225.112: bytes=32 time=91ms TTL=50
Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
Reply from 74.125.225.112: bytes=32 time=40ms TTL=50
On Mon, Jul 14, 2014 at 2:51 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> I think the stock netgear firmware has similar issues; 2G wireless is
> flaky when the Macs are using it.
* Re: [Cerowrt-devel] periodic hang of ath9k
2014-07-13 19:18 Dave Taht
2014-07-14 4:25 ` Sujith Manoharan
@ 2014-07-14 18:51 ` Stephen Hemminger
2014-07-14 23:02 ` R.
1 sibling, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2014-07-14 18:51 UTC (permalink / raw)
To: Dave Taht; +Cc: ath9k-devel, cerowrt-devel
I think the stock netgear firmware has similar issues; 2G wireless is
flaky when the Macs are using it.
* Re: [Cerowrt-devel] periodic hang of ath9k
2014-07-13 19:18 Dave Taht
@ 2014-07-14 4:25 ` Sujith Manoharan
2014-07-14 18:51 ` Stephen Hemminger
1 sibling, 0 replies; 5+ messages in thread
From: Sujith Manoharan @ 2014-07-14 4:25 UTC (permalink / raw)
To: Dave Taht; +Cc: ath9k-devel, cerowrt-devel
Dave Taht wrote:
> cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
>
> this bug, which some people (usually on macs with low signal strength)
> can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
> crazy...
Does stock OpenWrt also have this bug, or is this specific to Cerowrt ?
Sujith
* [Cerowrt-devel] periodic hang of ath9k
@ 2014-07-13 19:18 Dave Taht
2014-07-14 4:25 ` Sujith Manoharan
2014-07-14 18:51 ` Stephen Hemminger
0 siblings, 2 replies; 5+ messages in thread
From: Dave Taht @ 2014-07-13 19:18 UTC (permalink / raw)
To: Sebastian Moeller; +Cc: ath9k-devel, cerowrt-devel
cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...
some new details below
On Sun, Jul 13, 2014 at 10:44 AM, Sebastian Moeller <moeller0@gmx.de> wrote:
> Hi List, hi Dave,
>
> I just had a case of devices on the 2.4GHz radio not connecting anymore (the 5GHz radio still worked).
>
> This output was stable while the devices failed to obtain IP addresses:
> root@nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues
> (VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
> (VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
> (BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 13 stopped: 1
Yes, that is the one sure symptom of something like bug 442. The
default in cerowrt is a queue depth of 12 for BE, and pending is
stuck at 13, which kind of implies we have some sort of off-by-one or
atomic update/race condition problem somewhere.
Maybe setting it even lower (say, 4) and doing exhaustive tests will
trigger the bug more often?
In the next release, I've given up on setting the queue depth
entirely and left it at the default (123, I think). I HAVE seen it
hang with it set that high too. More recent benchmarks show 48
offering more throughput than 12, with the usual terrible latency
regardless.
(the right thing here is to rework queue handling entirely to work on
a per-sta basis, incorporate fq_codel like ideas down there, etc, but
we're a long way from doing that as yet)
> (BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
> (CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
>
> Note one of the devices was connected to the same radio before and got somehow forced to reconnect and failed to actually do so…
>
> Here is what I saw from log read:
>
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: authenticated
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: associated (aid 2)
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 WPA: pairwise key handshake completed (RSN)
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
>
>
> I reconnected the 2.4GHz radio from https://gw.home.lan:81/cgi-bin/luci/;stok=64f33ba722ed8a68b13ad5644e60629b/admin/network/network (by hitting the connect button for sw00)
>
> Now I see:
>
> root@nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues
> (VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
> (VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
> (BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
> (BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
> (CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
>
> So this seems to be “fixed” without having to reboot cerowrt. So we might consider working around this bug by checking /sys/kernel/debug/ieee80211/phy0/ath9k/queues repeatedly and reconnecting the 2.4GHz radio if the queue seems stopped?
I don't know what "reconnecting" actually triggers in the backend uci code.
(anyone? a network reload? what?)
Certainly if we could check for "stuckness" and then do the right
thing that might be a workaround.
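
A rough Python sketch of what such a "stuckness" check could look like. It treats a queue as stuck when it reports stopped: 1 with pending > 0 for several polls in a row; that rule is an assumption drawn from the single report above, not a confirmed signature of bug 442. The recovery step is left as a caller-supplied callback (on_stuck, a hypothetical name) because what "reconnecting" does in the backend is still an open question. The poll count and interval are likewise arbitrary.

```python
import re
import time

# Debugfs path as quoted in this thread; phy0 is the 2.4GHz radio there.
QUEUES_PATH = "/sys/kernel/debug/ieee80211/phy0/ath9k/queues"

# One queue per line, in the layout shown in the report above.
LINE_RE = re.compile(
    r"\((?P<name>\w+)\):\s*qnum:\s*(?P<qnum>\d+)\s+qdepth:\s*(?P<qdepth>\d+)\s+"
    r"ampdu-depth:\s*(?P<ampdu>\d+)\s+pending:\s*(?P<pending>\d+)\s+"
    r"stopped:\s*(?P<stopped>\d+)"
)
FIELDS = ("qnum", "qdepth", "ampdu", "pending", "stopped")


def read_queues_file(path=QUEUES_PATH):
    """Return the current contents of the queues debugfs file."""
    with open(path) as handle:
        return handle.read()


def parse_queues(text):
    """Map queue name (VO, VI, BE, ...) to a dict of its integer fields."""
    return {
        m.group("name"): {field: int(m.group(field)) for field in FIELDS}
        for m in LINE_RE.finditer(text)
    }


def stuck_queues(queues):
    """ASSUMED rule: stopped with frames still pending counts as stuck."""
    return [name for name, q in queues.items()
            if q["stopped"] and q["pending"] > 0]


def watch(read_text, on_stuck, samples_needed=3, interval=10.0,
          sleep=time.sleep, max_polls=None):
    """Poll and call on_stuck(names) after samples_needed stuck polls in a row."""
    streak = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        stuck = stuck_queues(parse_queues(read_text()))
        streak = streak + 1 if stuck else 0
        if streak >= samples_needed:
            on_stuck(stuck)
            streak = 0  # start counting again after the recovery attempt
        polls += 1
        sleep(interval)
```

On a router this would be started as watch(read_queues_file, some_recovery_function). Requiring several consecutive samples is meant to avoid firing on a queue that is only briefly stopped under load.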
> Now I would like to know whether this actually is bug 442 (I seem to recall that other afflicted users needed to reboot cerowrt to get the radio back, or did they simply not try to get sw00 unstuck in a less drastic manner?
No, most people just hit it with a large hammer and start over,
without letting me get a crack at it.
> And/or maybe the re-connect only fixes some symptoms and the router will wedge later on for good?) I attached the output of cerostats.sh just in case someone has an idea what to try next...
Damned if I know, but this is progress of a sort.
> Best Regards
> Sebastian
--
Dave Täht
NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article