Development issues regarding the cerowrt test router project
 help / color / mirror / Atom feed
* [Cerowrt-devel] periodic hang of ath9k
@ 2014-07-13 19:18 Dave Taht
  2014-07-13 19:44 ` [Cerowrt-devel] [ath9k-devel] " Ben Greear
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Dave Taht @ 2014-07-13 19:18 UTC (permalink / raw)
  To: Sebastian Moeller; +Cc: ath9k-devel, cerowrt-devel

cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442

this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...

some new details below

On Sun, Jul 13, 2014 at 10:44 AM, Sebastian Moeller <moeller0@gmx.de> wrote:
> Hi List, hi Dave,
>
> I just had a case of devices on the 2.4GHz radio not connecting anymore (the 5GHz radio still worked).
>
> This output was stable while the devices failed to obtain IP addresses:
> root@nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues
> \(VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:  13 stopped: 1

Yes, that is the one sure symptom of something like bug 442. As the
default in cerowrt is a queue depth of 12 for BE, this kind of implies
we have some sort of off-by-one or atomic update/race condition
problem somewhere.

Maybe setting it even lower (say, 4) and doing exaustive tests will
trigger the bug more often?

In the next release, I've given up on setting the queue depth
entirely, and have it set to the default of 123 (123, I think) - and I
HAVE seen it hang with it set that high too, and more recent
benchmarks show 48 as offering more throughput than 12, and the usual
terrible latency regardless.

 (the right thing here is to rework queue handling entirely to work on
a per-sta basis, incorporate fq_codel like ideas down there, etc, but
we're
a long way from doing that as yet)

> (BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
>
> Note one of the devices was connected to the same radio before and got somehow forced to reconnect and failed to actually do so…
>
> Here is what I saw from log read:
>
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: authenticated
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: associated (aid 2)
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 WPA: pairwise key handshake completed (RSN)
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
>
>
>         I reconnected the 2.4GHz radio from https://gw.home.lan:81/cgi-bin/luci/;stok=64f33ba722ed8a68b13ad5644e60629b/admin/network/network (by hitting the connect button for sw00)
>
> Now I see:
>
> root@nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues
> (VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
>
> So this seems to be “fixed” without having to reboot cerowrt. So we might consider working aroung this bug by checking  /sys/kernel/debug/ieee80211/phy0/ath9k/queues repeatedly and reconnecting the 2.4GHz radio if the queue seems stopped?

I don't know what "reconnecting" actually triggers in the backend uci code.
(anyone? a network reload? what?)

Certainly if we could check for "stuckness" and then do the right
thing that might be a workaround.

>Now I would like to know whether this actually is bug 442 (I seem to recall that other afflicted users needed to reboot cerowrt to get the radio back, or did they simply not try to just get sw00 unstuck in a less drastic manner?

No, most people just hit it with a large hammer and start over,
without letting me get a crack at it.

>And/or maybe the re-connect only fixes some symptoms and the router will wedge layer on for good?) I attached the output of cerostats.sh just in case someone has an idea what to try next...

Damned if I know, but this is progress of a sort.

> Best Regards
>         Sebastian
>
>
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>



-- 
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Cerowrt-devel] [ath9k-devel] periodic hang of ath9k
  2014-07-13 19:18 [Cerowrt-devel] periodic hang of ath9k Dave Taht
@ 2014-07-13 19:44 ` Ben Greear
  2014-07-14  4:25 ` [Cerowrt-devel] " Sujith Manoharan
  2014-07-14 18:51 ` [Cerowrt-devel] " Stephen Hemminger
  2 siblings, 0 replies; 6+ messages in thread
From: Ben Greear @ 2014-07-13 19:44 UTC (permalink / raw)
  To: Dave Taht, Sebastian Moeller; +Cc: ath9k-devel, cerowrt-devel

On 07/13/2014 12:18 PM, Dave Taht wrote:
> cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
>
> this bug, which some people (usually on macs with low signal strength)
> can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
> crazy...

I've mentioned this before..but I have a patch that works around this
problem:

I had merge issues, so now it's two patches until I have time to clean up
my tree:

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=commit;h=579035d4b2b7110d67ff451e728709f5b2510ca0

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=commit;h=ae6f95e8a35578e0668e47becb053b29009eec9a

I suspect this may cause memory leaks, as a test that causes these work-arounds to hit
often will run out of memory after a few days (200+ station vifs constantly doing traffic).
Could be lots of other code of course, and kmemleak shows no leaks, so I'm at a loss on
the mem leak issue.

If you can, try patching these in and see if that works around your tx hang problem.

I'd love to have a better fix, btw.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Cerowrt-devel] periodic hang of ath9k
  2014-07-13 19:18 [Cerowrt-devel] periodic hang of ath9k Dave Taht
  2014-07-13 19:44 ` [Cerowrt-devel] [ath9k-devel] " Ben Greear
@ 2014-07-14  4:25 ` Sujith Manoharan
  2014-07-14 10:48   ` [Cerowrt-devel] [ath9k-devel] " Felix Fietkau
  2014-07-14 18:51 ` [Cerowrt-devel] " Stephen Hemminger
  2 siblings, 1 reply; 6+ messages in thread
From: Sujith Manoharan @ 2014-07-14  4:25 UTC (permalink / raw)
  To: Dave Taht; +Cc: ath9k-devel, cerowrt-devel

Dave Taht wrote:
> cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
> 
> this bug, which some people (usually on macs with low signal strength)
> can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
> crazy...

Does stock OpenWrt also have this bug, or is this specific to Cerowrt ?

Sujith

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Cerowrt-devel] [ath9k-devel]  periodic hang of ath9k
  2014-07-14  4:25 ` [Cerowrt-devel] " Sujith Manoharan
@ 2014-07-14 10:48   ` Felix Fietkau
  0 siblings, 0 replies; 6+ messages in thread
From: Felix Fietkau @ 2014-07-14 10:48 UTC (permalink / raw)
  To: Sujith Manoharan, Dave Taht; +Cc: ath9k-devel, Antonio Quartulli, cerowrt-devel

On 2014-07-14 06:25, Sujith Manoharan wrote:
> Dave Taht wrote:
>> cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
>> 
>> this bug, which some people (usually on macs with low signal strength)
>> can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
>> crazy...
> 
> Does stock OpenWrt also have this bug, or is this specific to Cerowrt ?
After receiving some useful debug output from Antonio Quartulli (who
was able to reproduce it easily), I believe that I have tracked down
this issue to some bugs in counting pending tx frames. When frames get
pushed through the U-APSD queue for PS responses and dropped there due
to retransmissions, the counter probably does not get decremented
properly.

I've come up with an untested patch that should fix this codepath
and make it easier to verify.

If you're affected by the bug, please test this patch:
---
--- a/drivers/net/wireless/ath/ath9k/ath9k.h
+++ b/drivers/net/wireless/ath/ath9k/ath9k.h
@@ -185,7 +185,8 @@ struct ath_atx_ac {
 
 struct ath_frame_info {
 	struct ath_buf *bf;
-	int framelen;
+	u16 framelen;
+	s8 txq;
 	enum ath9k_key_type keytype;
 	u8 keyix;
 	u8 rtscts_rate;
--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -147,15 +147,13 @@ static void ath_set_rates(struct ieee802
 static void ath_txq_skb_done(struct ath_softc *sc, struct ath_txq *txq,
 			     struct sk_buff *skb)
 {
-	int q;
-
-	q = skb_get_queue_mapping(skb);
-	if (txq == sc->tx.uapsdq)
-		txq = sc->tx.txq_map[q];
+	struct ath_frame_info *fi = get_frame_info(skb);
+	int q = fi->txq;
 
-	if (txq != sc->tx.txq_map[q])
+	if (q < 0)
 		return;
 
+	txq = sc->tx.txq_map[q];
 	if (WARN_ON(--txq->pending_frames < 0))
 		txq->pending_frames = 0;
 
@@ -1999,6 +1997,7 @@ static void setup_frame_info(struct ieee
 		an = (struct ath_node *) sta->drv_priv;
 
 	memset(fi, 0, sizeof(*fi));
+	fi->txq = -1;
 	if (hw_key)
 		fi->keyix = hw_key->hw_key_idx;
 	else if (an && ieee80211_is_data(hdr->frame_control) && an->ps_key > 0)
@@ -2150,6 +2149,7 @@ int ath_tx_start(struct ieee80211_hw *hw
 	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
 	struct ieee80211_sta *sta = txctl->sta;
 	struct ieee80211_vif *vif = info->control.vif;
+	struct ath_frame_info *fi = get_frame_info(skb);
 	struct ath_softc *sc = hw->priv;
 	struct ath_txq *txq = txctl->txq;
 	struct ath_atx_tid *tid = NULL;
@@ -2170,11 +2170,13 @@ int ath_tx_start(struct ieee80211_hw *hw
 	q = skb_get_queue_mapping(skb);
 
 	ath_txq_lock(sc, txq);
-	if (txq == sc->tx.txq_map[q] &&
-	    ++txq->pending_frames > sc->tx.txq_max_pending[q] &&
-	    !txq->stopped) {
-		ieee80211_stop_queue(sc->hw, q);
-		txq->stopped = true;
+	if (txq == sc->tx.txq_map[q]) {
+		fi->txq = q;
+		if (++txq->pending_frames > sc->tx.txq_max_pending[q] &&
+		    !txq->stopped) {
+			ieee80211_stop_queue(sc->hw, q);
+			txq->stopped = true;
+		}
 	}
 
 	if (txctl->an && ieee80211_is_data_present(hdr->frame_control))

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Cerowrt-devel] periodic hang of ath9k
  2014-07-13 19:18 [Cerowrt-devel] periodic hang of ath9k Dave Taht
  2014-07-13 19:44 ` [Cerowrt-devel] [ath9k-devel] " Ben Greear
  2014-07-14  4:25 ` [Cerowrt-devel] " Sujith Manoharan
@ 2014-07-14 18:51 ` Stephen Hemminger
  2014-07-14 23:02   ` R.
  2 siblings, 1 reply; 6+ messages in thread
From: Stephen Hemminger @ 2014-07-14 18:51 UTC (permalink / raw)
  To: Dave Taht; +Cc: ath9k-devel, cerowrt-devel

I think the stock netgear firmware has similar issues, 2G wireless is
flaky when the Mac's are using it.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Cerowrt-devel] periodic hang of ath9k
  2014-07-14 18:51 ` [Cerowrt-devel] " Stephen Hemminger
@ 2014-07-14 23:02   ` R.
  0 siblings, 0 replies; 6+ messages in thread
From: R. @ 2014-07-14 23:02 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: cerowrt-devel

Hello David & list,

Do you think we could get a build with CONFIG_PACKAGE_ATH_DEBUG
enabled? I've been following OpenWRT ticket #15320 and it looks like
this would be the way to go in order to get field logs of WiFi issues
from end-users.

In the meantime, here's the latest WiFi failures on my end (running
CeroWRT 3.10.44-6) -- recovered within two minutes:

[14378.023437] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15130.140625] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15349.164062] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15349.179687] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024
AR_DIAG_SW=0x42000020 DMADBG_7=0x000084c0
[15349.191406] ath: phy0: Could not stop RX, we could be confusing the
DMA engine when we start RX up
[16886.886718] ath: phy0: Failed to stop TX DMA, queues=0x004!
[19839.468750] ath: phy0: Failed to stop TX DMA, queues=0x005!
[20286.019531] ath: phy0: Failed to stop TX DMA, queues=0x004!
[20825.996093] ath: phy0: Failed to stop TX DMA, queues=0x005!
[48749.316406] ath: phy0: Failed to stop TX DMA, queues=0x004!
[48749.433593] ath: phy0: Failed to stop TX DMA, queues=0x004!
-------------
root@cerowrt:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
    Baseband Hang:  0
Baseband Watchdog:  0
   Fatal HW Error:  0
      TX HW error:  0
 Transmit timeout:  0
     TX Path Hang:  1
      PLL RX Hang:  0
         MAC Hang: 13
     Stuck Beacon:  7
        MCI Reset:  0
Calibration error:  1
-------------
Reply from 74.125.225.112: bytes=32 time=39ms TTL=50
Reply from 74.125.225.112: bytes=32 time=114ms TTL=50
Reply from 74.125.225.112: bytes=32 time=42ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Reply from 74.125.225.112: bytes=32 time=40ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Reply from 74.125.225.112: bytes=32 time=1323ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 74.125.225.112: bytes=32 time=258ms TTL=50
Request timed out.
Reply from 74.125.225.112: bytes=32 time=1043ms TTL=50
Request timed out.
Reply from 74.125.225.112: bytes=32 time=206ms TTL=50
Reply from 74.125.225.112: bytes=32 time=91ms TTL=50
Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
Reply from 74.125.225.112: bytes=32 time=40ms TTL=50

On Mon, Jul 14, 2014 at 2:51 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> I think the stock netgear firmware has similar issues, 2G wireless is
> flaky when the Mac's are using it.
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2014-07-14 23:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-13 19:18 [Cerowrt-devel] periodic hang of ath9k Dave Taht
2014-07-13 19:44 ` [Cerowrt-devel] [ath9k-devel] " Ben Greear
2014-07-14  4:25 ` [Cerowrt-devel] " Sujith Manoharan
2014-07-14 10:48   ` [Cerowrt-devel] [ath9k-devel] " Felix Fietkau
2014-07-14 18:51 ` [Cerowrt-devel] " Stephen Hemminger
2014-07-14 23:02   ` R.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox