From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dave.taht@gmail.com>
Received: from mail-ob0-x231.google.com (mail-ob0-x231.google.com
	[IPv6:2607:f8b0:4003:c01::231])
	(using TLSv1 with cipher RC4-SHA (128/128 bits))
	(Client CN "smtp.gmail.com",
	Issuer "Google Internet Authority G2" (verified OK))
	by huchra.bufferbloat.net (Postfix) with ESMTPS id 4F2F121F23E
	for <cerowrt-devel@lists.bufferbloat.net>;
	Sun, 13 Jul 2014 12:18:47 -0700 (PDT)
Received: by mail-ob0-f177.google.com with SMTP id wp18so3204630obc.22
	for <cerowrt-devel@lists.bufferbloat.net>;
	Sun, 13 Jul 2014 12:18:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:date:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=hvgD4rv/REfGEBC3dATDi83rXHYbN+4x9qBYxIJpgQE=;
	b=dAFS/7tvQBvx4/+5CR/dPCnuPPh44LkDdihBB5QvHTxqei570CzSHe4JnqM42ctaYe
	rtQGFvqYcKx2TOm7L+9P3CtZ7rJFqqHPjZcd95kGF40XVNJvqVP8BCepzFqVTYJDr4c5
	dJI/O37G438+ROQZEqDD/IJOOzL2toJJZpzHKiCByec/Fe85shZPmlHV10D3V9tvUbqP
	n1rGHlHPIM5ezjvKM2kmxkAcuJ5TqECNwjc1fw4YpFHm7/gAnpKfDnGmzR/gvdHkVFHH
	txwO4IZfh4uoZsi0rG68vkaBPP/4Q7pkO9kOSjqKM1ug4eqBQNc2UOZuXhhAhXHzHwvq
	/a6g==
MIME-Version: 1.0
X-Received: by 10.182.138.103 with SMTP id qp7mr13381273obb.56.1405279126201; 
	Sun, 13 Jul 2014 12:18:46 -0700 (PDT)
Received: by 10.202.93.195 with HTTP; Sun, 13 Jul 2014 12:18:46 -0700 (PDT)
Date: Sun, 13 Jul 2014 12:18:46 -0700
Message-ID: <CAA93jw7DyKexOwqFBnT_1Z1REXi0UV8JtzPMC7WmQUGOx_+UXg@mail.gmail.com>
From: Dave Taht <dave.taht@gmail.com>
To: Sebastian Moeller <moeller0@gmx.de>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: "ath9k-devel@lists.ath9k.org" <ath9k-devel@lists.ath9k.org>,
	cerowrt-devel <cerowrt-devel@lists.bufferbloat.net>
Subject: [Cerowrt-devel] periodic hang of ath9k
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: Development issues regarding the cerowrt test router project
	<cerowrt-devel.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/cerowrt-devel>,
	<mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/cerowrt-devel>
List-Post: <mailto:cerowrt-devel@lists.bufferbloat.net>
List-Help: <mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/cerowrt-devel>,
	<mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Sun, 13 Jul 2014 19:18:47 -0000

cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442

this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...

some new details below

On Sun, Jul 13, 2014 at 10:44 AM, Sebastian Moeller <moeller0@gmx.de> wrote=
:
> Hi List, hi Dave,
>
> I just had a case of devices on the 2.4GHz radio not connecting anymore (=
the 5GHz radio still worked).
>
> This output was stable while the devices failed to obtain IP addresses:
> root@nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/=
phy0/ath9k/queues
> (VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:  13 stopped: 1

Yes, that is the one sure symptom of something like bug 442. As the
default in cerowrt is a queue depth of 12 for BE, a pending count of 13
on a stopped queue kind of implies we have some sort of off-by-one or
atomic update/race condition problem somewhere.

Maybe setting it even lower (say, 4) and doing exhaustive tests will
trigger the bug more often?

In the next release, I've given up on setting the queue depth
entirely, and have left it at the default (123, I think). I HAVE seen
it hang with it set that high too. More recent benchmarks show 48 as
offering more throughput than 12, with the usual terrible latency
regardless.

(the right thing here is to rework queue handling entirely to work on
a per-sta basis, incorporate fq_codel-like ideas down there, etc., but
we're a long way from doing that as yet)

> (BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
>
> Note one of the devices was connected to the same radio before and got so=
mehow forced to reconnect and failed to actually do so=E2=80=A6
>
> Here is what I saw from log read:
>
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48=
 IEEE 802.11: authenticated
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48=
 IEEE 802.11: associated (aid 2)
> Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48=
 WPA: pairwise key handshake completed (RSN)
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00=
) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 19=
2.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00=
) 192.168.2.107 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 19=
2.168.2.107 10:68:3f:4b:0b:48 wrong address
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw0=
0) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) =
172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw0=
0) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) =
172.30.42.90 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw0=
0) 10:68:3f:4b:0b:48
> Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) =
172.30.42.90 10:68:3f:4b:0b:48
>
>
>         I reconnected the 2.4GHz radio from https://gw.home.lan:81/cgi-bi=
n/luci/;stok=3D64f33ba722ed8a68b13ad5644e60629b/admin/network/network (by h=
itting the connect button for sw00)
>
> Now I see:
>
> root@nacktmulle:/usr/lib/CeroWrtScripts# cat /sys/kernel/debug/ieee80211/=
phy0/ath9k/queues
> (VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
> (CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
>
> So this seems to be =E2=80=9Cfixed=E2=80=9D without having to reboot cero=
wrt. So we might consider working around this bug by checking  /sys/kernel/=
debug/ieee80211/phy0/ath9k/queues repeatedly and reconnecting the 2.4GHz ra=
dio if the queue seems stopped?

I don't know what "reconnecting" actually triggers in the backend uci code.
(anyone? a network reload? what?)

Certainly if we could check for "stuckness" and then do the right
thing that might be a workaround.
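
A rough sketch of what such a "stuckness" check could look like, in
Python for readability (it would probably want porting to shell on the
router). It assumes the queues file keeps the format quoted above, and
it treats a queue as stuck when it is stopped with a nonzero pending
count that does not change across several samples. That definition,
the function names, and the sample count and interval are my
assumptions, not anything the driver documents. What to do once a
queue is flagged is left out on purpose, since it is still unclear
what "reconnecting" does.

```python
import re
import time

# One line of the ath9k debugfs "queues" file as quoted in this thread, e.g.
# "(BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:  13 stopped: 1"
QUEUE_LINE = re.compile(
    r"\((?P<name>\w+)\):\s*qnum:\s*(?P<qnum>\d+)\s+qdepth:\s*(?P<qdepth>\d+)\s+"
    r"ampdu-depth:\s*(?P<ampdu>\d+)\s+pending:\s*(?P<pending>\d+)\s+"
    r"stopped:\s*(?P<stopped>\d+)"
)


def parse_queues(text):
    """Return {queue name: {field: int}} for every line that matches."""
    queues = {}
    for line in text.splitlines():
        match = QUEUE_LINE.search(line)
        if match:
            fields = match.groupdict()
            name = fields.pop("name")
            queues[name] = {key: int(value) for key, value in fields.items()}
    return queues


def stuck_queues(samples):
    """Names of queues that look stuck across all samples.

    Assumed definition: stopped == 1, pending > 0, and every counter
    identical in each sample. A single sample is never enough.
    """
    if len(samples) < 2:
        return []
    parsed = [parse_queues(sample) for sample in samples]
    stuck = []
    for name, first in parsed[0].items():
        if first["stopped"] != 1 or first["pending"] == 0:
            continue
        if all(later.get(name) == first for later in parsed[1:]):
            stuck.append(name)
    return stuck


def read_samples(path, count=3, interval=5.0):
    """Read the queues file `count` times, `interval` seconds apart."""
    samples = []
    for i in range(count):
        with open(path) as handle:
            samples.append(handle.read())
        if i < count - 1:
            time.sleep(interval)
    return samples
```

On the router this would be fed with something like
stuck_queues(read_samples("/sys/kernel/debug/ieee80211/phy0/ath9k/queues")).
Requiring several identical samples is meant to avoid flagging a queue
that is only briefly stopped under load.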

>Now I would like to know whether this actually is bug 442 (I seem to recal=
l that other afflicted users needed to reboot cerowrt to get the radio back=
, or did they simply not try to just get sw00 unstuck in a less drastic man=
ner?

No, most people just hit it with a large hammer and start over,
without letting me get a crack at it.

>And/or maybe the re-connect only fixes some symptoms and the router will w=
edge later on for good?) I attached the output of cerostats.sh just in case=
 someone has an idea what to try next...

Damned if I know, but this is progress of a sort.

> Best Regards
>         Sebastian
>
>
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>


--=20
Dave T=C3=A4ht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_=
indecent.article