From: Dave Taht
To: "cerowrt-devel@lists.bufferbloat.net"
Date: Wed, 9 Apr 2014 14:44:53 -0700
Subject: [Cerowrt-devel] Fwd: [Bug #442] Possible workaround for the wireless hangs
List-Id: Development issues regarding the cerowrt test router project

See also: http://www.bufferbloat.net/issues/442#note-16

1) It's still uncertain that we have only been dealing with one wireless bug...
...but we can narrow down the one jg was seeing: if, after a failure happens and you can log in on another radio or via ethernet, you see frames "pending" that stay pending in the "queues" debug file:

    root@comcast-gw:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues
    (VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0
    (CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
    (CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0

you've hit the bug.

Nothing short of a reboot will clear it, presently. Felix is looking into it.

In the interim there are two things you can do to make hitting it a LOT more difficult. At least so far, in 20+ hours of testing, we haven't hit it again.

A) Stop reducing qlen_be, qlen_bk, qlen_vi, and qlen_vo.

Comment out line 1977 of /usr/sbin/debloat:

    ...
    local function wireless(model)
       print(model)
       if WCALLBACKS[model] ~= nil then
          -- wireless_qlen() -- comment out this call
          return WCALLBACKS[model]()
       else
          usage("AQM model not found")
       end
       return nil
    end
    ...

and reboot.

This will return the qlens to very large values that are nearly impossible to hit. While this will have a negative effect on latency, it will improve single-station bandwidth somewhat, and make it much harder to hang the queue. (I think/hope)

I will argue, at this point, that it is better to have a slower box that stays up for weeks than one that has core functionality crash after a few hours or days.

Those of you who have been experiencing the wifi hangs, please make this change, and check in daily?
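For anyone checking in daily, the "pending that stays pending" test above can be scripted against two saved copies of the queues output taken some time apart. What follows is only a rough sketch, not part of cerowrt: the function names are made up for illustration, the regular expression assumes the field layout shown in the output above, and treating "nonzero pending in both snapshots" as the sign of a hang is an assumption about what "stays pending" means in practice.

```python
import re

# Matches one queue entry in the layout shown above, e.g.
# "(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0"
# (assumed layout; adjust if your kernel prints the fields differently)
QUEUE_RE = re.compile(
    r"\((\w+)\):\s*qnum:\s*(\d+)\s+qdepth:\s*(\d+)\s+"
    r"ampdu-depth:\s*(\d+)\s+pending:\s*(\d+)\s+stopped:\s*(\d+)"
)


def parse_queues(text):
    """Return [(queue_name, pending), ...] in the order the entries appear."""
    return [(m.group(1), int(m.group(5))) for m in QUEUE_RE.finditer(text)]


def stuck_queues(first, second):
    """Report queues whose pending count is nonzero in both snapshots.

    Entries are matched by position, because `cat phy*/ath9k/queues`
    concatenates the radios and the queue names repeat.
    """
    stuck = []
    pairs = zip(parse_queues(first), parse_queues(second))
    for index, ((name, before), (_, after)) in enumerate(pairs):
        if before > 0 and after > 0:
            stuck.append((index, name, before, after))
    return stuck


if __name__ == "__main__":
    # Hypothetical snapshots; in practice, paste in two saved copies
    # of the debug file taken a minute or so apart.
    earlier = "(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0\n" \
              "(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 151 stopped: 0\n"
    later = earlier
    for index, name, before, after in stuck_queues(earlier, later):
        print(f"entry {index} ({name}): pending {before} -> {after}")
```

Matching by position rather than by name is a deliberate choice here, since the same queue names show up once per radio. The script is meant to run on a desktop against saved output, not on the router itself.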
If anyone has a hang, please post the ath9k queues status as per above, and the tc -s qdisc output, to bug 442.

B) Mash incoming diffserv traffic down to BE only.

I have some patches almost ready for sqm-scripts for this, partially tested. I've pushed them to the ceropackages github repository for review and testing. See the commit log message here:

https://github.com/dtaht/ceropackages-3.10/commit/27eed160a67700caae85a4c8b3fff0eaa990cd27

I am pretty sure only fix "A" is needed to work around the bug here. B might make bi-directional over-the-internet-through-wifi tests work better, in that the BE queue is used more often, but both hacks are in place on the box we're testing.

-- 
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article