[Cerowrt-devel] Fwd: [Bug #442] Possible workaround for the wireless hangs

Dave Taht dave.taht at gmail.com
Wed Apr 9 17:44:53 EDT 2014


See also: http://www.bufferbloat.net/issues/442#note-16

1) It's still uncertain whether we have been dealing with only one
wireless bug...

...but we can narrow down the one jg was seeing: if, after a failure
happens, you can log in on another radio or via ethernet and you see
frames "pending" that stay pending in the "queues" debug file:

root@comcast-gw:~# cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues

(VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   151 stopped: 0
(CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(CAB): qnum: 8 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0

you've hit the bug.
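
A rough way to script that check, as a sketch only: the function names
pending_queues and still_pending are made up for illustration, and the
parsing assumes the queues file keeps the exact layout shown above. It
compares two saved dumps and reports queues that are nonzero in both.
Nonzero in both samples is only a hint, since a busy link can
legitimately have frames pending.

```shell
#!/bin/sh
# Print "<occurrence>:<queue> <count>" for every queue line whose
# "pending:" value is nonzero. Input: an ath9k "queues" dump on stdin.
# <occurrence> counts repeats of the same queue name, so with two radios
# concatenated the first BK line is 1:BK and the second is 2:BK.
pending_queues() {
    awk '
        /pending:/ {
            name = $1
            gsub(/[()]/, "", name)       # "(BK):" -> "BK:"
            sub(/:$/, "", name)          # "BK:"   -> "BK"
            seen[name]++
            for (i = 2; i < NF; i++)
                if ($i == "pending:" && $(i + 1) + 0 > 0)
                    print seen[name] ":" name, $(i + 1)
        }'
}

# Compare two saved dumps (older first) and report every queue that has
# frames pending in both of them.
still_pending() {
    pending_queues < "$1" | while read -r queue count; do
        if pending_queues < "$2" | grep -q "^$queue "; then
            echo "$queue still pending (was $count)"
        fi
    done
}
```

On the router this could be used along the lines of:

    cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues > /tmp/q1
    sleep 10
    cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues > /tmp/q2
    still_pending /tmp/q1 /tmp/q2

The 10 second gap and the /tmp file names are arbitrary choices.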

Nothing short of a reboot will clear it, presently. Felix is looking into it.

In the interim there are two things you can do to make hitting it a
LOT more difficult. At least so far, in 20+ hours of testing, we
haven't hit it again.

A) Stop reducing qlen_be, qlen_bk, qlen_vi, & qlen_vo.

Comment out the wireless_qlen() call at line 1977 of /usr/sbin/debloat:

...

local function wireless(model)
   print(model)
   if WCALLBACKS[model] ~= nil then
      -- wireless_qlen() -- comment out this call
      return WCALLBACKS[model]()
   else
      usage("AQM model not found")
   end
   return nil
end

...

and reboot.
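
If you would rather script the edit than open an editor, here is a
minimal sketch. The function name comment_out_qlen_call is made up, and
it assumes the call sits alone on its line exactly as in the excerpt
above. It comments out every such standalone call in the file, not just
the one inside wireless(), and keeps a copy of the original next to it.

```shell
#!/bin/sh
# Comment out standalone "wireless_qlen()" call lines in a Lua script.
# Lines that merely mention the name (e.g. a function definition) or
# that are already commented do not match and are left alone.
comment_out_qlen_call() {
    file=${1:-/usr/sbin/debloat}
    # Keep a backup, then rewrite the file in place from the backup.
    cp "$file" "$file.orig" &&
    sed 's/^\([[:space:]]*\)wireless_qlen()[[:space:]]*$/\1-- wireless_qlen()/' \
        "$file.orig" > "$file"
}
```

Run it as comment_out_qlen_call with no arguments to edit
/usr/sbin/debloat, then diff the file against the .orig copy before
rebooting.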

This will return the qlens to very large values that are nearly
impossible to hit.

While this will have a negative effect on latency, it will improve
single station bandwidth somewhat, and make it much harder to hang the
queue. (I think/hope)

I will argue - at this point - it is better to have a slower box that
stays up for weeks than one that has core functionality crash after a
few hours or days.

Those of you who have been experiencing the wifi hangs, please make
this change and check in daily.

If anyone has a hang, please post the ath9k queues status as per above,
and tc -s qdisc output, to bug 442.
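
To grab both in one go, something like the following sketch might help.
The function name collect_hang_report and the default output path are
assumptions for illustration. It just writes the two outputs into one
file that can be pasted into the bug.

```shell
#!/bin/sh
# Save the ath9k queue status and the qdisc statistics to one file and
# print the file name. If either command fails, a note is written to
# the report in its place so the failure is visible.
collect_hang_report() {
    out=${1:-/tmp/bug442-report.txt}   # hypothetical default location
    {
        echo "== ath9k queues =="
        cat /sys/kernel/debug/ieee80211/phy*/ath9k/queues 2>&1 ||
            echo "(could not read the ath9k queues debug file)"
        echo "== tc -s qdisc =="
        tc -s qdisc show 2>&1 ||
            echo "(tc failed or is not installed)"
    } > "$out"
    echo "$out"
}
```

Run it over ethernet or the working radio while the hang is still in
effect, since a reboot clears the state.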

B) Mash incoming diffserv traffic down to BE only.

I have some patches almost ready for sqm-scripts for this, partially tested.

I've pushed them to the ceropackages github repository for review and testing.

See the commit log message here:

https://github.com/dtaht/ceropackages-3.10/commit/27eed160a67700caae85a4c8b3fff0eaa990cd27

I am pretty sure only fix "A" is needed to work around the bug here.

B might make bi-directional over-the-internet-through-wifi tests
work better, in that the BE queue is used more often, but both hacks
are in place on the box we're testing.



--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article




