Development issues regarding the cerowrt test router project
* [Cerowrt-devel] cerowrt-3.10.44-6 report
@ 2014-07-09 21:44 Dave Taht
From: Dave Taht @ 2014-07-09 21:44 UTC
  To: cerowrt-devel

I have been pounding several cerowrt boxes utterly flat for 13 days now.

root@davedesk:~# uptime
 20:54:29 up 13 days, 13:28,  load average: 0.04, 0.04, 0.04

(Well, pounded flat up until I stopped to write this email.)

Aside from one kernel trap (see bug #442), it has stayed up on wifi
reliably through tens of thousands of tests... for me.

Along the way I have collected gigabytes of useless packet captures;
crashed every serial dongle I own, the 802.11ac AP I'm working on,
Windows multiple times, and Linux on a pair of laptops; reduced
multiple BeagleBones with Edimax 802.11ac dongles to gibbering; and
crashed hysteria so badly that it was unrecoverable even over a USB
serial connection and needed a reflash.

But never cerowrt. So I'm happy about that, and depressed about the
sad state of wifi on *everything else*. (The rtl8812au driver has to
be seen to be believed.) I've wasted some time trying to sort through
that, and I'm glad that the ath10k driver has been getting some
serious love of late....

But I've had several reports that cerowrt 3.10.44-6 fails for others
with bug 442-like symptoms. The near-constants in all reports seem to
be "OS X" and "poor signal strength", so over the weekend I finally
got a MacBook with 802.11ac to drive more tests. This brings the
number of stations I have up to 1 Windows, 1 Mac, 2 Linux laptops,
6 BeagleBones, and a bunch of APs (which I have been connecting to
each other in ad-hoc and station mode) to try to blow up EVERYTHING.

One useful bit of fallout from all this has been the ability to test
multi-station wifi performance, which is predictably horrible. But
given all the other stuff I've been testing simultaneously, the data
is hard to sort through or publish: the goal here is to crash stuff
and do forensics, not science.

I am encouraged by some of the bugs Denton Gentry has been fixing over
on the ath10k mailing list, and wonder if some of the same things
happen on ath9k and other drivers. I'm still pretty convinced the 442
problem is generic to the darn ath9k, but being unable to reproduce
it is a problem.

His methods are extreme, and he keeps finding interesting, subtle
problems with hugely bad side effects, like this one:

http://lists.infradead.org/pipermail/ath10k/2014-July/002606.html

Problems like that, I'm sure, exist in device drivers everywhere.

So, next up for me is to keep testing 802.11ac and 802.11n on the
device I'm getting paid to beat up, while exercising cero as hard as
possible (I'm using it to drive the ac box, as one example). This time
I will be adding impairments to the non-Mac boxes and, if things keep
working, to the Mac as well, while capturing as much as possible.

One decision to make is whether or not to refresh 3.10.44-6 with
OpenWrt head. If I don't do it now, it will be another 14 days before
I stop testing and can refresh, and IETF will be upon me.

I really, really, really, really wanted a stable cerowrt release, and
then to move on. I'd hoped that 3.10.44-6 would have been it. I've
thought about putting out a bug bounty for it, if that would find
someone with the wherewithal to nail this !@#! thing to the floor.

In the interim, I'd like to make clear to everyone that I regard bug
442 as the only thing holding up a general stable release, and there
have been a couple updates to it.

http://www.bufferbloat.net/issues/442

and anything you can do to beat up your boxes, while capturing
traffic, and the failure event(s), will help.

Of HUGE interest is getting raw captures from a wifi monitoring
interface on a regular basis from someone (anyone!) experiencing this
problem, so we can see exactly when and why it happens. Not all wifi
chips support monitor mode, but if you install aircrack-ng, enabling
it is straightforward:

sudo airmon-ng start wlan0

and then Wireshark can see the interface and capture/decode raw wifi
packets, as can tcpdump -I.

I just nuked a bunch of captures and tests to get some disk space
back, and am setting up the full monty again while coaxing this Mac
into having a decent compiler and setup. Actually, I think I'm going
to go get a 2TB disk for the monitor box.

I just stumbled across a history of our 1998-2000 wireless
development that I started writing 4 years ago, and, well, I'd
forgotten the pain of the first 7 months of development and the years
of trouble we still had with it afterwards. I really don't enjoy the
low-level driver stuff.

http://www.teklibre.com/~d/elwr/wireless_2.html (there are also _1,
_3, _4, etc.; it goes up to 9)

According to this,

http://www.teklibre.com/~d/elwr/emails.html

My first documented encounter with the need for aqm and packet
scheduling on wireless was:

Mon, 19 Oct 1998 19:18:09



-- 
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article


* Re: [Cerowrt-devel] cerowrt-3.10.44-6 report
  2014-07-09 21:44 [Cerowrt-devel] cerowrt-3.10.44-6 report Dave Taht
@ 2014-07-09 23:23 ` Michael Richardson
From: Michael Richardson @ 2014-07-09 23:23 UTC
  To: Dave Taht; +Cc: cerowrt-devel



Dave Taht <dave.taht@gmail.com> wrote:
    > I have been pounding several cerowrt boxes utterly flat for 13 days now.

    > root@davedesk:~# uptime
    > 20:54:29 up 13 days, 13:28,  load average: 0.04, 0.04, 0.04 (well,
    > formerly flat prior to this email)

    > Aside from seeing one kernel trap (see bug #442) for it, it's stayed
    > up on wifi, reliably for 10s of thousands of tests... for me.

    > I have - along the way - collected gigabytes of useless packet
    > captures, crashed every serial dongle I own, the 802.11ac ap I'm
    > working on, windows multiple times, and linux on a pair of laptops,
    > and reduced multiple beaglebones with the edimax 802.11ac to
    > gibbering, crashed hysteria unrecoverable even with a usb serial
    > connection, needing a reflash.

    > but never cerowrt. So I'm happy about that, 

YEAH! KUDOS

    > I really, really, really, really wanted a stable cerowrt release, and
    > then to move on. I'd hoped that 3.10.44-6 would have been it. I've
    > thought about putting out a bug bounty for it, if that would find
    > someone with the wherewithal to nail this !@#! thing to the floor.

    > In the interim, I'd like to make clear to everyone that I regard bug
    > 442 as the only thing holding up a general stable release, and there
    > have been a couple updates to it.

I have no OS X at my house... it's all... well, it's all Linux
(Debian/Android/Chrome), with some Cisco phones and switches.
I've never seen a 442-type problem on any release.

My understanding is that power-cycling the cerowrt box fixes the 442 problem?
Lots of "factory ROMs" need to be power-cycled weekly.

So my take is to go forward with the release as it is.

    > http://www.teklibre.com/~d/elwr/emails.html

    > My first documented encounter with the need for aqm and packet
    > scheduling on wireless was:

    > Mon, 19 Oct 1998 19:18:09

ha.

-- 
]               Never tell me the odds!                 | ipv6 mesh networks [ 
]   Michael Richardson, Sandelman Software Works        | network architect  [ 
]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [ 
	



* Re: [Cerowrt-devel] cerowrt-3.10.44-6 report
  2014-07-09 23:23 ` Michael Richardson
@ 2014-09-16 12:23   ` Michael Richardson
From: Michael Richardson @ 2014-09-16 12:23 UTC
  Cc: cerowrt-devel


A while ago I reported problems where my PPPoE link did not come up, and
sometimes not all of the wifi interfaces would come up.  It seemed like
some kind of limit on the number of interfaces, so I started looking at
netifd for some clue as to what was going on, but I didn't see any obvious
struct interface[16] or some such.

Last night, after my wife complained again, I rebooted the 3800 after 60 days
of uptime... I had placed numifX=0 somewhere so that I wouldn't instantiate 8
useless interfaces.  Having done that, I noticed that my two guest wifi
interfaces came up properly, and the PPPoE link came up on its own (I
think; I do have some hacks to force it up).  I still had to poke some
things.

In conversations at IETF90, I think someone said that netifd does something
like:
        (1 << ifindex)
in order to mark something about interfaces...  That would certainly explain
the problem: ifindex can march upwards quite easily on systems with PPP
interfaces coming and going, and once it reaches 32 the bit no longer fits
in a 32-bit mask.  I note that on my freshly booted system, all ifindex
values are < 32.

I haven't located any code in netifd that does this, but haven't looked
that much yet.

What is the plan now that BB (Barrier Breaker) is about to be fully released?
I think all CeroWrt code for the 3800 is now upstream... so it seems that the
next CeroWrt release should simply point at OpenWrt BB?



