[Cerowrt-devel] Fwd: 3.3.6-2

Development issues regarding the cerowrt test router project

* [Cerowrt-devel] Fwd:  3.3.6-2
       [not found] <00404BC8-3761-409D-A1C8-9213D7D9A3DF@gmx.de>
@ 2012-05-24  3:48 ` Sebastian Moeller
  2012-05-24 15:44   ` Robert Bradley
  0 siblings, 1 reply; 16+ messages in thread
From: Sebastian Moeller @ 2012-05-24  3:48 UTC
  To: <cerowrt-devel@lists.bufferbloat.net>; +Cc: codel

Dear All,

since Dave asked me to post out in the open here it goes:


Hi Dave,

hope you have had a great weekend, Maker Faire sounds sweet. And vacation sounds even better; I hope you have/had a great time there.
	I managed to give 3.3.6-2 a small test drive by now on my wndr3700v2. I have some observations I would like to report, just to document them. All my tests use a single 5GHz wireless client (running Mac OS X 10.7.4) going to test sites on the internet over 30/4 Mbit/s cable internet.
A) Under moderate wireless stress I get a lot of allocation failures from slub, like:
[ 1221.664062] ath: skbuff alloc of size 1926 failed
in the router's dmesg. And every now and then the router crashes and reboots (I have not yet found a way to make this happen reliably; it seems to require some uptime).
	It seems that the UDP probes used by http://loki10.mpi-sws.mpg.de/bb/bb.php (short "bb") are quite likely to produce those skbuff failures and also occasionally cause the crashes. If I understand correctly, this tool explicitly tries to overload the bottleneck link with UDP packets so it can estimate the worst-case buffering. Alas, the tool is rate limited to a few invocations per day, so I cannot really test this hypothesis in any meaningful way. Interestingly, both bb and netalyzr start out reporting about 3 seconds of upstream buffering on a freshly booted router, which changes to around 38 ms of upstream buffering over the course of a day. And after that the router is prone to actually reboot during a run of the 20 Mbit alternative option of the bb test. Yet concurrent interactive sessions stay responsive no matter whether the reported queue is 3000 or 38 ms, so fq_codel is pure magic.
	My layman's hypothesis is that the UDP stream somehow reveals a bug in the Atheros wireless driver that occasionally takes down the router. So this might be a different aspect of bug 379?
	I will try to understand netperf better and set up a UDP stream to see whether I can force the router to reboot reliably; as it is, all I can report is a spurious reboot. Once I have a robust reproducer I will see whether I can make a recent OpenWrt snapshot crash the same way.
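As a rough illustration of the kind of UDP stream meant here, the following is a minimal sketch using plain Python sockets on the wireless client. It is not netperf and not what the bb tool does internally; the function names, the payload size and the pacing scheme are assumptions made for the example.

```python
import socket
import time


def datagram_interval(payload_size, rate_bps):
    """Seconds between datagrams so that payload bytes flow at rate_bps."""
    return payload_size * 8 / rate_bps


def send_udp_stream(host, port, rate_bps, duration_s, payload_size=1400):
    """Send fixed-size UDP datagrams at roughly rate_bps for duration_s seconds.

    Returns the number of datagrams handed to the kernel. The pacing is
    best effort: sleep granularity limits how precise the rate can be.
    """
    payload = b"\x00" * payload_size
    interval = datagram_interval(payload_size, rate_bps)
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        start = time.monotonic()
        next_send = start
        while True:
            now = time.monotonic()
            if now - start >= duration_s:
                break
            if now < next_send:
                time.sleep(next_send - now)
            sock.sendto(payload, (host, port))
            sent += 1
            next_send += interval
    return sent


# Hypothetical usage: offer more than the 4 Mbit/s uplink for one minute to a
# host you control (the address and port below are placeholders, not real targets):
# send_udp_stream("192.0.2.1", 9, rate_bps=6_000_000, duration_s=60)
```

Sending somewhat above the uplink rate is what would fill the bottleneck queue; whether that is enough to trigger the allocation failures is exactly the open question.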

	P.S.: I am currently reading up a bit on IPv6 and home security and it seems things are more complicated than I had hoped...

Best
	Sebastian


P.P.S.: I am still wrangling the netperf sources to hopefully be able to reproduce the issue (and test the UDP hypothesis).

best
	Sebastian


> 
> On May 14, 2012, at 1:59 PM, Dave Taht wrote:
> 
>> A test release of CeroWrt is now available that has support for Kathie
>> Nichols' and Van Jacobson's new AQM, Codel, and Eric Dumazet's new
>> fair queuing implementation on top of that, fq_codel.
>> 
>> fq_codel is enabled on all interfaces by default. It is vastly simpler
>> than what we were using before (sfqred) and draws upon and improves on
>> the same body of ideas (head drop, fq, timestamping) but is now tied
>> to Kathie and Van's blinding insights as to a good drop strategy, and
>> Eric's successor-to-sfqred ideas as towards head of queue behavior,
>> modern amounts of flows, and cache line optimizations.
>> 
>> There is a simple_qos.sh script that can be set to your uplink and
>> downlink speeds, but no uci interface for it as yet, nor gui. (help on
>> finishing aqm-scripts and the luci interface gladly accepted)
>> 
>> To see all the chocolately goodness of what fq_codel can do to wired
>> and wireless latency, it would be good for more to play with it.
>> 
>> Benchmarks have been very good thus far, and more benchmarks and
>> analysis are highly desired.
>> 
>> Caveat:
>> 
>> This release suffers from an unrelated bug ( #379 ) and should NOT be
>> installed as your main router. I would love to beat this bug because
>> it's the only prio 1 remaining but thus far, no luck. Under lighter
>> loads CeroWrt appears to work just fine, but that's for me. YMMV.
>> 
>> Get it here: http://huchra.bufferbloat.net/~cero1/3.3/3.3.6-2/
>> 
>> 
>> -- 
>> Dave Täht
>> SKYPE: davetaht
>> US Tel: 1-239-829-5608
>> http://www.bufferbloat.net
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
> 



* Re: [Cerowrt-devel] Fwd:  3.3.6-2
  2012-05-24  3:48 ` [Cerowrt-devel] Fwd: 3.3.6-2 Sebastian Moeller
@ 2012-05-24 15:44   ` Robert Bradley
  2012-05-24 16:18     ` Sebastian Moeller
  0 siblings, 1 reply; 16+ messages in thread
From: Robert Bradley @ 2012-05-24 15:44 UTC
  To: cerowrt-devel

On 24/05/12 04:48, Sebastian Moeller wrote:
> A) under moderate wireless stress I get a lot of allocation failures from slub, like:
> [ 1221.664062] ath: skbuff alloc of size 1926 failed
> In the routers dmesg. And every now and then the router crashes and reboots (I have not yet found a way to make this happen reliably, it seems to require some uptime)

This looks to me like a possible memory leak somewhere, but I'm no 
expert.  (Unless cerowrt is using tmpfs and filling up memory with logs, 
of course.)  Is UDP from the wired side to the Internet also OK?  I'm 
assuming it is, but it would be nice to prove that it is actually a leak 
in ath9k and/or the wireless stack first!


* Re: [Cerowrt-devel] Fwd:  3.3.6-2
  2012-05-24 15:44   ` Robert Bradley
@ 2012-05-24 16:18     ` Sebastian Moeller
  2012-05-24 16:32       ` Jim Gettys
  0 siblings, 1 reply; 16+ messages in thread
From: Sebastian Moeller @ 2012-05-24 16:18 UTC
  To: Robert Bradley; +Cc: cerowrt-devel

Hi Robert,

On May 24, 2012, at 8:44 AM, Robert Bradley wrote:

> On 24/05/12 04:48, Sebastian Moeller wrote:
>> A) under moderate wireless stress I get a lot of allocation failures from slub, like:
>> [ 1221.664062] ath: skbuff alloc of size 1926 failed
>> In the routers dmesg. And every now and then the router crashes and reboots (I have not yet found a way to make this happen reliably, it seems to require some uptime)
> 
> This looks to me like a possible memory leak somewhere, but I'm no expert.

	Not being an expert I concur.

>  (Unless cerowrt is using tmpfs and filling up memory with logs, of course.)  

	I tried to check that, but since I cannot yet reproduce the crashes easily I have not been able to test that hypothesis. When I checked "df -h" on the router there was always some room left, but for all I know it might be the log entries for the allocation failures that quickly eat up all the remaining memory. I will try to test this hypothesis. So far I have checked dmesg and free in rapid succession during the test runs that are prone to cause the crash; free memory fluctuates somewhat, but I never saw it reach 0 just before crashing.
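Checking by hand makes it easy to miss the last reading before a reboot. A minimal sketch of turning /proc/meminfo-style text into one timestamped line per sample follows; how the text gets off the router (for example over ssh from the client) and the function names are assumptions of the example, not something described in this thread.

```python
def parse_meminfo(text):
    """Return {field: kB} for lines shaped like 'MemFree:  1808 kB'."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip lines without a colon or without a leading integer value.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def format_sample(timestamp, meminfo_text):
    """One log line per sample: timestamp, MemFree and Buffers in kB.

    A missing field is reported as -1 so gaps are visible in the log.
    """
    info = parse_meminfo(meminfo_text)
    return "%s MemFree=%d Buffers=%d" % (
        timestamp, info.get("MemFree", -1), info.get("Buffers", -1))
```

Appending each sample to a file on the client rather than on the router would keep the record across a reboot and would not add to /tmp.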

> Is UDP from the wired side to the Internet also OK?  I'm assuming it is, but it would be nice to prove that it is actually a leak in ath9k and/or the wireless stack first!

	Actually I have not tested that yet (again, with the crash somewhat hard to reproduce, I will have to take the wireless out of use for 24 to 48 hours to be reasonably sure that the issue does not occur under wired connections). That said, I will go and work on that. So I have my testing work charted out and will post again once I have more data.

Best
	Sebastian





* Re: [Cerowrt-devel] Fwd:  3.3.6-2
  2012-05-24 16:18     ` Sebastian Moeller
@ 2012-05-24 16:32       ` Jim Gettys
  2012-05-24 18:12         ` Sebastian Moeller
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Gettys @ 2012-05-24 16:32 UTC
  To: Sebastian Moeller; +Cc: cerowrt-devel

On 05/24/2012 12:18 PM, Sebastian Moeller wrote:
> Hi Robert,
>
> On May 24, 2012, at 8:44 AM, Robert Bradley wrote:
>
>> On 24/05/12 04:48, Sebastian Moeller wrote:
>>> A) under moderate wireless stress I get a lot of allocation failures from slub, like:
>>> [ 1221.664062] ath: skbuff alloc of size 1926 failed
>>> In the routers dmesg. And every now and then the router crashes and reboots (I have not yet found a way to make this happen reliably, it seems to require some uptime)
>> This looks to me like a possible memory leak somewhere, but I'm no expert.
> 	Not being an expert I concur.

My router's /tmp/log/babeld.log had grown to almost 256k (and my router
had been flaky).

So I suspect that's making grim trouble, as /tmp is a tmpfs, i.e. it comes
out of RAM.
-rw-r--r--    1 root     root        247936 May 24 12:27 babeld.log

Tail on the babeld file had:

Couldn't determine channel of interface gw00: Invalid argument.
Couldn't determine channel of interface gw10: Invalid argument.
Couldn't determine channel of interface gw00: Invalid argument.
Couldn't determine channel of interface gw10: Invalid argument.
Couldn't determine channel of interface gw00: Invalid argument.
Couldn't determine channel of interface gw10: Invalid argument.
Couldn't determine channel of interface gw00: Invalid argument.
Couldn't determine channel of interface gw10: Invalid argument.
Couldn't determine channel of interface gw00: Invalid argument.
Couldn't determine channel of interface gw10: Invalid argument.
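Since the tail is the same two messages over and over, a per-message count is more telling than the raw lines. A small sketch, with a hypothetical helper name, that collapses such a log into counts:

```python
from collections import Counter


def summarize_log(lines):
    """Count identical non-empty log lines, most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


# Hypothetical usage on a saved copy of the log:
# with open("babeld.log") as f:
#     for message, count in summarize_log(f):
#         print(count, message)
```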

I should probably have grabbed a copy before nuking the file.  /me bad....

Will put into redmine...

                          - Jim


* Re: [Cerowrt-devel] Fwd:  3.3.6-2
  2012-05-24 16:32       ` Jim Gettys
@ 2012-05-24 18:12         ` Sebastian Moeller
  2012-05-24 18:15           ` Jim Gettys
  0 siblings, 1 reply; 16+ messages in thread
From: Sebastian Moeller @ 2012-05-24 18:12 UTC
  To: Jim Gettys; +Cc: cerowrt-devel

Hi Jim,

good point, I will go and see whether that is the cause for my crashes… Will return to this post if/when I have new data in either direction…

best
	Sebastian




* Re: [Cerowrt-devel] Fwd:  3.3.6-2
  2012-05-24 18:12         ` Sebastian Moeller
@ 2012-05-24 18:15           ` Jim Gettys
  2012-05-24 18:58             ` Robert Bradley
  2012-05-25  0:04             ` [Cerowrt-devel] Fwd: 3.3.6-2 Sebastian Moeller
  0 siblings, 2 replies; 16+ messages in thread
From: Jim Gettys @ 2012-05-24 18:15 UTC
  To: Sebastian Moeller; +Cc: cerowrt-devel

On 05/24/2012 02:12 PM, Sebastian Moeller wrote:
> Hi Jim,
>
> good point, I will go and see whether that is the cause for my crashes… Will return to this post if/when I have new data in either direction…

If you do, see if you can grab the babeld.conf file and add it to:
https://www.bufferbloat.net/issues/392


* Re: [Cerowrt-devel] Fwd:  3.3.6-2
  2012-05-24 18:15           ` Jim Gettys
@ 2012-05-24 18:58             ` Robert Bradley
  2012-05-25  6:41               ` [Cerowrt-devel] 3.3.6-2 Sebastian Moeller
  2012-05-25  0:04             ` [Cerowrt-devel] Fwd: 3.3.6-2 Sebastian Moeller
  1 sibling, 1 reply; 16+ messages in thread
From: Robert Bradley @ 2012-05-24 18:58 UTC
  To: cerowrt-devel

On 24/05/12 19:15, Jim Gettys wrote:
> On 05/24/2012 02:12 PM, Sebastian Moeller wrote:
>> Hi Jim,
>>
>> good point, I will go and see whether that is the cause for my crashes… Will return to this post if/when I have new data in either direction…
> If you do, see if you can grab the babeld.conf file and add it to:
> https://www.bufferbloat.net/issues/392
>

I don't know if it helps at all, but it looks like Babel's failing to 
obtain channel information for the guest interfaces (gw00 and gw10).  
Are these disabled on your routers at the moment?  I suppose in the 
worst case you could try setting an explicit channel for both of the 
non-mesh guest interfaces and see if the logs clear up (or somehow pass 
"-L /dev/null" to babeld).

I'm assuming the ad-hoc mesh links are working fine, since gw01/gw11 
aren't present in the log fragment.
-- 
Robert Bradley


* Re: [Cerowrt-devel] Fwd:  3.3.6-2
  2012-05-24 18:15           ` Jim Gettys
  2012-05-24 18:58             ` Robert Bradley
@ 2012-05-25  0:04             ` Sebastian Moeller
  1 sibling, 0 replies; 16+ messages in thread
From: Sebastian Moeller @ 2012-05-25  0:04 UTC
  To: Jim Gettys; +Cc: cerowrt-devel

Hi Jim,



On May 24, 2012, at 11:15 AM, Jim Gettys wrote:

> On 05/24/2012 02:12 PM, Sebastian Moeller wrote:
>> Hi Jim,
>> 
>> good point, I will go and see whether that is the cause for my crashes… Will return to this post if/when I have new data in either direction…
> 
> If you do, see if you can grab the babeld.conf file and add it to:
> https://www.bufferbloat.net/issues/392

	Done, attached to your issue.
	Turns out my babeld.log has grown to a similar size over 16:38 hours of uptime. But:
root@nacktmulle:~# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                    5.8M    940.0K      4.9M  16% /
/dev/root                 8.8M      8.8M         0 100% /rom
tmpfs                    30.1M    688.0K     29.4M   2% /tmp
tmpfs                   512.0K         0    512.0K   0% /dev
/dev/mtdblock4            5.8M    940.0K      4.9M  16% /overlay
overlayfs:/overlay        5.8M    940.0K      4.9M  16% /

root@nacktmulle:~# free
             total         used         free       shared      buffers
Mem:         61676        59868         1808            0         6388
-/+ buffers:              53480         8196
Swap:            0            0            0

(No allocation failure logged yet)
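For readers unfamiliar with this output, the "-/+ buffers" row is derived from the "Mem:" row. The sketch below redoes that arithmetic on the line posted above; the function names are made up for the example, and the five-column layout is assumed to match what this router's free prints.

```python
def parse_free_mem_line(line):
    """Parse the 'Mem:' row of the `free` output shown above (values in kB)."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != "Mem:":
        raise ValueError("expected a 'Mem:' row with five numeric columns")
    total, used, free, shared, buffers = (int(f) for f in fields[1:6])
    return {"total": total, "used": used, "free": free,
            "shared": shared, "buffers": buffers}


def without_buffers(mem):
    """Recompute the '-/+ buffers' row: buffers count as reclaimable."""
    return {"used": mem["used"] - mem["buffers"],
            "free": mem["free"] + mem["buffers"]}


mem = parse_free_mem_line(
    "Mem:         61676        59868         1808            0         6388")
print(without_buffers(mem))
```

So only 1808 kB is strictly free, and 8196 kB (about 13% of the 61676 kB total) is available once buffers are counted as reclaimable.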

Best
	Sebastian



* Re: [Cerowrt-devel] 3.3.6-2
  2012-05-24 18:58             ` Robert Bradley
@ 2012-05-25  6:41               ` Sebastian Moeller
  2012-05-25  7:02                 ` Dave Taht
  2012-05-25 11:11                 ` Robert Bradley
  0 siblings, 2 replies; 16+ messages in thread
From: Sebastian Moeller @ 2012-05-25  6:41 UTC
  To: Robert Bradley; +Cc: cerowrt-devel

Hi Robert,

since I see the same log file on my router as Jim, I just want to report my observations below.

On May 24, 2012, at 11:58 AM, Robert Bradley wrote:

> On 24/05/12 19:15, Jim Gettys wrote:
>> On 05/24/2012 02:12 PM, Sebastian Moeller wrote:
>>> Hi Jim,
>>> 
>>> good point, I will go and see whether that is the cause for my crashes… Will return to this post if/when I have new data in either direction…
>> If you do, see if you can grab the babeld.conf file and add it to:
>> https://www.bufferbloat.net/issues/392
>> 
> 
> I don't know if it helps at all, but it looks like Babel's failing to obtain channel information for the guest interfaces (gw00 and gw10).  

	True, in my case I had set the 2.4GHz radio to auto channel select, which does not seem to work well with either babeld or its specific configuration.

> Are these disabled on your routers at the moment?  I suppose in the worst case you could try setting an explicit channel for both of the non-mesh guest interfaces and see if the logs clear up (or somehow pass "-L /dev/null" to babeld).

	After setting the 2.4GHz channel to 1 instead of auto, /tmp/babeld.log still grows with the same entries. And on a WNDR3700v2 there are 30840 KB of tmpfs on /tmp, so a babeld.log size of 256 KB should not by itself cause the router to crash. That said, while testing this hypothesis by filling most of /tmp (dd if=/dev/zero of=/tmp/delete_me bs=1024 count=30000, so that around 340KB stayed free) the router reliably went first into OOM and then rebooted itself. Might it be that the /tmp filesystem is too large if it is actually used? If I naively add the VSZs of most processes I end up at around 90% of available memory, so in the worst case there only seems to be room for a much smaller /tmp than 30MB. Maybe restricting /tmp to 6000 KB might make this problem go away (or hooking up a swap device). Does this reasoning sound sane? Once I figure out how to reduce the size of /tmp I will test this.
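The arithmetic behind this reasoning can be sketched as follows. It uses only numbers already posted in this thread (61676 KB total RAM from the free output, 688 KB currently used in /tmp, the 30000 KB dd test, the 30840 KB tmpfs size and the proposed 6000 KB cap). It deliberately ignores the kernel's own memory use, so it is an upper bound on what is left, not a measurement; the function names are made up for the example.

```python
TOTAL_RAM_KB = 61676   # "total" from the `free` output posted earlier in the thread
TMPFS_SIZE_KB = 30840  # size of /tmp on the WNDR3700v2, as stated above


def ram_left_kb(total_ram_kb, tmpfs_used_kb):
    """RAM not pinned by tmpfs contents (kernel and processes share the rest).

    With no swap configured, tmpfs pages have nowhere else to go, so
    whatever is stored in /tmp stays in RAM.
    """
    return total_ram_kb - tmpfs_used_kb


def report(tmpfs_used_kb):
    left = ram_left_kb(TOTAL_RAM_KB, tmpfs_used_kb)
    return "tmpfs use %5d KB -> %5d KB (%.0f%%) of RAM left" % (
        tmpfs_used_kb, left, 100.0 * left / TOTAL_RAM_KB)


# Current use, the proposed cap, the dd test, and a completely full /tmp.
for used in (688, 6000, 30000, TMPFS_SIZE_KB):
    print(report(used))
```

By this estimate the dd test pins roughly half of the RAM, while a 6000 KB cap would leave about 90% for everything else, which lines up with the naive VSZ sum. tmpfs does accept a size= mount option, so such a cap could presumably be applied at mount or remount time; that has not been tried here.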


> 
> I'm assuming the ad-hoc mesh links are working fine, since gw01/gw11 aren't present in the log fragment.

	In my case I do not know as I never tried to test with a mesh client.

best
	Sebastian

> -- 
> Robert Bradley

* Re: [Cerowrt-devel] 3.3.6-2
  2012-05-25  6:41               ` [Cerowrt-devel] 3.3.6-2 Sebastian Moeller
@ 2012-05-25  7:02                 ` Dave Taht
  2012-05-25 11:11                 ` Robert Bradley
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Taht @ 2012-05-25  7:02 UTC
  To: Sebastian Moeller; +Cc: babel-users, cerowrt-devel

On Fri, May 25, 2012 at 7:41 AM, Sebastian Moeller <moeller0@gmx.de> wrote:
> Hi Robert,
>
> since I see the same log file on my router as Jim, I just want to report my observations below.
>
> On May 24, 2012, at 11:58 AM, Robert Bradley wrote:
>
>> On 24/05/12 19:15, Jim Gettys wrote:
>>> On 05/24/2012 02:12 PM, Sebastian Moeller wrote:
>>>> Hi Jim,
>>>>
>>>> good point, I will go and see whether that is the cause for my crashes… Will return to this post if/when I have new data in either direction…
>>> If you do, see if you can grab the babeld.conf file and add it to:
>>> https://www.bufferbloat.net/issues/392
>>>
>>
>> I don't know if it helps at all, but it looks like Babel's failing to obtain channel information for the guest interfaces (gw00 and gw10).

I am cc'ing the babel-users list. So it appears that the second
interface on a wireless radio does not report channel information
reliably, OR babeld is not getting it on the second interface
for some reason.

...sensing the channel is important so that diversity routing works.

Going back to vacation now.

>
>        True, in my case I had set the 2.4GHz radio to auto channel select, which does not seem to work well with either babeld or its specific configuration.
>
>> Are these disabled on your routers at the moment?  I suppose in the worst case you could try setting an explicit channel for both of the non-mesh guest interfaces and see if the logs clear up (or somehow pass "-L /dev/null" to babeld).
>
>        After setting the 2.4GHz channel to 1 instead of auto, /tmp/babeld.log still grows with the same entries. And on a WNDR3700v2 there are 30840 KB of tmpfs on /tmp, so a babeld.log size of 256 KB should not by itself cause the router to crash. That said, while testing this hypothesis by filling most of /tmp (dd if=/dev/zero of=/tmp/delete_me bs=1024 count=30000, so that around 340KB stayed free) the router reliably went first into OOM and then rebooted itself. Might it be that the /tmp filesystem is too large if it is actually used? If I naively add the VSZs of most processes I end up at around 90% of available memory, so in the worst case there only seems to be room for a much smaller /tmp than 30MB. Maybe restricting /tmp to 6000 KB might make this problem go away (or hooking up a swap device). Does this reasoning sound sane? Once I figure out how to reduce the size of /tmp I will test this.
>
>
>>
>> I'm assuming the ad-hoc mesh links are working fine, since gw01/gw11 aren't present in the log fragment.
>
>        In my case I do not know as I never tried to test with a mesh client.

kill babel if you aren't using it, see what happens.

/etc/init.d/babeld disable
/etc/init.d/babeld stop


>
> best
>        Sebastian
>
>> --
>> Robert Bradley



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://www.bufferbloat.net


* Re: [Cerowrt-devel] 3.3.6-2
  2012-05-25  6:41               ` [Cerowrt-devel] 3.3.6-2 Sebastian Moeller
  2012-05-25  7:02                 ` Dave Taht
@ 2012-05-25 11:11                 ` Robert Bradley
  2012-05-25 18:25                   ` Sebastian Moeller
  1 sibling, 1 reply; 16+ messages in thread
From: Robert Bradley @ 2012-05-25 11:11 UTC (permalink / raw)
  To: Sebastian Moeller; +Cc: cerowrt-devel

On 25 May 2012 07:41, Sebastian Moeller <moeller0@gmx.de> wrote:
>
> Hi Robert,
>
> since I see the same log file on my router as Jim, I just want to report
> my observations below.
>
> On May 24, 2012, at 11:58 AM, Robert Bradley wrote:
<snip>
(re. guest interfaces on wireless)
> > Are these disabled on your routers at the moment?  I suppose in the
> > worst case you could try setting an explicit channel for both of the
> > non-mesh guest interfaces and see if the logs clear up (or somehow pass "-L
> > /dev/null" to babeld).
>
>        After setting the 2.4GHz channel to 1 instead of auto
> /tmp/babeld.log still grows with the same entries. And on a WNDR3700v2 there
> are 30840 KB of tmpfs on /tmp so the babeld.log size of 256KB should not by
> itself cause the router to crash. That said, while testing this hypothesis
> by filling most of /tmp (dd if=/dev/zero of=/tmp/delete_me bs=1024
> count=30000, so that around 340KB stayed free) the router reliably went
> first into OOM and then rebooted itself. Might it be that the size of the
> /tmp filesystem is too large if actually used? If I naively add the VSZs of
> most processes I end up at around 90% of available memory, so worst case
> there actually only seems to be room for a much smaller /tmp than 30MB.
> Maybe restricting /tmp to 6000 KB might make this problem go away (or
> hooking up a swap device). Does this reasoning sound sane? Once I figure out
> how to reduce the size of /tmp I will test this.
>

Using "mount -o remount -o size=6000k /tmp" should apparently work for
that.  The reasoning sounds good to me, too.  That said, unless we can
find an obvious reason for /tmp overfilling, I'm not sure we should do
that, since it will cause problems upgrading.  There's also the issue
that in bug #379, only wireless traffic caused problems.  I think that
even if excessive logs are the problem, the real issue must be
somewhere within the wireless driver, but I could well be wrong...
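
The tmpfs reasoning above can be made concrete with a rough worst-case budget. This is only a sketch under stated assumptions: the 64 MB of RAM (16384 pages of 4 kB) and the candidate /tmp sizes come from this thread, the 90% figure is Sebastian's naive VSZ sum, and the function name and the idea of treating all VSZ as resident are illustrative simplifications (VSZ overstates real usage, so the numbers are pessimistic).

```python
def tmpfs_headroom_kb(total_ram_kb, vsz_fraction, tmpfs_used_kb):
    """Worst-case memory left after processes and tmpfs contents, in KB.

    Assumption: every byte of VSZ is resident (pessimistic), and tmpfs
    pages cannot be pushed out because the router has no swap.
    """
    process_kb = int(total_ram_kb * vsz_fraction)
    return total_ram_kb - process_kb - tmpfs_used_kb


TOTAL_RAM_KB = 16384 * 4  # 16384 pages of 4 kB each

# Candidate amounts of /tmp actually in use, taken from the thread.
for used_kb in (30000, 16000, 6000):
    print(used_kb, tmpfs_headroom_kb(TOTAL_RAM_KB, 0.90, used_kb))
```

Under these pessimistic assumptions only the 6000 KB case leaves a positive remainder, which is in line with the size proposed above; real headroom is likely larger because not all VSZ is resident.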

I'm thinking that maybe flooding wireless->wired with UDP traffic for
5-10 minutes is the right approach, and then vice-versa (restarting
the router in between?).  If there are problems like infinite retries
or packet memory leaks, that might show them up quickly.

--
Robert Bradley


* Re: [Cerowrt-devel] 3.3.6-2
  2012-05-25 11:11                 ` Robert Bradley
@ 2012-05-25 18:25                   ` Sebastian Moeller
  2012-05-25 22:38                     ` Robert Bradley
  0 siblings, 1 reply; 16+ messages in thread
From: Sebastian Moeller @ 2012-05-25 18:25 UTC (permalink / raw)
  To: Robert Bradley; +Cc: cerowrt-devel

Hi Robert,


On May 25, 2012, at 4:11 AM, Robert Bradley wrote:

> On 25 May 2012 07:41, Sebastian Moeller <moeller0@gmx.de> wrote:
>> 
>> Hi Robert,
>> 
>> since I see the same log file on my router as Jim, I just want to report
>> my observations below.
>> 
>> On May 24, 2012, at 11:58 AM, Robert Bradley wrote:
> <snip>
> (re. guest interfaces on wireless)
>>> Are these disabled on your routers at the moment?  I suppose in the
>>> worst case you could try setting an explicit channel for both of the
>>> non-mesh guest interfaces and see if the logs clear up (or somehow pass "-L
>>> /dev/null" to babeld).
>> 
>>        After setting the 2.4GHz channel to 1 instead of auto
>> /tmp/babeld.log still grows with the same entries. And on a WNDR3700v2 there
>> are 30840 KB of tmpfs on /tmp so the babeld.log size of 256KB should not by
>> itself cause the router to crash. That said, while testing this hypothesis
>> by filling most of /tmp (dd if=/dev/zero of=/tmp/delete_me bs=1024
>> count=30000, so that around 340KB stayed free) the router reliably went
>> first into OOM and then rebooted itself. Might it be that the size of the
>> /tmp filesystem is too large if actually used? If I naively add the VSZs of
>> most processes I end up at around 90% of available memory, so worst case
>> there actually only seems to be room for a much smaller /tmp than 30MB.
>> Maybe restricting /tmp to 6000 KB might make this problem go away (or
>> hooking up a swap device). Does this reasoning sound sane? Once I figure out
>> how to reduce the size of /tmp I will test this.
>> 
> 
> Using "mount -o remount -o size=6000k /tmp" should apparently work for
> that.  The reasoning sounds good to me, too.  

	I will go and test that.

> That said, unless we can
> find an obvious reason for /tmp overfilling, I'm not sure we should do
> that, since it will cause problems upgrading.  

	But if I create a file of 30000 1KB blocks in /tmp (so that around 400 KB stay available), the router goes into OOM, so I do not think that upgrading would work well if it really needs that much memory. I have a hunch that the OpenWrt base under cerowrt does not assume something as big and demanding as the 11MB bind9 named process running :)

> There's also the issue
> that in bug #379, only wireless traffic caused problems.  I think that
> even if excessive logs are the problem, the real issue must be
> somewhere within the wireless driver, but I could well be wrong…

	Oh, I agree the /tmp issue is a tangent, but it does not seem healthy that the router spirals into reboot once /tmp fills up. (BTW, if I remove my 30000KB file from /tmp while the first OOM is in progress, the router recovers.) My hunch is that the almost fully instantiated tmpfs takes too much memory from the system for it to handle its usual business.
	On top of that come the wireless issues: what about a kernel memory leak caused by the ath wireless driver that grows and grows until the /tmp usage that triggers the problem is down in the single-digit MBs, which then starts the spiral to reboot?

> 
> I'm thinking that maybe flooding wireless->wired with UDP traffic for
> 5-10 minutes is the right approach, and then vice-versa (restarting
> the router in between?).  If there are problems like infinite retries
> or packet memory leaks, that might show them up quickly.

	That sounds like the right way to proceed, except I am no expert at setting netperf up, so it might take a while until I get around to actually testing that hypothesis. (Do you by any chance know of a publicly available netserver process running on the internet to which I could point a local netperf, and do you have any recommendations on how to create the UDP flood with netperf?)

Best
	Sebastian

> 
> --
> Robert Bradley



* Re: [Cerowrt-devel] 3.3.6-2
  2012-05-25 18:25                   ` Sebastian Moeller
@ 2012-05-25 22:38                     ` Robert Bradley
  2012-06-02  7:03                       ` Sebastian Moeller
  0 siblings, 1 reply; 16+ messages in thread
From: Robert Bradley @ 2012-05-25 22:38 UTC (permalink / raw)
  To: Sebastian Moeller; +Cc: cerowrt-devel

On 25/05/12 19:25, Sebastian Moeller wrote:
> Hi Robert,
>
>
> On May 25, 2012, at 4:11 AM, Robert Bradley wrote:
>
>> That said, unless we can
>> find an obvious reason for /tmp overfilling, I'm not sure we should do
>> that, since it will cause problems upgrading.
> 	But if I create a file of 30000 1KB blocks in /tmp (so that around 400 KB stay available), the router goes into OOM, so I do not think that upgrading would work well if it really needs that much memory. I have a hunch that the OpenWrt base under cerowrt does not assume something as big and demanding as the 11MB bind9 named process running :)
The flash memory size is about 16MB for the WNDR3700, so it's probably 
ok for normal use.  It's less certain with BIND and everything else 
running, although it'd be possible to restart the router, stop BIND and 
then update.

> 	Oh, I agree the /tmp issue is a tangent, but it does not seem healthy that the router spirals into reboot once /tmp fills up. (BTW, if I remove my 30000KB file from /tmp while the first OOM is in progress, the router recovers.) My hunch is that the almost fully instantiated tmpfs takes too much memory from the system for it to handle its usual business.
> 	On top of that come the wireless issues: what about a kernel memory leak caused by the ath wireless driver that grows and grows until the /tmp usage that triggers the problem is down in the single-digit MBs, which then starts the spiral to reboot?

No, definitely not healthy!  I'm thinking that maybe setting tmpfs to 
20MB would be a good compromise, at least until the presumed memory leak 
can be tracked down.

>> I'm thinking that maybe flooding wireless->wired with UDP traffic for
>> 5-10 minutes is the right approach, and then vice-versa (restarting
>> the router in between?).  If there are problems like infinite retries
>> or packet memory leaks, that might show them up quickly.
> 	That sounds like the right way to proceed, except I am no expert at setting netperf up, so it might take a while until I get around to actually testing that hypothesis. (Do you by any chance know of a publicly available netserver process running on the internet to which I could point a local netperf, and do you have any recommendations on how to create the UDP flood with netperf?)
>
>

I don't know of any myself.  There's a possible tutorial on setting it 
up at http://www.tonymacx86.com/viewtopic.php?t=5700, but assuming you 
have it installed on two computers already, it should just be a case of 
running:

user@computer1$ netperf -t UDP_STREAM -H computer2

and possibly running "netserver -p 12865" on computer2 if necessary.  
(It should in theory be started via inetd.)
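
If netperf is not available on both machines, a crude UDP flood can also be sketched in a few lines of Python. This is not a replacement for netperf's UDP_STREAM test (it reports no throughput or loss statistics); the function name, the default payload size and the choice of IPv4 are assumptions made for illustration only.

```python
import socket
import time


def udp_flood(host, port, duration_s=1.0, payload_size=1400):
    """Send UDP datagrams to (host, port) as fast as possible.

    Returns the number of datagrams handed to the local kernel. UDP has
    no delivery guarantee, so the receiver may see far fewer of them.
    """
    payload = b"\x00" * payload_size
    sent = 0
    deadline = time.monotonic() + duration_s
    # IPv4 only in this sketch; the socket is closed when the block exits.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        while time.monotonic() < deadline:
            sock.sendto(payload, (host, port))
            sent += 1
    return sent

# Hypothetical usage for a ten-minute run towards a machine you control
# on the other side of the router (port 9 is the "discard" service):
#   udp_flood("computer2", 9, duration_s=600)
```

Only point this at hosts you own; running it wireless->wired and then wired->wireless matches the test plan suggested earlier in the thread.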


* Re: [Cerowrt-devel] 3.3.6-2
  2012-05-25 22:38                     ` Robert Bradley
@ 2012-06-02  7:03                       ` Sebastian Moeller
  2012-06-03 22:24                         ` Robert Bradley
  0 siblings, 1 reply; 16+ messages in thread
From: Sebastian Moeller @ 2012-06-02  7:03 UTC (permalink / raw)
  To: Robert Bradley; +Cc: cerowrt-devel

Hi Robert,

It took me some time to get a bit further with more testing...

On May 25, 2012, at 3:38 PM, Robert Bradley wrote:

> On 25/05/12 19:25, Sebastian Moeller wrote:
>> Hi Robert,
>> 
>> 
>> On May 25, 2012, at 4:11 AM, Robert Bradley wrote:
>> 
>>> That said, unless we can
>>> find an obvious reason for /tmp overfilling, I'm not sure we should do
>>> that, since it will cause problems upgrading.
>> 	But if I create a file of 30000 1KB blocks in /tmp (so that around 400 KB stay available), the router goes into OOM, so I do not think that upgrading would work well if it really needs that much memory. I have a hunch that the OpenWrt base under cerowrt does not assume something as big and demanding as the 11MB bind9 named process running :)
> The flash memory size is about 16MB for the WNDR3700, so it's probably ok for normal use.  It's less certain with BIND and everything else running, although it'd be possible to restart the router, stop BIND and then update.

	From my totally unscientific testing I am quite convinced that even 16MB of /tmp used will make the router spiral into reboot if used over the 5GHz radio to the wan port. However, if I use one of the wired ports I get plenty of the following (not always hostapd):


Jun  1 23:41:08 nacktmulle kern.warn kernel: [185428.417968] hostapd: page allocation failure: order:0, mode:0x4020
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] Call Trace:
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<802850a4>] dump_stack+0x8/0x34
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<800b4548>] warn_alloc_failed+0xe8/0x10c
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<800b684c>] __alloc_pages_nodemask+0x5a0/0x600
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<800da070>] new_slab+0xa8/0x280
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<80286b18>] __slab_alloc.isra.60.constprop.63+0x25c/0x2fc
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<800dba48>] __kmalloc_track_caller+0x88/0x140
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<801e0854>] __alloc_skb+0x80/0x140
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<801e0930>] dev_alloc_skb+0x1c/0x48
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<801d0c74>] ag71xx_poll+0x430/0x65c
Jun  1 23:41:08 nacktmulle kern.alert kernel: [185428.417968] [<801e8c10>] net_rx_action+0x88/0x1c8
Jun  1 23:41:09 nacktmulle kern.warn kernel: [185429.484375] hostapd: page allocation failure: order:0, mode:0x4020
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Call Trace:
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<802850a4>] dump_stack+0x8/0x34
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<800b4548>] warn_alloc_failed+0xe8/0x10c
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<800b684c>] __alloc_pages_nodemask+0x5a0/0x600
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<800da070>] new_slab+0xa8/0x280
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<80286b18>] __slab_alloc.isra.60.constprop.63+0x25c/0x2fc
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<800dba48>] __kmalloc_track_caller+0x88/0x140
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<801e0854>] __alloc_skb+0x80/0x140
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<801e0930>] dev_alloc_skb+0x1c/0x48
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] [<801d0c74>] ag71xx_poll+0x430/0x65c
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] 
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Mem-Info:
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Normal per-cpu:
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] CPU    0: hi:   18, btch:   3 usd:  18
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] active_anon:3826 inactive_anon:63 isolated_anon:0
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375]  active_file:683 inactive_file:561 isolated_file:0
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375]  unevictable:0 dirty:0 writeback:0 unstable:0
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375]  free:96 slab_reclaimable:408 slab_unreclaimable:7706
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375]  mapped:501 shmem:109 pagetables:142 bounce:0
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Normal free:384kB min:1016kB low:1268kB high:1524kB active_anon:15304kB inactive_anon:252kB active_file:2732kB inactive_file:2244kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:65024kB mlocked:0k
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] lowmem_reserve[]: 0 0
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Normal: 42*4kB 15*8kB 0*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 384kB
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] 1353 total pagecache pages
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] 0 pages in swap cache
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Swap cache stats: add 0, delete 0, find 0/0
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Free swap  = 0kB
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] Total swap = 0kB
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] 16384 pages RAM
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] 965 pages reserved
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] 1399 pages shared
Jun  1 23:41:09 nacktmulle kern.alert kernel: [185429.484375] 14306 pages non-shared
Jun  1 23:41:09 nacktmulle kern.warn kernel: [185429.484375] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Jun  1 23:41:09 nacktmulle kern.warn kernel: [185429.484375]   cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 2, min order: 0
Jun  1 23:41:09 nacktmulle kern.warn kernel: [185429.484375]   node 0: slabs: 0, objs: 0, free: 0

But the box seems to survive this… Heck, it even survives my test case with 16000 KB of /tmp used. Under that amount of memory pressure named and ntpd get killed, but the router does not go into an automatic reboot; it just stays up and running, albeit somewhat useless without named.
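
As a cross-check of the numbers in the log above, here is a small Python sketch that re-adds the free-list line and converts the page counts to kB. The values are copied from the log; the helper name is made up, and the 4 kB page size is inferred from the "42*4kB" entries. Reading the result as "free memory is below the min watermark, so atomic allocations fail" is an interpretation, not something the log states outright.

```python
import re

PAGE_KB = 4  # page size inferred from the smallest free-list entry


def buddy_free_kb(line):
    """Sum the 'count*sizekB' terms of a Mem-Info free-list line."""
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    return sum(int(count) * int(size) for count, size in terms)


normal = ("Normal: 42*4kB 15*8kB 0*16kB 1*32kB 1*64kB 0*128kB 0*256kB "
          "0*512kB 0*1024kB 0*2048kB 0*4096kB = 384kB")

free_kb = buddy_free_kb(normal)
min_watermark_kb = 1016                  # "min:1016kB" in the zone line
slab_unreclaimable_kb = 7706 * PAGE_KB   # "slab_unreclaimable:7706" pages
total_ram_kb = 16384 * PAGE_KB           # "16384 pages RAM"

print("free kB:", free_kb, "below min watermark:", free_kb < min_watermark_kb)
print("unreclaimable slab kB:", slab_unreclaimable_kb,
      "share of RAM (%):", round(100 * slab_unreclaimable_kb / total_ram_kb))
```

The unreclaimable slab share is worth watching over time: if it keeps growing between reboots, that would be consistent with the presumed kernel-side leak, although a single snapshot cannot prove it.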



> 
>> 	Oh, I agree the /tmp issue is a tangent, but it does not seem healthy that the router spirals into reboot once /tmp fills up. (BTW, if I remove my 30000KB file from /tmp while the first OOM is in progress, the router recovers.) My hunch is that the almost fully instantiated tmpfs takes too much memory from the system for it to handle its usual business.
>> 	On top of that come the wireless issues: what about a kernel memory leak caused by the ath wireless driver that grows and grows until the /tmp usage that triggers the problem is down in the single-digit MBs, which then starts the spiral to reboot?
> 
> No, definitely not healthy!  I'm thinking that maybe setting tmpfs to 20MB would be a good compromise, at least until the presumed memory leak can be tracked down.

	The way I interpret my latest test results is that the "assumed leak" should be restricted to the wireless driver; does that sound right to you? Also, with cerowrt 3.3.6-2 even 16MB seems too much for /tmp. I will see what happens if I add some swap space to the router; I hope it will be quite happy with 31MB /tmp and actual usage of that space :). Since Dave only recommends full tftp reflashes, maybe the update scenario is not such a big issue for cerowrt?

> 
>>> I'm thinking that maybe flooding wireless->wired with UDP traffic for
>>> 5-10 minutes is the right approach, and then vice-versa (restarting
>>> the router in between?).  If there are problems like infinite retries
>>> or packet memory leaks, that might show them up quickly.
>> 	That sounds like the right way to proceed, except I am no expert at setting netperf up, so it might take a while until I get around to actually testing that hypothesis. (Do you by any chance know of a publicly available netserver process running on the internet to which I could point a local netperf, and do you have any recommendations on how to create the UDP flood with netperf?)
>> 
>> 
> 
> I don't know of any myself.  There's a possible tutorial on setting it up at http://www.tonymacx86.com/viewtopic.php?t=5700, but assuming you have it installed on two computers already, it should just be a case of running:
> 
> user@computer1$ netperf -t UDP_STREAM -H computer2
> 
> and possibly running "netserver -p 12865" on computer2 if necessary.  (It should in theory be started via inetd.)


	I am still trying to get a second machine on my network so I can test the UDP hypothesis, but that will take a while longer…

Best
	Sebastian



* Re: [Cerowrt-devel] 3.3.6-2
  2012-06-02  7:03                       ` Sebastian Moeller
@ 2012-06-03 22:24                         ` Robert Bradley
  2012-06-06 23:03                           ` Sebastian Moeller
  0 siblings, 1 reply; 16+ messages in thread
From: Robert Bradley @ 2012-06-03 22:24 UTC (permalink / raw)
  To: Sebastian Moeller; +Cc: cerowrt-devel

On 02/06/12 08:03, Sebastian Moeller wrote:
> 	 From my totally unscientific testing I am quite convinced that even 16MB of /tmp used will make the router spiral into reboot if used over the 5GHz radio to the wan port. However, if I use one of the wired ports I get plenty of the following (not always hostapd):
>
>
> <snip>
>
> But the box seems to survive this… Heck, it even survives my test case with 16000 KB of /tmp used. Under that amount of memory pressure named and ntpd get killed, but the router does not go into an automatic reboot; it just stays up and running, albeit somewhat useless without named.
>

Yes - that stack trace is because the ag71xx driver can't allocate the 
memory for a skb structure.  Unlike the wireless driver though, the 
ag71xx_poll function simply returns immediately with ENOMEM.  I had no 
real success in tracing what the equivalent is in ath9k.

I noticed a possible issue in ath9k_rx_tasklet, since if 
bf->bf_mpdu=NULL (bf being an Atheros-specific buffer type) you could 
potentially get an infinite loop.  I can't see though if that can ever 
occur in reality.  I *think* it uses a list of skb structures 
preallocated at init-time for incoming frames, but I'm still trying to 
interpret that part of the code.  (The exact behaviour is 
hardware-dependent.)

> 	The way I interpret my latest test results is that the "assumed leak" should be restricted to the wireless driver; does that sound right to you? Also, with cerowrt 3.3.6-2 even 16MB seems too much for /tmp. I will see what happens if I add some swap space to the router; I hope it will be quite happy with 31MB /tmp and actual usage of that space :). Since Dave only recommends full tftp reflashes, maybe the update scenario is not such a big issue for cerowrt?
>

I'll leave that to Dave to say - I was assuming that the firmware would 
be stored in memory first and then flashed.  (There's always tftp at 
boot time as an alternative flashing method.)
-- 
Robert Bradley


* Re: [Cerowrt-devel] 3.3.6-2
  2012-06-03 22:24                         ` Robert Bradley
@ 2012-06-06 23:03                           ` Sebastian Moeller
  0 siblings, 0 replies; 16+ messages in thread
From: Sebastian Moeller @ 2012-06-06 23:03 UTC (permalink / raw)
  To: Robert Bradley; +Cc: cerowrt-devel

Hi Robert,


On Jun 3, 2012, at 3:24 PM, Robert Bradley wrote:

> On 02/06/12 08:03, Sebastian Moeller wrote:
>> 	 From my totally unscientific testing I am quite convinced that even 16MB of /tmp used will make the router spiral into reboot if used over the 5GHz radio to the wan port. However, if I use one of the wired ports I get plenty of the following (not always hostapd):
>> 
>> 
>> <snip>
>> 
>> But the box seems to survive this… Heck, it even survives my test case with 16000 KB of /tmp in use. Under that amount of memory pressure named and ntpd get killed, but the router does not automatically reboot; it just stays up and running, albeit somewhat useless without named.
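
The memory figures in the trace above can be cross-checked with a few lines of arithmetic. What follows is only a minimal sketch, assuming 4 kB pages (which matches 16384 pages of RAM on a 64 MB device); the helper name `parse_buddy_line` is made up for illustration. It sums the per-order free list from the `Normal:` line and compares the result with the `min` watermark reported in the same dump.

```python
import re

# Values copied from the dmesg output quoted above.
BUDDY_LINE = ("Normal: 42*4kB 15*8kB 0*16kB 1*32kB 1*64kB 0*128kB 0*256kB "
              "0*512kB 0*1024kB 0*2048kB 0*4096kB = 384kB")
FREE_PAGES = 96          # "free:96" in the Mem-Info block
MIN_WATERMARK_KB = 1016  # "min:1016kB" on the Normal zone line
PAGE_KB = 4              # assumption: 4 kB pages


def parse_buddy_line(line):
    """Return ({block_size_kB: count}, reported_total_kB) for one zone line.

    Hypothetical helper, not part of any kernel tooling.
    """
    blocks = {int(size): int(count)
              for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}
    reported = int(re.search(r"=\s*(\d+)kB", line).group(1))
    return blocks, reported


blocks, reported_kb = parse_buddy_line(BUDDY_LINE)

# Total free memory according to the per-order free list.
free_kb = sum(size * count for size, count in blocks.items())

# Largest contiguous block that is still available.
largest_free_kb = max(size for size, count in blocks.items() if count > 0)

print(f"free (summed): {free_kb} kB, reported: {reported_kb} kB")
print(f"free pages * page size: {FREE_PAGES * PAGE_KB} kB")
print(f"largest free block: {largest_free_kb} kB")
print(f"below min watermark: {free_kb < MIN_WATERMARK_KB}")
```

All three ways of counting agree on 384 kB free, well below the 1016 kB minimum watermark, so the allocation failure is at least consistent with the zone simply being out of memory at that moment.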
>> 
> 
> Yes - that stack trace is because the ag71xx driver can't allocate the memory for a skb structure.  Unlike the wireless driver though, the ag71xx_poll function simply returns immediately with ENOMEM.  I had no real success in tracing what the equivalent is in ath9k.
> 
> I noticed a possible issue in ath9k_rx_tasklet: if bf->bf_mpdu == NULL (bf being an Atheros-specific buffer type), you could potentially get an infinite loop.  I can't tell, though, whether that can ever occur in reality.  I *think* it uses a list of skb structures preallocated at init time for incoming frames, but I'm still trying to interpret that part of the code.  (The exact behaviour is hardware-dependent.)
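
To make the suspected failure mode concrete, here is a toy model of a receive loop that skips buffers with nothing attached. This is purely illustrative and is not ath9k code: the function `drain`, the flag `advance_on_empty`, and the iteration guard are all invented for the sketch, and whether the real driver takes the buffer off its list before the check is exactly the open question.

```python
def drain(ring, advance_on_empty, max_iterations=1000):
    """Walk a list of receive buffers; None stands for a buffer with no skb attached.

    Returns (frames_processed, iterations, finished).
    The max_iterations guard exists only so the sketch terminates.
    """
    index = 0
    processed = 0
    iterations = 0
    while index < len(ring):
        if iterations >= max_iterations:
            return processed, iterations, False  # would have spun forever
        iterations += 1
        buf = ring[index]
        if buf is None:
            if advance_on_empty:
                index += 1  # skip the empty buffer and move on
            continue        # without the advance, the same buffer is seen again
        processed += 1
        index += 1
    return processed, iterations, True


ring = ["frame1", None, "frame2"]
print(drain(ring, advance_on_empty=True))   # loop consumes the empty slot
print(drain(ring, advance_on_empty=False))  # loop is stuck on the empty slot
```

In the first call the loop finishes after three iterations; in the second it never gets past the empty buffer and only stops because of the artificial guard. A real loop without such a guard would hang, which would fit a router that locks up rather than returning an error.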

	I see, that is out of my league then (I read C badly, so I will not necessarily recognize a bug when I look at it); unless I can run some (simple) tests I do not see how I can actually help fix this… (That said, I will try to get the proper kernel sources and start digging through the ath9k driver, if only to learn something new.)
	Also I will try to repeat my simplistic tests with some swap space hooked up to see whether that ameliorates the issue out of existence. (In my view it is quite acceptable to require swap to be present for a fully "Tricked out" router distribution like cerowrt).

> 
>> 	The way I interpret my latest test results is that the "assumed leak" should be restricted to the wireless driver, does that sound right to you? Also, with cerowrt 3.3.6-2 even 16MB seems to be too much for /tmp. I will see what happens if I add some swap space to the router; I hope it will be quite happy with 31MB /tmp and actual usage of that space :). Since Dave only recommends full tftp reflashes, maybe the update scenario is not such a big issue for cerowrt?
>> 
> 
> I'll leave that to Dave to say - I was assuming that the firmware would be stored in memory first and then flashed.  (There's always tftp at boot time as an alternative flashing method.)

	Well, maybe the next kernel base for cerowrt will be more forgiving :)

> -- 
> Robert Bradley


end of thread, other threads:[~2012-06-06 23:03 UTC | newest]

Thread overview: 16+ messages
     [not found] <00404BC8-3761-409D-A1C8-9213D7D9A3DF@gmx.de>
2012-05-24  3:48 ` [Cerowrt-devel] Fwd: 3.3.6-2 Sebastian Moeller
2012-05-24 15:44   ` Robert Bradley
2012-05-24 16:18     ` Sebastian Moeller
2012-05-24 16:32       ` Jim Gettys
2012-05-24 18:12         ` Sebastian Moeller
2012-05-24 18:15           ` Jim Gettys
2012-05-24 18:58             ` Robert Bradley
2012-05-25  6:41               ` [Cerowrt-devel] 3.3.6-2 Sebastian Moeller
2012-05-25  7:02                 ` Dave Taht
2012-05-25 11:11                 ` Robert Bradley
2012-05-25 18:25                   ` Sebastian Moeller
2012-05-25 22:38                     ` Robert Bradley
2012-06-02  7:03                       ` Sebastian Moeller
2012-06-03 22:24                         ` Robert Bradley
2012-06-06 23:03                           ` Sebastian Moeller
2012-05-25  0:04             ` [Cerowrt-devel] Fwd: 3.3.6-2 Sebastian Moeller
