On Fri, Jan 24, 2014 at 3:23 PM, Dave Taht <dave.taht@gmail.com> wrote:
On Fri, Jan 24, 2014 at 6:08 PM, Steve Jenson <stevej@fruitless.org> wrote:
> Hi everybody,
>
> I've been using cerowrt as a secondary wifi network (just a single AP for
> now) for a few weeks now. Recently, my wndr3800 got stuck in a bad state and
> eventually rebooted. I've had this happen a few times now and am looking for
> ways to debug the issue. I'm new to cerowrt and openwrt so any advice is
> appreciated.
>
> Since I use it as a secondary network, this is no way critical.

Yea! I appreciate caution before putting alpha software on your gw.

> I'm not
> looking for free tech support but I couldn't find anything on the wiki about
> troubleshooting. I'd love to start a page and write some shell scripts to
> diagnose and report issues. I know that a cerowrt router is meant to be a
> research project rather a consumer device but these things seem helpful
> regardless.

Sure, let me know your wiki account. I have been lax about granting
access of late as the signup process is overrun by spammers.

My username is stevej on the wiki. Thanks! 

 
> Please let me know if you'd prefer I not email the list with these issues or
> if you'd rather I used trac or a different forum.
>

The list is where most stuff happens. Also in the irc channel.

If it gets to where it needs to be tracked we have a bugtracker at

http://www.bufferbloat.net/projects/cerowrt/issues

The first question I have is: Are you on comcast? Cerowrt
had a dhcpv6-pd implementation that "just worked" from feburary through
december. Regrettably they changed the RA announcement interval
to a really low number around then... and this triggers a firewall reload
every minute on everything prior to the release I point to below.

If there is a memory leak somewhere that would have triggered it.

I am on AT&T ADSL2+ with a Motorola NVG510 modem. 

 
> In this state, I can connect to the cerowrt base station via wifi but am
> unable to route packets to the internet. I can connect to :81 and see the
> login page but logging in results in a lua error at `/cgi-bin/luci`
>
>
>   /usr/lib/lua/luci/dispatcher.lua:448: Failed to execute function
> dispatcher target for entry '/'.
>   The called action terminated with an exception:
>   /usr/lib/lua/luci/sauth.lua:87: Session data invalid!
>   stack traceback:
>   [C]: in function 'assert'
>   /usr/lib/lua/luci/dispatcher.lua:448: in function 'dispatch'
>   /usr/lib/lua/luci/dispatcher.lua:195: in function
> </usr/lib/lua/luci/dispatcher.lua:194>
>
> I can ssh into the device and cat various log files until the router hangs
> and reboots. here's a few relevant lines from my terminal history before the
> device rebooted (I'm assuming a watchdog kicked in and rebooted it).
>
> root@buffy2-1:~# ping google.com
> ping: bad address 'google.com'
> root@buffy2-1:~# free
>              total         used         free       shared      buffers
> Mem:        126336       110332        16004            0         5616
> -/+ buffers:             104716        21620
> Swap:            0            0            0
> root@buffy2-1:~# uptime
>  02:08:54 up 2 days,  1:26,  load average: 0.10, 0.21, 0.17
> root@buffy2-1:~# dmesg
> [    0.000000] Linux version 3.10.24 (cero2@snapon) (gcc version 4.6.4
> (OpenWrt/Linaro GCC 4.6-2013.05 r38226) ) #1 Tue Dec 24
> 10:50:15 PST 2013
> [skipping some lines]
>
> [   13.156250] Error: Driver 'gpio-keys-polled' is already registered,
> aborting...
> [   19.414062] IPv6: ADDRCONF(NETDEV_UP): ge00: link is not ready
> [   19.421875] ar71xx: pll_reg 0xb8050010: 0x11110000
> [   19.429687] se00: link up (1000Mbps/Full duplex)
> [   22.140625] IPv6: ADDRCONF(NETDEV_UP): sw00: link is not ready
> [   23.351562] IPv6: ADDRCONF(NETDEV_CHANGE): sw00: link becomes ready
> [   23.757812] ar71xx: pll_reg 0xb8050014: 0x11110000
> [   23.757812] ge00: link up (1000Mbps/Full duplex)
> [   23.773437] IPv6: ADDRCONF(NETDEV_CHANGE): ge00: link becomes ready
>
> root@buffy2-1:~# ifconfig
> ge00      Link encap:Ethernet  HWaddr 2C:B0:5D:A0:C5:B1
>           inet addr:192.168.1.138  Bcast:192.168.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::2eb0:5dff:fea0:c5b1/64 Scope:Link
>           inet6 addr: 2602:30a:2cdb:330:2eb0:5dff:fea0:c5b1/64 Scope:Global
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1469670 errors:0 dropped:8 overruns:0 frame:0
>           TX packets:547733 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:229243410 (218.6 MiB)  TX bytes:57304808 (54.6 MiB)
>           Interrupt:5
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:23689 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:23689 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:2612713 (2.4 MiB)  TX bytes:2612713 (2.4 MiB)
>
> pimreg    Link encap:UNSPEC  HWaddr
> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           UP RUNNING NOARP  MTU:1472  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>
> se00      Link encap:Ethernet  HWaddr 2E:B0:5D:A0:C5:B0
>           inet addr:172.30.42.1  Bcast:172.30.42.31  Mask:255.255.255.224
>           inet6 addr: 2602:30a:2cdb:330:2eb0:5dff:fea0:c5b1/64 Scope:Global

How are you assigning your ipv6 addresses?

It's been a while since I messed with this but I think IPv6 is assigned thanks to 6relayd? My modem has IPv6 enabled but no DHCPv6 options that I can find. Here's how cerowrt is configured.

root@buffy2-1:/overlay/etc/config# cat 6relayd
config server 'default'
  option fallback_relay 'rd dhcpv6 ndp'
  list network 'ge00'
  list network 'ge01'
  list network 'gw00'
  list network 'gw01'
  list network 'gw10'
  list network 'gw11'
  list network 'se00'
  list network 'sw00'
  list network 'sw10'
  option rd 'relay'
  option dhcpv6 'relay'
  option ndp 'relay'
  option master 'ge00'


>           inet6 addr: fe80::2cb0:5dff:fea0:c5b0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:191740 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 B)  TX bytes:42184988 (40.2 MiB)
>           Interrupt:4
>
> sw00      Link encap:Ethernet  HWaddr 2C:B0:5D:A0:C5:B0
>           inet addr:172.30.42.65  Bcast:172.30.42.95  Mask:255.255.255.224
>           inet6 addr: 2602:30a:2cdb:330:2eb0:5dff:fea0:c5b1/64 Scope:Global
>           inet6 addr: fe80::2eb0:5dff:fea0:c5b0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:70239 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:286967 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:15590189 (14.8 MiB)  TX bytes:127357293 (121.4 MiB)
>
> root@buffy2-1:~# less /var/log/babeld.log
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> send: Cannot assign requested address
> send: Cannot assign requested address
> send: Cannot assign requested address
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> netlink_read: recvmsg(): No buffer space available
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.
> Couldn't determine channel of interface sw00: Invalid argument.

This is a problem in babel detecting the channel on a "normal"
rather than a mesh interface. It's bugged me a long while, but
haven't got around to finding what triggers it. Might "fix" it by
acquiring the channel at babel start time from /etc/config/wireless.

It messes up the diversity routing calculation, grump.

There is a possibility a logfile got really big, but this one
generally doesn't, but I should turn off logging in some
future release...

I believe I've tracked down part of what's going on. It looks like my tmpfs is filling up 100% and then the device enters a bad state:

After 24 hours, with tmpfs at 50%, babeld.log is the largest file by far in tmpfs and the only file that appears to be growing (based on `du`). It takes about 48 hours from reboot to fill up tmpfs on my device.

#  sort babeld.log | uniq -c |sort -rn |head

 503236 Couldn't determine channel of interface sw00: Invalid argument.

   1376 netlink_read: recvmsg(): No buffer space available

      3 send: Cannot assign requested address

# wc -l babeld.log

504617 babeld.log

I sped up system failure by using `dd` to fill up tmpfs and the system became immediately unusable.

This also explains the luci session store errors as sessions are stored in tmpfs.

The other buffer issues may or may not be related to this.

Best,
Steve