[Cerowrt-devel] TFO crashes cerowrt 3.7.1-1

Eric Dumazet edumazet at google.com
Sun Jan 13 13:03:23 EST 2013


I suspect a bug in the spin_is_locked() implementation on your arch, as the
socket lock should be held at this point.
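
One reading of that suspicion, offered as an assumption rather than a
confirmed diagnosis: on a uniprocessor kernel built without spinlock
debugging, spin_is_locked() may be stubbed to always return 0. In that case
the first half of the BUG_ON() condition quoted below is always true, and the
check fires whenever the socket is not owned by user context. The user-space
C sketch here models only that logic; model_sock and the model_* functions
are hypothetical names, not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct model_sock {
	bool slock_held;     /* stands in for sk->sk_lock.slock being held */
	bool owned_by_user;  /* stands in for sock_owned_by_user(sk) */
};

/* Assumed SMP behaviour: the query reports the real lock state. */
static bool model_spin_is_locked_smp(const struct model_sock *sk)
{
	return sk->slock_held;
}

/* Assumed uniprocessor behaviour: the lock is compiled away, so the
 * query has no state to inspect and always answers "not locked". */
static bool model_spin_is_locked_up(const struct model_sock *sk)
{
	(void)sk;
	return false;
}

/* Mirrors the condition inside the BUG_ON() in the quoted
 * reqsk_fastopen_remove(): true means the kernel would panic. */
static bool model_check_fires(const struct model_sock *sk,
			      bool (*is_locked)(const struct model_sock *))
{
	return !is_locked(sk) && !sk->owned_by_user;
}
```

With slock_held set and owned_by_user clear (the softirq receive path in the
trace), the SMP variant leaves the check quiet while the uniprocessor variant
trips it even though the caller holds the lock.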



On Sun, Jan 13, 2013 at 9:01 AM, Ketan Kulkarni <ketkulka at gmail.com> wrote:

> I got a chance to capture the backtrace from the serial port. I haven't
> done a kgdb session yet.
> To reiterate, the crash occurs on the TFO server on the mips platform.
>
> The call trace looks like this
> [ 1024.530000] Call Trace:
> [ 1024.530000] [<801fc7f4>] reqsk_fastopen_remove+0x30/0x17c
> [ 1024.530000] [<8024a36c>] tcp_rcv_state_process+0x7b4/0xc28
> [ 1024.530000] [<802516ec>] tcp_v4_do_rcv+0x21c/0x274
> [ 1024.530000] [<80253c74>] tcp_v4_rcv+0x5b4/0x974
> [ 1024.530000] [<802320f0>] ip_local_deliver_finish+0x168/0x29c
> [ 1024.530000] [<80207100>] __netif_receive_skb+0x63c/0x6c0
> [ 1024.530000] [<c060b2e8>] ieee80211_deliver_skb+0x1b8/0x220 [mac80211]
> [ 1024.530000] [<c060cc70>] ieee80211_rx_handlers.part.12+0x1654/0x23e0 [mac80211]
> [ 1024.530000] [<c060e468>] ieee80211_prepare_and_rx_handle+0xa6c/0xaf0 [mac80211]
> [ 1024.530000] [<c060ecfc>] ieee80211_rx+0x810/0x8d8 [mac80211]
> [ 1024.530000] [<c078651c>] ath_rx_tasklet+0xf4c/0x10a4 [ath9k]
> [ 1024.530000] [<c078437c>] ath9k_tasklet+0x104/0x174 [ath9k]
> [ 1024.530000] [<800793b8>] tasklet_action+0x78/0xc8
> [ 1024.530000] [<80078c08>] __do_softirq+0xb0/0x184
> [ 1024.530000] [<80078d8c>] do_softirq+0x48/0x68
> [ 1024.530000] [<80078fa8>] irq_exit+0x4c/0x7c
> [ 1024.530000] [<8006330c>] ret_from_irq+0x0/0x4
> [ 1024.530000]
> [ 1024.530000] Code: 8e510208 30d300ff 2c420001 <00028036> 0c01e2a7 ac80048c 8e220008 2442ffff ae220008
> [ 1024.940000] ---[ end trace a47ff22dd20a96c1 ]---
> [ 1024.950000] Kernel panic - not syncing: Fatal exception in interrupt
>
> I suspect this is the line responsible for this crash
>
> void reqsk_fastopen_remove(struct sock *sk, struct request_sock *req,
>                            bool reset)
> {
>         struct sock *lsk = tcp_rsk(req)->listener;
>         struct fastopen_queue *fastopenq =
>                 inet_csk(lsk)->icsk_accept_queue.fastopenq;
>
> >>>>>   BUG_ON(!spin_is_locked(&sk->sk_lock.slock) &&
>                !sock_owned_by_user(sk));
>
>         tcp_sk(sk)->fastopen_rsk = NULL;
>         spin_lock_bh(&fastopenq->lock);
>         fastopenq->qlen--;
>         tcp_rsk(req)->listener = NULL;
>
> Please see more details here
> http://www.bufferbloat.net/issues/418#change-1706
>
> Thanks,
> Ketan
>
> On Jan 6, 2013 12:43 AM, "Ketan Kulkarni" <ketkulka at gmail.com> wrote:
> >
> > Disabling ECN on cero box has no effect.
> > The box crashed with ECN disabled.
> > Also tried enabling ECN on x86 and it didn't crash in either case. The
> > tcpdump on cero lo is updated at -
> > https://www.bufferbloat.net/issues/418#change-1703
> > It is exactly similar to the previously attached "lo_capture.txt" but
> > with ECN disabled.
> >
> > I might try getting serial cable on Sunday to get the crash details.
> > Till then probably I cannot provide the crash logs as logread/dmesg
> > does not print anything.
> >
> > Thanks,
> > Ketan
> >
> > On Sat, Jan 5, 2013 at 8:32 AM, Ketan Kulkarni <ketkulka at gmail.com> wrote:
> > > Without TFO all worked fine.
> > > The problem is when tfo server is on cero box.
> > > I will try both enabling ECN on the laptop and disabling ECN on cero
> > > with TFO on. Will report the behavior seen.
> > >
> > > Thanks,
> > > Ketan.
> > >
> > > On Jan 5, 2013 7:50 AM, "Yuchung Cheng" <ycheng at google.com> wrote:
> > >>
> > >> On Fri, Jan 4, 2013 at 5:59 PM, Ketan Kulkarni <ketkulka at gmail.com> wrote:
> > >> > Well, I was trying polipo server on cero box and httping from laptop.
> > >> > On both the boxes I set 3 in tcp_fastopen.
> > >> >
> > >> > The panic is seen only when server is on cero box.
> > >> > If I run server on my laptop and httping from cero all TFO connections
> > >> > are successful.
> > >> > So I doubt that the only problem is SYN+DATA.
> > >> Just to confirm: you meant the problem is SYN/data processing on the
> > >> server side?
> > >>
> > >> Maybe we hit some ECN / TFO bug. Some crash log would be great. Thanks
> > >> for trying TFO!
> > >>
> > >> >
> > >> > Unfortunately I don't have the serial cable right now, and logread or
> > >> > dmesg didn't print any logs before the cero router restarted.
> > >> >
> > >> > Attached is the tcpdump capture on lo when client and server both run
> > >> > on cero box.
> > >> > HTH!
> > >> >
> > >> > If you (or anyone) can suggest more diagnostics, I will be glad to
> > >> > provide.
> > >> >
> > >> > On Jan 5, 2013 2:49 AM, "Jerry Chu" <hkchu at google.com> wrote:
> > >> >>
> > >> >> +ycheng
> > >> >>
> > >> >>
> > >> >> On Fri, Jan 4, 2013 at 1:11 PM, Dave Taht <dave.taht at gmail.com> wrote:
> > >> >>>
> > >> >>> Hmm. I would lean towards there being an issue with the new (freshly
> > >> >>> ported forward to 3.7.1) unaligned checksum code for mips based on
> > >> >>> what you say here. Or an offload...
> > >> >>>
> > >> >>> As for the 239.x multicast issue, hmm... separate issue entirely.
> > >> >>> Probably...
> > >> >>>
> > >> >>> And then there's TFO. I note that in order to use it properly you need
> > >> >>> to turn it on in proc. Last I remember that was
> > >> >>>
> > >> >>> echo 3 > /proc/sys/net/ipv4/tcp_fastopen
> > >> >>
> > >> >>
> > >> >> Correct - to enable the normal use of TFO for both client and server.
> > >> >> There are other flags for advanced usage:
> > >> >>  /* Bit Flags for sysctl_tcp_fastopen */
> > >> >> #define TFO_CLIENT_ENABLE       1
> > >> >> #define TFO_SERVER_ENABLE       2
> > >> >> #define TFO_CLIENT_NO_COOKIE    4 /* Send data-in-SYN w/o cookie */
> > >> >>
> > >> >> /* Process SYN data but skip cookie validation */
> > >> >> #define TFO_SERVER_COOKIE_NOT_CHKED     0x100
> > >> >> /* Accept SYN data w/o any cookie option */
> > >> >> #define TFO_SERVER_COOKIE_NOT_REQD      0x200
> > >> >>
> > >> >> /* Force enable TFO on all listeners, i.e., not requiring the
> > >> >>  * TCP_FASTOPEN socket option. SOCKOPT1/2 determine how to set max_qlen.
> > >> >>  */
> > >> >> #define TFO_SERVER_WO_SOCKOPT1  0x400
> > >> >> #define TFO_SERVER_WO_SOCKOPT2  0x800
> > >> >> /* Always create TFO child sockets on a TFO listener even when
> > >> >>  * cookie/data not present. (For testing purpose!)
> > >> >>  */
> > >> >> #define TFO_SERVER_ALWAYS       0x1000
> > >> >>
> > >> >>>
> > >> >>> However that's an old memory and there is this tcp_fastopen_key file I
> > >> >>> don't know anything about yet (this is such bleeding edge stuff!)
> > >> >>>
> > >> >>> ... and with tcp_fastopen disabled things should still work right...
> > >> >>> so I'm thinking something else is busted in the stack.
> > >> >>>
> > >> >>> I've also observed a dns slowdown in what I've been testing but hadn't
> > >> >>> dug into packet dumps. (and was assuming, until now, it was due to me
> > >> >>> fiddling with ULAs inside the network) Thanks for digging this deep!
> > >> >>>
> > >> >>> I never said this first attempt at 3.7 for cero was going to be
> > >> >>> perfect, but we've entered a new age of subtle problems here.
> > >> >>>
> > >> >>> I strongly suggest nobody else try this dev build as a default gw, and
> > >> >>> that the TFO folk ignore the noise for now.
> > >> >>
> > >> >>
> > >> >> SG.
> > >> >>
> > >> >> Jerry
> > >> >>
> > >> >>>
> > >> >>>
> > >> >>> I just got a 3.7.1 box built on x86_64 so as to a/b some captures.
> > >> >>> Regrettably I'm short on time through the weekend...
> > >> >>>
> > >> >>> On Fri, Jan 4, 2013 at 12:42 PM, Maciej Soltysiak
> > >> >>> <maciej at soltysiak.com>
> > >> >>> wrote:
> > >> >>> > I am seeing something strange here with polipo, related to TFO
> > >> >>> > but also DNS.
> > >> >>> > When I just took 3.7.1-1 and set my Windows 7 laptop to use
> > >> >>> > gw.home.lan:8123 as http proxy it didn't work. What I observed was:
> > >> >>> > a) after quite a while polipo's response to browser was 504 Host
> > >> >>> > www.osnews.com lookup failed: Timeout
> > >> >>> > b) this error in ssh console: Host osnews.com lookup failed: Timeout
> > >> >>> > (131072)
> > >> >>> > c) Disabling TFO by adding option useTCPFastOpen 'false' to config
> > >> >>> > 'polipo' 'general' works around the problem
> > >> >>> > d) Alternatively, you can keep TFO enabled in polipo but change
> > >> >>> > option 'dnsUseGethostbyname' from 'reluctantly' to 'true' (!)
> > >> >>> > This is very weird, because TFO is TCP and the DNS queries fired
> > >> >>> > off by polipo are UDP:
> > >> >>> > root@OpenWrt:/tmp/log# tcpdump -n -v -vv -vvv -x -X -s 1500 -i lo
> > >> >>> > 20:21:56.160245 IP (tos 0x0, ttl 64, id 50129, offset 0, flags [DF], proto UDP (17), length 60)
> > >> >>> >     127.0.0.1.47304 > 127.0.0.1.53: [bad udp cksum 0xfe3b -> 0xd17f!] 55396+ A? www.osnews.com. (32)
> > >> >>> >     0x0000: 4500 003c c3d1 4000 4011 78dd 7f00 0001 E..<..@.@.x.....
> > >> >>> >     0x0010: 7f00 0001 b8c8 0035 0028 fe3b d864 0100 .......5.(.;.d..
> > >> >>> >     0x0020: 0001 0000 0000 0000 0377 7777 066f 736e .........www.osn
> > >> >>> >     0x0030: 6577 7303 636f 6d00 0001 0001 ews.com.....
> > >> >>> > 20:21:56.160319 IP (tos 0x0, ttl 64, id 50130, offset 0, flags [DF], proto UDP (17), length 60)
> > >> >>> >     127.0.0.1.47304 > 127.0.0.1.53: [bad udp cksum 0xfe3b -> 0xd164!] 55396+ AAAA? www.osnews.com. (32)
> > >> >>> >     0x0000: 4500 003c c3d2 4000 4011 78dc 7f00 0001 E..<..@.@.x.....
> > >> >>> >     0x0010: 7f00 0001 b8c8 0035 0028 fe3b d864 0100 .......5.(.;.d..
> > >> >>> >     0x0020: 0001 0000 0000 0000 0377 7777 066f 736e .........www.osn
> > >> >>> >     0x0030: 6577 7303 636f 6d00 001c 0001 ews.com.....
> > >> >>> > 20:21:56.169942 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 123)
> > >> >>> >     127.0.0.1.53 > 127.0.0.1.47304: [bad udp cksum 0xfe7a -> 0x5f73!] 55396 q: A? www.osnews.com. 1/2/0 www.osnews.com. [29m3s] A 74.86.31.159 ns: osnews.com. [29m3s] NS ns2.swelter.net., osnews.com. [29m3s] NS ns1.swelter.net. (95)
> > >> >>> >     0x0000: 4500 007b 0000 4000 4011 3c70 7f00 0001 E..{..@.@.<p....
> > >> >>> >     0x0010: 7f00 0001 0035 b8c8 0067 fe7a d864 8180 .....5...g.z.d..
> > >> >>> >     0x0020: 0001 0001 0002 0000 0377 7777 066f 736e .........www.osn
> > >> >>> >     0x0030: 6577 7303 636f 6d00 0001 0001 c00c 0001 ews.com.........
> > >> >>> >     0x0040: 0001 0000 06cf 0004 4a56 1f9f c010 0002 ........JV......
> > >> >>> >     0x0050: 0001 0000 06cf 0011 036e 7332 0773 7765 .........ns2.swe
> > >> >>> >     0x0060: 6c74 6572 036e 6574 00c0 1000 0200 0100 lter.net........
> > >> >>> >     0x0070: 0006 cf00 0603 6e73 31c0 40 ......ns1.@
> > >> >>> > 20:21:56.173901 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 135)
> > >> >>> >     127.0.0.1.53 > 127.0.0.1.47304: [bad udp cksum 0xfe86 -> 0x8ecb!] 55396 q: AAAA? www.osnews.com. 1/2/0 www.osnews.com. [54m44s] AAAA 2607:f0d0:1002:62::3 ns: osnews.com. [29m3s] NS ns1.swelter.net., osnews.com. [29m3s] NS ns2.swelter.net. (107)
> > >> >>> >     0x0000: 4500 0087 0000 4000 4011 3c64 7f00 0001 E.....@.@.<d....
> > >> >>> >     0x0010: 7f00 0001 0035 b8c8 0073 fe86 d864 8180 .....5...s...d..
> > >> >>> >     0x0020: 0001 0001 0002 0000 0377 7777 066f 736e .........www.osn
> > >> >>> >     0x0030: 6577 7303 636f 6d00 001c 0001 c00c 001c ews.com.........
> > >> >>> >     0x0040: 0001 0000 0cd4 0010 2607 f0d0 1002 0062 ........&......b
> > >> >>> >     0x0050: 0000 0000 0000 0003 c010 0002 0001 0000 ................
> > >> >>> >     0x0060: 06cf 0011 036e 7331 0773 7765 6c74 6572 .....ns1.swelter
> > >> >>> >     0x0070: 036e 6574 00c0 1000 0200 0100 0006 cf00 .net............
> > >> >>> >     0x0080: 0603 6e73 32c0 4c ..ns2.L
> > >> >>> > This is the only DNS traffic I saw during the attempts. The tcpdumps
> > >> >>> > show bad UDP checksums, but when I disabled TFO in polipo, the UDP
> > >> >>> > checksums were still bad, yet the lookups worked.
> > >> >>> > Really weird.
> > >> >>> > p.s. UPNP still works for port forwarding negotiation as it did in
> > >> >>> > 3.6.11-4
> > >> >>> > I still couldn't get the UPNP/SSDP broadcasts (udp to
> > >> >>> > 239.255.255.250) to be forwarded between se00 and sw00/sw10. Last
> > >> >>> > time it worked was ~3.3.8.
> > >> >>> > I'm starting not to question why it doesn't work, I'm starting to
> > >> >>> > wonder why it did work then ;-)
> > >> >>> > Regards,
> > >> >>> > Maciej
> > >> >>> > On Fri, Jan 4, 2013 at 6:33 PM, Dave Taht <dave.taht at gmail.com>
> > >> >>> > wrote:
> > >> >>> >>
> > >> >>> >> On Fri, Jan 4, 2013 at 9:27 AM, Eric Dumazet <edumazet at google.com>
> > >> >>> >> wrote:
> > >> >>> >> > Sorry, could you give us a copy of the panic stack trace ?
> > >> >>> >>
> > >> >>> >> I will get a serial console up on a wndr3800 by sunday. (sorry,
> > >> >>> >> just landed in california, am in disarray)
> > >> >>> >>
> > >> >>> >> The latest dev build of cero for the wndr3800 and wndr3700v2 is at:
> > >> >>> >>
> > >> >>> >> http://snapon.lab.bufferbloat.net/~cero2/cerowrt/wndr/3.7.1-1/
> > >> >>> >>
> > >> >>> >> --
> > >> >>> >> Dave Täht
> > >> >>> >>
> > >> >>> >> Fixing bufferbloat with cerowrt:
> > >> >>> >> http://www.teklibre.com/cerowrt/subscribe.html
> > >> >>> >> _______________________________________________
> > >> >>> >> Cerowrt-devel mailing list
> > >> >>> >> Cerowrt-devel at lists.bufferbloat.net
> > >> >>> >> https://lists.bufferbloat.net/listinfo/cerowrt-devel
> > >> >>> >
> > >> >>> >
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> --
> > >> >>> Dave Täht
> > >> >>>
> > >> >>> Fixing bufferbloat with cerowrt:
> > >> >>> http://www.teklibre.com/cerowrt/subscribe.html
> > >> >>
> > >> >>
> > >> >
>
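
For reference, the tcp_fastopen value of 3 used throughout the thread is the
bitwise OR of the client and server enable flags listed by Jerry above. A
minimal sketch of that arithmetic in C, with the flag values copied from the
quoted list; tfo_sysctl_value() and tfo_server_enabled() are hypothetical
helper names, not kernel functions.

```c
#include <stdbool.h>

/* Flag values copied from the sysctl_tcp_fastopen list quoted in this thread. */
#define TFO_CLIENT_ENABLE 1
#define TFO_SERVER_ENABLE 2

/* Hypothetical helper: build the value written to
 * /proc/sys/net/ipv4/tcp_fastopen from the two basic roles. */
static unsigned int tfo_sysctl_value(bool client, bool server)
{
	unsigned int value = 0;

	if (client)
		value |= TFO_CLIENT_ENABLE;
	if (server)
		value |= TFO_SERVER_ENABLE;
	return value;
}

/* Hypothetical helper: test whether a sysctl value has the server bit set. */
static bool tfo_server_enabled(unsigned int sysctl_value)
{
	return (sysctl_value & TFO_SERVER_ENABLE) != 0;
}
```

Under these definitions a value of 1 enables only the client side, so a box
configured that way would not act as a TFO server; the crash reported here
needs the server bit, which 3 includes.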