From: Robert Bradley
To: Dave Taht
Cc: Felix Fietkau
Date: Mon, 28 Jan 2013 13:43:34 +0000
Subject: Re: [Cerowrt-devel] deployed some cero this weekend, chasing checksums
List-Id: Development issues regarding the cerowrt test router project

It looks more like data corruption of various forms as opposed to a fault
in checksumming:

- Truncation of some layer-4 data including headers to 75 octets
- Some bad TCP packets have stored header lengths of 0 octets
- I often see lines of incrementing bytes (30 31 32 etc.). For example,
  packet 962 has a train of values from 0x10 to 0x2f, starting at position
  0x003a (the TCP timestamps). I think these are meant to be fragments
  from the ping packets (which contain 8 octets then values 0x10 to 0x37),
  but these are straying into non-ICMP packets.
- There are pieces of HTTP in non-HTTP protocols. For example, packet 1394
  is supposed to be UDP, but looks like it is really TCP traffic with the
  wrong protocol number. The checksum is still invalid in either case.
- It is possible to corrupt layer-4 checksums only, leaving the IP layer
  untouched.

On 28 January 2013 07:52, Dave Taht wrote:

> Put up a pic http://snapon.lab.bufferbloat.net/~d/yurt
>
> they aren't bad all the time, but when they go bad, bad things happen.
>
> On Sun, Jan 27, 2013 at 11:41 PM, Dave Taht wrote:
>
>> I have been debugging some weirdness for a while. You might want to do
>> some captures on the latest cero and look at checksums.
>>
>> An unreasonably high number of checksum issues seem to be happening, but
>> there doesn't appear to be a whole lot of pattern to it, as yet.
>>
>> I will simplify. I pinged locally and 8.8.8.8 and surfed the web, and a
>> symptom is that some other routers can't ping sometimes nor access much
>> of the internet beyond the gateway. They can always reach the gateway.
>>
>> in the interim, the topology on this capture is
>>
>> 172.30.102.17 - laptop via ethernet to
>> 172.20.102.1 - cerowrt 3.7.4-4 via ethernet to
>> 172.20.6.1 - ubnt 3.3.8-26 via mesh to
>> 172.20.142.11 - ubnt 3.7.4-4 via ethernet to
>> * 192.168.100.1 - cerowrt 3.7.2 capture point (yes, updating that)
>> 10.0.10.1 - comcast box (yes, double nat, fixing that)
>>
>> I took a capture on the se00 interface
>>
>> tcpdump -i se00 -w/tmp/yurt.cap host 172.20.102.17
>>
>> and stuck that capture there:
>>
>> http://snapon.lab.bufferbloat.net/~d/yurt/yurt.cap
>>
>> and then looked at it with wireshark with this filter
>>
>> ip.checksum_bad == 1
>>
>> and scratched my head at the error rate (about 1%) and the pattern (lack
>> thereof)
>>
>> I will simplify in the morning
>>
>> --
>> Dave Täht
>>
>> Fixing bufferbloat with cerowrt:
>> http://www.teklibre.com/cerowrt/subscribe.html
>
> --
> Dave Täht
>
> Fixing bufferbloat with cerowrt:
> http://www.teklibre.com/cerowrt/subscribe.html

--
Robert Bradley
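
For anyone who wants to cross-check the IP-layer verdict from the Wireshark filter quoted above without the GUI, here is a minimal Python sketch of the RFC 1071 Internet checksum applied to an IPv4 header. The function names are made up for illustration, and it assumes you already have the raw bytes starting at the IP header (not at the Ethernet frame). It only covers the IP header checksum; TCP and UDP checksums also need a pseudo-header and are not handled here.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071: one's-complement sum of 16-bit big-endian words, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_ok(packet: bytes) -> bool:
    """Hypothetical helper: True if the IPv4 header checksum verifies.

    Summing a header that already contains a correct checksum yields 0.
    """
    if len(packet) < 20:
        return False
    ihl = (packet[0] & 0x0F) * 4  # header length in octets
    if ihl < 20 or len(packet) < ihl:
        return False  # e.g. a corrupted or truncated header
    return internet_checksum(packet[:ihl]) == 0


# Illustrative header (made-up addresses 192.168.0.1 -> 192.168.0.199, UDP)
good = bytes.fromhex("450000730000400040 11b861c0a80001c0a800c7".replace(" ", ""))
bad = good[:8] + b"\x41" + good[9:]  # flip one bit in the TTL field
print(ipv4_header_ok(good), ipv4_header_ok(bad))
```

A header that passes this check while the layer-4 checksum fails would be consistent with the last bullet in the reply, where only the layer-4 data is corrupted.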
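
The trains of incrementing bytes described in the reply (values such as 0x10 to 0x2f turning up inside non-ICMP packets) could also be hunted for mechanically. This is a rough Python sketch, assuming the packet bytes have already been extracted from the capture by some other tool. The name incrementing_runs and the minimum run length of 8 are arbitrary choices for illustration, not anything taken from the capture analysis itself.

```python
def incrementing_runs(payload: bytes, min_len: int = 8):
    """Return (offset, length, first_byte) for each run in which every
    byte equals the previous byte plus one (wrapping 0xff -> 0x00)."""
    runs = []
    start = 0
    for i in range(1, len(payload) + 1):
        # A run ends at the end of the data or where the +1 pattern breaks.
        if i == len(payload) or payload[i] != (payload[i - 1] + 1) & 0xFF:
            length = i - start
            if length >= min_len:
                runs.append((start, length, payload[start]))
            start = i
    return runs


# Synthetic example: four filler bytes, then 0x10..0x2f, then two zero bytes.
sample = b"\xaa" * 4 + bytes(range(0x10, 0x30)) + b"\x00\x00"
for offset, length, first in incrementing_runs(sample):
    print(f"offset 0x{offset:04x}: {length} bytes starting at 0x{first:02x}")
```

Running something like this over every non-ICMP packet in a capture would give a quick count of how often ping-style payload leaks into other traffic, though a long run can also occur legitimately in some protocols, so hits need to be inspected by hand.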