From: Toke Høiland-Jørgensen
To: "Rodney W. Grimes" <4bone@gndrsh.dnsmgr.net>
Cc: Luca Muscariello, Rich Brown, ECN-Sane
Subject: Re: [Ecn-sane] Meanwhile, over on NANOG...
Date: Wed, 13 Nov 2019 11:43:21 +0100
Message-ID: <875zjooy46.fsf@toke.dk>
In-Reply-To: <201911130004.xAD04Vx6041534@gndrsh.dnsmgr.net>
List-Id: Discussion of explicit congestion notification's impact on the Internet

"Rodney W.
Grimes" <4bone@gndrsh.dnsmgr.net> writes:

>> Toke Høiland-Jørgensen writes:
>>
>> > Luca Muscariello writes:
>> >
>> >> On Tue, Nov 12, 2019 at 2:02 PM Toke Høiland-Jørgensen wrote:
>> >>
>> >>> Mikael Abrahamsson writes:
>> >>>
>> >>> > On Tue, 12 Nov 2019, Toke Høiland-Jørgensen wrote:
>> >>> >
>> >>> >> I'm not on the nanog list, but feel free to cross-post; would be good to
>> >>> >> actually get to the bottom of this issue! Marek and I already had an
>> >>> >> off-list back-and-forth after that original thread, and we couldn't find
>> >>> >> anything wrong on the Cloudflare side. And the RSTs have a higher TTL
>> >>> >> than the actual traffic, indicating an in-path problem...
>> >>> >
>> >>> > tcptraceroute supports setting/clearing ECN bits (-E), would be very
>> >>> > interesting to see difference between those tcptraceroutes?
>> >>>
>> >>> No difference. But the RST is not being sent as a response to the SYN;
>> >>> it is sent in response to the first data packet...
>> >>>
>> >>> ... and now that I'm re-testing, things were working for a little while,
>> >>> but now the bug is back. I got an intermittent successful connection
>> >>> with the same TTL that I was previously getting the RST from. And now
>> >>> I'm back to getting RSTed.
>> >>>
>> >>> So I guess there's some kind of multipath issue here; ECMP path,
>> >>> multiple routing upstreams, or a broken load balancer? Any other ideas?
>> >>>
>> >>
>> >>
>> >> It makes me think of some usage of anycast TCP on the cloudflare side.
>> >> What service is this Toke?
>> >
>> > Yeah, I did also think about anycast when I said "multiple routing
>> > upstreams". For testing I've just been doing 'curl 1.1.1.1'. But
>> > Cloudflare-hosted sites in general seem to have this problem; for
>> > instance, 'curl -4 bufferbloat.net' also fails (but IPv6 is fine).
>>
>> Right, so I've played around with tcptraceroute a bit more, and looked
>> at some more packet dumps, and I think I'm starting to form a theory:
>>
>> I get two different traceroutes; this was from running two traceroutes
>> right after one another:
>>
>> $ sudo tcptraceroute 1.1.1.1
>> Selected device eth0, address 10.42.3.130, port 42177 for outgoing packets
>> Tracing the path to 1.1.1.1 on TCP port 80 (http), 30 hops max
>>  1  10.42.3.1  0.318 ms  0.325 ms  0.321 ms
>>  2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  1.337 ms  5.390 ms  3.194 ms
>>  3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.319 ms  1.120 ms  1.256 ms
>>  4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.533 ms  1.612 ms  1.392 ms
>>  5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.787 ms  6.822 ms  6.721 ms
>>  6  149.6.142.130  7.000 ms  6.939 ms  6.948 ms
>>  7  one.one.one.one (1.1.1.1) [open]  6.957 ms  6.967 ms  6.893 ms
>>
>> $ sudo tcptraceroute 1.1.1.1
>> Selected device eth0, address 10.42.3.130, port 38681 for outgoing packets
>> Tracing the path to 1.1.1.1 on TCP port 80 (http), 30 hops max
>>  1  10.42.3.1  0.290 ms  0.287 ms  0.292 ms
>>  2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  1.857 ms  5.382 ms  18.654 ms
>>  3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.249 ms  1.121 ms  1.521 ms
>>  4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.375 ms  2.495 ms  1.440 ms
>>  5  dix.as13335.net (192.38.7.70)  2.093 ms  1.895 ms  1.790 ms
>>  6  one.one.one.one (1.1.1.1) [open]  1.783 ms  1.861 ms  1.817 ms
>>
>>
>> Notice how one is one hop longer than the other.
>
> Worse than just longer, it appears as if the exit hop from gigabit.dk
> goes to 2 different providers (hop 4 above). If these are packets towards
> an anycast address that is going to cause exactly what you see. ECMP
> across multiple ASes towards anycast is.. umm.. very fragile and you're
> seeing one of the problems with anycast.
>
> It is very unlikely that he.net and cogentco.com end up at the same
> 1.1.1.1 box.

Yeah, did notice it was two different upstreams :)

>> So definitely something
>> to do with anycast; maybe ECMP over both paths since it's changing
>> pretty often?
>
> And the multipath is set to round robin perhaps?

Not round-robin. That it was changing simply at random turns out to be my
mistake; by default tcptraceroute will pick a new source port each time.
If I fix the source port I get the same path each time, so it looks like
it's hashing on headers.

Going back to regular UDP-based traceroute I finally found what looks to
be the smoking gun:

$ traceroute 1.1.1.1 -q 1 --sport=10000 -t 1
traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.304 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  3.935 ms
 3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.005 ms
 4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.361 ms
 5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.250 ms
 6  one.one.one.one (1.1.1.1)  1.380 ms

$ traceroute 1.1.1.1 -q 1 --sport=10000 -t 2
traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.236 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  53.833 ms
 3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.195 ms
 4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.979 ms
 5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.851 ms
 6  149.6.142.130 (149.6.142.130)  13.081 ms
 7  one.one.one.one (1.1.1.1)  1.842 ms

-t sets the TOS value, so those two runs happen to correspond to ECT(1)
and ECT(0), and as you can see they take two different paths. That would
be consistent with the SYN going one way and the data packets going
another.

-Toke
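
PS: To make the TOS-to-ECN mapping above concrete, here is a minimal
Python sketch. It splits the TOS byte into DSCP and the two ECN bits,
and shows how a path selector that hashes the whole TOS byte can send
otherwise identical packets down different paths. The pick_path()
function, its CRC32 hash and the choice of hashed fields are all
assumptions made for illustration; nothing in this thread shows what
the actual router hashes on.

```python
import zlib

# ECN codepoints: the low two bits of the TOS / Traffic Class byte (RFC 3168)
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos):
    """Split the 8-bit TOS byte into (DSCP value, ECN codepoint name)."""
    return tos >> 2, ECN_NAMES[tos & 0b11]


def pick_path(src, dst, sport, dport, tos, n_paths=2, mask_ecn=False):
    """Hypothetical ECMP selector: hash header fields, modulo path count.

    With mask_ecn=True the two ECN bits are cleared before hashing, so
    the ECN marking can no longer influence the chosen path.
    """
    if mask_ecn:
        tos &= 0xFC  # keep the DSCP bits, drop the ECN bits
    key = f"{src}|{dst}|{sport}|{dport}|{tos}".encode()
    return zlib.crc32(key) % n_paths


if __name__ == "__main__":
    # Same addresses and ports as the traceroute runs; only TOS varies.
    for tos in (0, 1, 2):
        dscp, ecn = split_tos(tos)
        raw = pick_path("10.42.3.130", "1.1.1.1", 10000, 80, tos)
        masked = pick_path("10.42.3.130", "1.1.1.1", 10000, 80, tos,
                           mask_ecn=True)
        print(f"tos={tos} dscp={dscp} ecn={ecn} path={raw} masked={masked}")
```

Under RFC 3168 the SYN is sent Not-ECT (TOS 0) while data packets on an
ECN-enabled connection carry ECT(0) (TOS 2), so a hash that includes the
ECN bits may place the two on different paths. Masking those bits before
hashing keeps the whole flow on one path regardless of marking.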