From: "Jerry Jongerius"
To: "'Jonathan Morton'"
Cc: bloat@lists.bufferbloat.net
Date: Thu, 28 Aug 2014 13:20:13 -0400
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

Jonathan,

Yes, Wireshark shows that *only* one packet gets lost, regardless of RWIN size. The RWIN can be below the BDP (no measurable queuing within the CMTS), or it can be very large, causing significant queuing within the CMTS. With a larger RWIN value, the single dropped packet typically happens sooner in the download rather than later. The fact that there is no "burst loss" is a significant clue.

The graph is fully explained by the Westwood+ algorithm that the server is using. If you feed the observed data into the Westwood+ bandwidth estimator, you end up with the rate seen in the graph after the packet loss event. The rate stays limited (no ramp-up) because of Westwood+ behavior on an RTO, and the RTO occurs because of the bufferbloat and the timing of the lost packet relative to when the bufferbloat starts. When there is no RTO, I see the expected drop (to the Westwood+ bandwidth estimate) and the ramp back up. On an RTO, Westwood+ sets both ssthresh and cwnd to its bandwidth estimate.
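A minimal sketch of that mechanism, assuming the standard Westwood+ low-pass filter from the literature; the class, names, and constants here are illustrative, not the server's actual code:

# Illustrative sketch of the Westwood+ behavior described above.
# The 7/8 filter gain follows the published algorithm; everything
# else is an assumption for illustration.

class WestwoodPlusSketch:
    ALPHA = 7 / 8  # EWMA smoothing factor for the bandwidth filter

    def __init__(self, mss_bytes, rtt_min):
        self.mss = mss_bytes      # segment size, bytes
        self.rtt_min = rtt_min    # minimum RTT observed, seconds
        self.bw_est = 0.0         # smoothed bandwidth estimate, bytes/sec

    def on_bandwidth_sample(self, acked_bytes, interval):
        # Once per RTT: rate sample from bytes ACKed, then low-pass filter.
        sample = acked_bytes / interval
        self.bw_est = self.ALPHA * self.bw_est + (1 - self.ALPHA) * sample

    def on_rto(self):
        # On a retransmission timeout, ssthresh (and, per the behavior
        # observed here, cwnd as well) is set to the window implied by
        # the estimate: bandwidth times minimum RTT, in segments.
        return max(2, int(self.bw_est * self.rtt_min / self.mss))

Feed it the rates observed while the queue is bloated, and the window it returns matches the flat post-loss rate in the graph; with cwnd equal to ssthresh, the connection sits in congestion avoidance and never ramps back up.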
The PC supports SACK; the server does not, so SACK is not used. Timestamps are off.

- Jerry

-----Original Message-----
From: Jonathan Morton [mailto:chromatix99@gmail.com]
Sent: Thursday, August 28, 2014 10:08 AM
To: Jerry Jongerius
Cc: 'Greg White'; 'Sebastian Moeller'; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:

> AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?

We don't plan to do that - not from the outside. Frankly, we can't reliably tell which routers drop packets today, when AQM is not at all widely deployed, so that's no great loss. But if ECN finally gets deployed, AQM can set the Congestion Experienced flag instead of dropping packets most of the time. You still don't get to see which router did it, but the packet still gets through, and the TCP session knows what to do about it.

> The graph presented is caused by the interaction of a single dropped packet, bufferbloat, and the Westwood+ congestion control algorithm - and not PowerBoost.

This surprises me somewhat - Westwood+ is supposed to be deliberately tolerant of single packet losses, since it was designed explicitly to get around the problem of slight random loss on wireless networks.

I'd be surprised if, in fact, *only* one packet was lost. The more usual case is "burst loss", where several packets are lost in quick succession, though not necessarily consecutively. This tends to happen repeatedly on dumb drop-tail queues, unless the buffer is so large that it accommodates the entire receive window (which, for modern OSes, is quite impressive in a dark sort of way). Burst loss is characteristic of congestion, whereas random loss tends to take isolated packets, so it would be much less surprising for Westwood+ to react to burst loss.

The packets were lost in the first place because the queue became chock-full, probably at just about the moment when the PowerBoost allowance ran out and the bandwidth came down (which tends to make the buffer fill rapidly). So you get the worst-case scenario: the buffer at its fullest, and the bandwidth draining it at its minimum. This maximises the time before your TCP even notices the lost packet's nonexistence, during which the sender keeps the buffer full because it still thinks everything is fine.

What is probably happening is that the bottleneck queue, being so large, delays the retransmission of the lost packet until the Retransmit Timer expires. This causes Reno-family TCPs to revert to slow-start, assuming (rightly, in this case) that the characteristics of the channel have changed. You can see that it takes most of the first second for the sender to ramp up to full speed, and nearly as long to ramp back up to the reduced speed, both of which are characteristic of slow-start at WAN latencies. NB: during slow-start the buffer remains empty as long as the incoming data rate is less than the output capacity, so latency is at a minimum.
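To put rough numbers on that worst case - a back-of-the-envelope sketch in which the buffer size, drain rate, and base RTT are all assumed values, not measurements from this trace:

# Back-of-the-envelope arithmetic for the worst case described above.
# All three inputs are assumptions for illustration, not measurements.

buffer_bytes = 1 * 1024 * 1024   # assume ~1 MB of bloat in the CMTS queue
drain_rate   = 45e6 / 8          # post-PowerBoost rate: 45 Mbps, in bytes/sec
base_rtt     = 0.050             # assumed unloaded path RTT, seconds

queue_delay   = buffer_bytes / drain_rate   # time to drain the full buffer
effective_rtt = base_rtt + queue_delay      # RTT seen while the queue is full

print(f"queue delay   ~= {queue_delay * 1000:.0f} ms")    # ~186 ms
print(f"effective RTT ~= {effective_rtt * 1000:.0f} ms")  # ~236 ms

# Every recovery signal has to ride through that full queue, so the sender
# learns of the drop roughly a quarter of a second late - long enough for a
# Retransmit Timer armed from earlier, shorter RTT samples to expire first.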
Do you have TCP SACK and timestamps turned on? Those usually allow minor losses like this to be handled more gracefully: the sending TCP gets a better idea of the RTT (allowing it to set the Retransmit Timer more intelligently), and it can see that progress is still being made on the backlog of buffered packets even though the cumulative ACK is not advancing. In the event of burst loss, it can also retransmit the correct set of packets straight away.

What AQM would do for you here - if your ISP implemented it properly - is eliminate the negative effects of filling that massive buffer at your ISP. It would allow the sending TCP to detect and recover from any packet loss more quickly, and with ECN turned on you probably wouldn't even get any packet loss.

What's also interesting is that, after recovering from the change in bandwidth, you get smaller bursts of about 15-40KB arriving at roughly half-second intervals, mixed in with the relatively steady 1-, 2- and 3-packet stream. That is characteristic of low-level packet loss with low-latency recovery. It implies either that your ISP has put you on a much shorter buffer for the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is enforcing a smaller congestion window on you after having suffered a slow-start recovery. The latter restricts your bandwidth to the delay-bandwidth product, but happily the "delay" in that equation stays at its minimum as long as the congestion window keeps your buffer empty.

And frankly, you're still getting 45Mbps under those conditions. Many people would kill for that sort of performance - although they'd probably then want to kill everyone in the Comcast call centre later on.

- Jonathan Morton
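For a sense of scale on that delay-bandwidth point - a sketch with assumed round numbers, since neither the post-recovery window nor the RTT appears in the thread:

# Delay-bandwidth product sketch for the post-recovery regime.
# Both inputs are assumed round numbers for illustration only.

rtt  = 0.050        # seconds: near-minimum RTT while the buffer stays empty
cwnd = 275 * 1024   # bytes: a hypothetical post-recovery congestion window

throughput_bps = cwnd / rtt * 8        # the window is drained once per RTT
print(f"{throughput_bps / 1e6:.0f} Mbps")   # ~45 Mbps

With the window pinned, throughput scales only with 1/RTT, which is why keeping the buffer empty matters: any queuing delay added to the RTT comes straight off the transfer rate.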