[Bloat] The Dark Problem with AQM in the Internet?

Thu Aug 28 21:59:25 EDT 2014

On Thu, 28 Aug 2014, Jerry Jongerius wrote:

> Yes, WireShark shows that *only* one packet gets lost.  Regardless of RWIN
> size.  The RWIN size can be below the BDP (no measurable queuing within the
> CMTS).  Or, the RWIN size can be very large, causing significant queuing
> within the CMTS.  With a larger RWIN value, the single dropped packet
> typically happens sooner in the download, rather than later.  The fact there
> is no "burst loss" is a significant clue.

Did you check to see if packets were re-sent even if they weren't lost? One of 
the side effects of excessive buffering is that it's possible for a packet to be 
held in the buffer long enough that the sender thinks that it's been lost and 
retransmits it, so the packet is effectively 'lost' even if it actually arrives 
at its destination.
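
To make that effect concrete, here is a minimal sketch in Python. The buffer
depth, drain rate, base RTT, and RTO are made-up numbers, not measurements from
this trace, and the function names are hypothetical. It also assumes a fixed
RTO, whereas a real sender adapts its timer to the RTT samples it sees.

```python
def queueing_delay_s(buffered_bytes, drain_rate_bps):
    """Time for a packet behind `buffered_bytes` of queue to reach the link."""
    return buffered_bytes * 8 / drain_rate_bps


def is_spuriously_retransmitted(buffered_bytes, drain_rate_bps, base_rtt_s, rto_s):
    """True if the ACK cannot get back before the sender's retransmit timer fires."""
    return base_rtt_s + queueing_delay_s(buffered_bytes, drain_rate_bps) > rto_s


# Hypothetical: 2 MB queued ahead of the packet, 50 Mbit/s link,
# 30 ms base RTT, 300 ms RTO (all assumed values).
delay = queueing_delay_s(2_000_000, 50_000_000)
print(f"queueing delay: {delay * 1000:.0f} ms")
print(is_spuriously_retransmitted(2_000_000, 50_000_000, 0.030, 0.300))
```

With these assumed numbers the packet sits in the queue longer than the timer
allows, so the sender retransmits it even though the original copy is
eventually delivered.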

David Lang

> The graph is fully explained by the Westwood+ algorithm that the server is
> using.  If you input the data observed into the Westwood+ bandwidth
> estimator, you end up with the rate seen in the graph after the packet loss
> event.  The reason the rate gets limited (no ramp up) is due to Westwood+
> behavior on a RTO.  And the reason there is the RTO is due to the bufferbloat,
> and the timing of the lost packet in relation to when the bufferbloat
> starts.  When there is no RTO, I see the expected drop (to the Westwood+
> bandwidth estimate) and ramp back up.  On a RTO, Westwood+ sets both
> ssthresh and cwnd to its bandwidth estimate.
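
The RTO behaviour described above can be sketched as follows. This is an
illustration under stated assumptions, not the server's actual code: the
function names are hypothetical, the window is taken to be the bandwidth
estimate times the minimum RTT expressed in segments, the two-segment floor is
an assumption, and the input values are invented.

```python
def westwood_window(bw_est_bytes_per_s, rtt_min_s, mss_bytes):
    """Window (in segments) matching the estimated bandwidth at the minimum RTT.

    The floor of 2 segments is an assumption of this sketch.
    """
    return max(2, int(bw_est_bytes_per_s * rtt_min_s / mss_bytes))


def on_rto(bw_est_bytes_per_s, rtt_min_s, mss_bytes):
    """On a retransmit timeout, set both ssthresh and cwnd from the estimate,
    as the message above describes."""
    window = westwood_window(bw_est_bytes_per_s, rtt_min_s, mss_bytes)
    return {"ssthresh": window, "cwnd": window}


# Hypothetical inputs: 45 Mbit/s estimate, 20 ms minimum RTT, 1460-byte MSS.
state = on_rto(45_000_000 / 8, 0.020, 1460)
print(state)
```

Because cwnd starts at ssthresh after the timeout, there is no slow-start
phase left to run, which is consistent with the flat rate (no ramp up)
reported after the loss.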
>
> The PC does SACK, the server does not, so SACK is not used.  Timestamps are off.
>
> - Jerry
>
>
> -----Original Message-----
> From: Jonathan Morton [mailto:chromatix99 at gmail.com]
> Sent: Thursday, August 28, 2014 10:08 AM
> To: Jerry Jongerius
> Cc: 'Greg White'; 'Sebastian Moeller'; bloat at lists.bufferbloat.net
> Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
>
>
> On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:
>
>> AQM is a great solution for bufferbloat.  End of story.  But if you want
>> to track down which device in the network intentionally dropped a packet
>> (when many devices in the network path will be running AQM), how are you
>> going to do that?  Or how do you propose to do that?
>
> We don't plan to do that.  Not from the outside.  Frankly, we can't reliably
> tell which routers drop packets today, when AQM is not at all widely
> deployed, so that's no great loss.
>
> But if ECN finally gets deployed, AQM can set the Congestion Experienced
> flag instead of dropping packets, most of the time.  You still don't get to
> see which router did it, but the packet still gets through and the TCP
> session knows what to do about it.
>
>> The graph presented is caused by the interaction of a single dropped packet,
>> bufferbloat, and the Westwood+ congestion control algorithm - and not power
>> boost.
>
> This surprises me somewhat - Westwood+ is supposed to be deliberately
> tolerant of single packet losses, since it was designed explicitly to get
> around the problem of slight random loss on wireless networks.
>
> I'd be surprised if, in fact, *only* one packet was lost.  The more usual
> case is of "burst loss", where several packets are lost in quick succession,
> and not necessarily consecutive packets.  This tends to happen repeatedly on
> dumb drop-tail queues, unless the buffer is so large that it accommodates
> the entire receive window (which, for modern OSes, is quite impressive in a
> dark sort of way).  Burst loss is characteristic of congestion, whereas
> random loss tends to lose isolated packets, so it would be much less
> surprising for Westwood+ to react to it.
>
> The packets were lost in the first place because the queue became
> chock-full, probably at just about the exact moment when the PowerBoost
> allowance ran out and the bandwidth came down (which tends to cause the
> buffer to fill rapidly), so you get the worst-case scenario: the buffer at
> its fullest, and the bandwidth draining it at its minimum.  This maximises
> the time before your TCP gets to even notice the lost packet's nonexistence,
> during which the sender keeps the buffer full because it still thinks
> everything's fine.
>
> What is probably happening is that the bottleneck queue, being so large,
> delays the retransmission of the lost packet until the Retransmit Timer
> expires.  This will cause Reno-family TCPs to revert to slow-start, assuming
> (rightly in this case) that the characteristics of the channel have changed.
> You can see that it takes most of the first second for the sender to ramp up
> to full speed, and nearly as long to ramp back up to the reduced speed, both
> of which are characteristic of slow-start at WAN latencies.  NB: during
> slow-start, the buffer remains empty as long as the incoming data rate is
> less than the output capacity, so latency is at a minimum.
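
A rough way to estimate how long such a slow-start ramp takes is to count the
round trips needed for the congestion window to double its way up to the
bandwidth-delay product. The sketch below assumes an idealised doubling every
RTT with no losses or delayed ACKs; the initial window of 10 segments, the
link rate, the RTT, and the MSS are all assumed values, and the function name
is hypothetical.

```python
import math


def slow_start_rtts(target_cwnd_segments, initial_cwnd_segments=10):
    """Round trips for cwnd to reach the target when it doubles every RTT."""
    if target_cwnd_segments <= initial_cwnd_segments:
        return 0
    return math.ceil(math.log2(target_cwnd_segments / initial_cwnd_segments))


# Hypothetical path: 100 Mbit/s, 50 ms RTT, 1460-byte MSS.
rtt_s = 0.050
bdp_segments = 100_000_000 / 8 * rtt_s / 1460
rtts = slow_start_rtts(bdp_segments)
print(f"{bdp_segments:.0f} segments, {rtts} RTTs, about {rtts * rtt_s:.2f} s")
```

Real stacks ramp more slowly than this ideal, so a figure of this kind is
best read as a lower bound on the ramp time.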
>
> Do you have TCP SACK and timestamps turned on?  Those usually allow minor
> losses like that to be handled more gracefully - the sending TCP gets a
> better idea of the RTT (allowing it to set the Retransmit Timer more
> intelligently), and would be able to see that progress is still being made
> with the backlog of buffered packets, even though the core TCP ACK is not
> advancing.  In the event of burst loss, it would also be able to retransmit
> the correct set of packets straight away.
>
> What AQM would do for you here - if your ISP implemented it properly - is to
> eliminate the negative effects of filling that massive buffer at your ISP.
> It would allow the sending TCP to detect and recover from any packet loss
> more quickly, and with ECN turned on you probably wouldn't even get any
> packet loss.
>
> What's also interesting is that, after recovering from the change in
> bandwidth, you get smaller bursts of about 15-40KB arriving at roughly
> half-second intervals, mixed in with the relatively steady 1-, 2- and
> 3-packet stream.  That is characteristic of low-level packet loss with a
> low-latency recovery.
>
> This either implies that your ISP has stuck you on a much shorter buffer for
> the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is
> enforcing a smaller congestion window on you after having suffered a
> slow-start recovery.  The latter restricts your bandwidth to match the
> delay-bandwidth product, but happily the "delay" in that equation is at a
> minimum if it keeps your buffer empty.
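
The relationship between a fixed congestion window, the round-trip delay, and
the resulting rate can be sketched as below. The RTT and MSS values are
assumptions chosen only for illustration (the trace does not state them), and
the function names are hypothetical.

```python
def window_limited_rate_bps(cwnd_segments, mss_bytes, rtt_s):
    """Throughput ceiling when the sender may have only cwnd segments in flight."""
    return cwnd_segments * mss_bytes * 8 / rtt_s


def window_for_rate(rate_bps, mss_bytes, rtt_s):
    """Window (in segments) needed to sustain a given rate at a given RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


# Assumed: 1460-byte MSS, 20 ms RTT with an empty buffer,
# 200 ms RTT when the buffer is bloated.
cwnd = int(window_for_rate(45_000_000, 1460, 0.020))
print(cwnd, "segments")
print(f"{window_limited_rate_bps(cwnd, 1460, 0.020) / 1e6:.1f} Mbit/s at 20 ms")
print(f"{window_limited_rate_bps(cwnd, 1460, 0.200) / 1e6:.1f} Mbit/s at 200 ms")
```

The same window delivers far less when the RTT is inflated by queueing, which
is why a capped window performs best when it keeps the buffer empty.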
>
> And frankly, you're still getting 45Mbps under those conditions.  Many
> people would kill for that sort of performance - although they'd probably
> then want to kill everyone in the Comcast call centre later on.
>
> - Jonathan Morton
>