[Bloat] The Dark Problem with AQM in the Internet?

Jerry Jongerius jerryj at duckware.com
Mon Sep 1 13:30:06 EDT 2014


Westwood+, as described in published research papers, does not fully
explain the graph that was seen.  However, Westwood+, as implemented in
Linux, DOES fully explain the graph that was seen.  One place to review the
source code is here:

http://lxr.free-electrons.com/source/net/ipv4/tcp_westwood.c?v=3.2

Some observations about this code (see the sketch after this list):

1. The bandwidth estimate is run through a “(7×prev+new)/8” filter TWICE
[see lines 93-94].
2. The unit of time for all quantities in the code (rtt, bwe, delta, etc.)
is ‘jiffies’, not milliseconds or microseconds [see line 108].
3. The bandwidth estimate is updated once every “rtt”, with the test in the
code (line 139) being essentially: delta > rtt.  However, “rtt” is the last
unsmoothed RTT seen on the link (and it increases during bufferbloat).
When the RTT increases, the frequency of bandwidth updates drops.
4. The server is Linux 3.2 with HZ=100 (meaning jiffies increases every
10ms).
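
To make these observations concrete, here is a minimal standalone
paraphrase of the relevant logic in tcp_westwood.c (a sketch, not the
kernel code verbatim: the initial-sample seeding, the kernel's 50ms
minimum sample window, and all socket plumbing are omitted):

#include <stdint.h>

struct westwood {
        uint32_t bw_ns_est;  /* first-stage bandwidth estimate */
        uint32_t bw_est;     /* second-stage (smoothed) bandwidth estimate */
        uint32_t rtt_win_sx; /* start of current sample window (jiffies) */
        uint32_t bk;         /* bytes acked during the current window */
        uint32_t rtt;        /* last unsmoothed RTT sample (jiffies) */
};

/* Observation 1: the (7*prev+new)/8 low-pass filter, applied twice. */
static uint32_t westwood_do_filter(uint32_t prev, uint32_t sample)
{
        return (7 * prev + sample) / 8;
}

static void westwood_filter(struct westwood *w, uint32_t delta)
{
        w->bw_ns_est = westwood_do_filter(w->bw_ns_est, w->bk / delta);
        w->bw_est    = westwood_do_filter(w->bw_est, w->bw_ns_est);
}

/* Observations 2 and 3: everything is in jiffies, and a new bandwidth
 * sample is taken only once more than one (unsmoothed) RTT has elapsed,
 * so sampling slows down as bufferbloat inflates the RTT. */
static void westwood_update_window(struct westwood *w, uint32_t now)
{
        uint32_t delta = now - w->rtt_win_sx; /* jiffies */

        if (w->rtt && delta > w->rtt) {
                westwood_filter(w, delta);
                w->bk = 0;
                w->rtt_win_sx = now;
        }
}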

When you graph some of the raw data observed (see
http://www.duckware.com/blog/the-dark-problem-with-aqm-in-the-internet/images/chart.gif),
you can see that the Westwood+ bandwidth estimate takes significant time to
ramp up.

For the first 0.84 seconds of the download, we expect the Westwood+ code to
update the bandwidth estimate around 14 times, or once every 60ms or so.
However, after this, we know there is a bufferbloat episode, with RTTs
increasing (and therefore the frequency of bandwidth updates decreasing).
The red line in the graph above suggests that Westwood might have updated
the bandwidth estimate only around 9-10 more times before using it to set
cwnd/ssthresh.
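
To get a feel for how slowly the two-stage filter ramps up, here is a toy
calculation (an illustrative model only: it assumes a constant true
bandwidth, normalized to 1.0, and filter state starting from zero,
ignoring the kernel's first-sample seeding):

#include <stdio.h>

/* Feed a constant "true" bandwidth of 1.0 into the two-stage
 * (7*prev+new)/8 filter and print how far the final estimate
 * has ramped after each update. */
int main(void)
{
        double bw_ns_est = 0.0, bw_est = 0.0;

        for (int n = 1; n <= 24; n++) {
                bw_ns_est = (7.0 * bw_ns_est + 1.0) / 8.0;
                bw_est = (7.0 * bw_est + bw_ns_est) / 8.0;
                printf("after %2d updates: %3.0f%% of true bandwidth\n",
                       n, 100.0 * bw_est);
        }
        return 0;
}

Under this model, the estimate reaches only about 58% of the true
bandwidth after 14 updates, and about 84% after 24, which is consistent
with the slow ramp-up seen in the chart.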

- Jerry




-----Original Message-----
From: Jonathan Morton [mailto:chromatix99 at gmail.com] 
Sent: Saturday, August 30, 2014 2:46 AM
To: Stephen Hemminger
Cc: Jerry Jongerius; bloat at lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?


On 30 Aug, 2014, at 9:28 am, Stephen Hemminger wrote:

> On Sat, 30 Aug 2014 09:05:58 +0300
> Jonathan Morton <chromatix99 at gmail.com> wrote:
> 
>> 
>> On 29 Aug, 2014, at 5:37 pm, Jerry Jongerius wrote:
>> 
>>>> did you check to see if packets were re-sent even if they weren't 
>>>> lost? one of the side effects of excessive buffering is that it's
>>>> possible for a packet to be held in the buffer long enough that the 
>>>> sender thinks that it's been lost and retransmits it, so the packet 
>>>> is effectively 'lost' even if it actually arrives at its destination.
>>> 
>>> Yes.  A duplicate packet for the missing packet is not seen.
>>> 
>>> The receiver 'misses' a packet; starts sending out tons of dup acks 
>>> (for all packets in flight and queued up due to bufferbloat), and 
>>> then way later, the packet does come in (after the RTT caused by 
>>> bufferbloat, indicating it is the 'resent' packet).
>> 
>> I think I've cracked this one - the cause, if not the solution.
>> 
>> Let's assume, for the moment, that Jerry is correct and PowerBoost plays
>> no part in this.  That implies that the flow is not using the full
>> bandwidth after the loss, *and* that the additive increase of cwnd isn't
>> sufficient to recover to that point within the test period.
>> 
>> There *is* a sequence of events that can lead to that happening:
>> 
>> 1) Packet is lost, at the tail end of the bottleneck queue.
>> 
>> 2) Eventually, the receiver sees the loss and starts sending duplicate
>> acks (each triggering the CA_EVENT_SLOW_ACK path in the sender).  The
>> sender (running Westwood+) assumes that each of these represents a
>> received, full-size packet, for bandwidth estimation purposes.
>> 
>> 3) The receiver doesn't send, or the sender doesn't receive, a duplicate
>> ack for every packet actually received.  Maybe some firewall sees a large
>> number of identical packets arriving - without SACK or timestamps, they
>> *would* be identical - and filters some of them.  The bandwidth estimate
>> therefore becomes significantly lower than the true value, and
>> additionally the RTO fires and causes the sender to reset cwnd to 1
>> (CA_EVENT_LOSS).
>> 
>> 4) The retransmitted packet finally reaches the receiver, and the ack it
>> sends includes all the data received in the meantime (about 3.5MB).  This
>> is not sufficient to immediately reset the bandwidth estimate to the true
>> value, because the BWE is sampled at RTT intervals, and also includes
>> low-pass filtering.
>> 
>> 5) This ends the recovery phase (CA_EVENT_COMPLETE_CWR in the code), and
>> the sender resets the slow-start threshold to correspond to the estimated
>> delay-bandwidth product (MinRTT * BWE) at that moment.
>> 
>> 6) This estimated DBP is lower than the true value, so the subsequent
>> slow-start phase ends with the cwnd inadequately sized.  Additive
>> increase would eventually correct that - but the key word is
>> *eventually*.
>> 
>> - Jonathan Morton
> 
> Bandwidth estimation by ACK RTT is fraught with problems. The returning
> ACK can be delayed for any number of reasons such as other traffic or
> aggregation. This kind of delay-based congestion control suffers badly
> from any latency induced in the network.
> So instead of causing bloat, it gets hit by bloat.

In this case, the TCP is actually tracking RTT surprisingly well, but the
bandwidth estimate goes wrong because the duplicate ACKs go missing.  Note
that if the MinRTT was estimated too high (which is the only direction it
could go), this would result in the slow-start threshold being *higher* than
required, and the symptoms observed would not occur, since the cwnd would
grow to the required value after recovery.
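
For reference, here is the essence of how the slow-start threshold is
derived from those two estimates at the end of recovery (a paraphrase of
tcp_westwood_bw_rttmin() in tcp_westwood.c, with kernel types and
plumbing omitted):

#include <stdint.h>

/* ssthresh, in segments, is set to BWE * MinRTT / MSS, floored at two
 * segments.  If BWE is an underestimate - e.g. because duplicate acks
 * went missing - then ssthresh, and the cwnd set from it at the end of
 * recovery, comes out too small. */
static uint32_t westwood_bw_rttmin(uint32_t bw_est,  /* bytes per jiffy */
                                   uint32_t rtt_min, /* jiffies */
                                   uint32_t mss)     /* bytes */
{
        uint32_t segments = (bw_est * rtt_min) / mss;
        return segments > 2 ? segments : 2;
}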

This is the opposite effect from what happens to TCP Vegas in a bloated
environment.  Vegas stops increasing cwnd when the estimated RTT is
noticeably higher than MinRTT, but if the true MinRTT changes (or it has to
compete with a non-Vegas TCP flow), it has trouble tracking that fact.

There is another possibility:  that the assumption of non-queue RTT being
constant against varying bandwidth is incorrect.  If that is the case, then
the observed behaviour can be explained without recourse to lost duplicate
ACKs - so Westwood+ is correctly tracking both MinRTT and BWE - but (MinRTT
* BWE) turns out to be a poor estimate of the true BDP.  I think this still
fails to explain why the cwnd is reset (which should occur only on RTO), but
everything else potentially fits.
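
To put an illustrative number on that (hypothetical figures only): if the
no-queue RTT at full rate were 40ms while the smallest RTT ever sampled
was 20ms, then (MinRTT * BWE) would be only half the true BDP, and the
slow-start threshold derived from it would be correspondingly low.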

I think we can distinguish the two theories by running tests against a
server that supports SACK and timestamps, and where ideally we can capture
packet traces at both ends.

- Jonathan Morton



