[Bloat] The Dark Problem with AQM in the Internet?

Dave Taht dave.taht at gmail.com
Thu Aug 28 13:41:37 EDT 2014


On Thu, Aug 28, 2014 at 10:20 AM, Jerry Jongerius <jerryj at duckware.com> wrote:
> Jonathan,
>
> Yes, WireShark shows that *only* one packet gets lost.  Regardless of RWIN
> size.  The RWIN size can be below the BDP (no measurable queuing within the
> CMTS).  Or, the RWIN size can be very large, causing significant queuing
> within the CMTS.  With a larger RWIN value, the single dropped packet
> typically happens sooner in the download, rather than later.  The fact there
> is no "burst loss" is a significant clue.
>
> The graph is fully explained by the Westwood+ algorithm that the server is
> using.  If you input the data observed into the Westwood+ bandwidth
> estimator, you end up with the rate seen in the graph after the packet loss
> event.  The reason the rate gets limited (no ramp up) is due to Westwood+
> behavior on a RTO.  And the reason there is an RTO is due to the bufferbloat,
> and the timing of the lost packet in relation to when the bufferbloat
> starts.  When there is no RTO, I see the expected drop (to the Westwood+
> bandwidth estimate) and ramp back up.  On a RTO, Westwood+ sets both
> ssthresh and cwnd to its bandwidth estimate.
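
(Restating that in window terms, since it explains the "no ramp up": the
textbook Westwood+ response converts its rate estimate into a window,
roughly

  ssthresh = cwnd ~= (BWE * RTTmin) / MSS

where BWE is the filtered per-ACK bandwidth estimate and RTTmin is the
smallest RTT seen on the connection - so after the timeout the flow sits
in congestion avoidance at roughly the estimated rate instead of
slow-starting back up.)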

On the same network, what does cubic do?

> The PC does SACK, the server does not, so not used.  Timestamps off.

Timestamps are *critical* for good TCP performance above 5-10 Mbit/s on
most congestion control algorithms.
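
(For reference, those are just sysctls on a linux sender; a quick
check/enable, assuming root and the stock knob names, looks like:

  sysctl net.ipv4.tcp_timestamps net.ipv4.tcp_sack   # check current values
  sysctl -w net.ipv4.tcp_timestamps=1
  sysctl -w net.ipv4.tcp_sack=1

Both default to on in stock kernels, so if that server has them off,
somebody turned them off on purpose.)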

I note that netperf-wrapper can test multiple TCP variants, if they are
enabled on the server: basically you modprobe the needed algorithms, add
them to /proc/sys/net/ipv4/tcp_allowed_congestion_control, and select
them in the test tool (both iperf and netperf have support).
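
A rough sketch of that setup (the host and test name below are just
placeholders, and the available test names depend on your netperf-wrapper
version):

  # on the server:
  modprobe tcp_westwood
  echo "reno cubic westwood" > /proc/sys/net/ipv4/tcp_allowed_congestion_control
  sysctl -w net.ipv4.tcp_congestion_control=westwood  # or select per-test in the tool
  # then, from the client:
  netperf-wrapper -H netperf-server.example.org tcp_download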

Everyone here has installed netperf-wrapper already, yes?

It makes it very fast to generate a good test and a variety of plots like
those shown here:  http://burntchrome.blogspot.com/2014_05_01_archive.html

(in reading that over, does anyone have any news on CMTS aqm or packet
scheduling systems? It's the bulk of the problem there...)

netperf-wrapper is easy to bring up on linux; on osx it needs macports,
and the only way I've come up with to test windows behavior is to use
windows as a netperf client rather than a server.

I haven't looked into westwood+'s behavior much of late; I will try to
add it and a few other TCPs to some future tests. I do have some old
plots showing it misbehaving relative to other TCPs, but that was before
many fixes landed in the kernel.

Note: I keep hoping to find a correctly working ledbat module; the one
I have doesn't look correct (and needs to be updated for linux 3.15's
change to usec-based timestamping).

>
> - Jerry
>
>
> -----Original Message-----
> From: Jonathan Morton [mailto:chromatix99 at gmail.com]
> Sent: Thursday, August 28, 2014 10:08 AM
> To: Jerry Jongerius
> Cc: 'Greg White'; 'Sebastian Moeller'; bloat at lists.bufferbloat.net
> Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
>
>
> On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:
>
>> AQM is a great solution for bufferbloat.  End of story.  But if you want
> to track down which device in the network intentionally dropped a packet
> (when many devices in the network path will be running AQM), how are you
> going to do that?  Or how do you propose to do that?
>
> We don't plan to do that.  Not from the outside.  Frankly, we can't reliably
> tell which routers drop packets today, when AQM is not at all widely
> deployed, so that's no great loss.
>
> But if ECN finally gets deployed, AQM can set the Congestion Experienced
> flag instead of dropping packets, most of the time.  You still don't get to
> see which router did it, but the packet still gets through and the TCP
> session knows what to do about it.
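
(Side note: on linux hosts the negotiation side of ECN is one sysctl
away, e.g.

  sysctl -w net.ipv4.tcp_ecn=1   # 1 = request+accept; the default of 2 only accepts

but the bottleneck queue still has to do the marking, which is the part
that isn't deployed yet.)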
>
>> The graph presented is caused by the interaction of a single dropped packet,
> bufferbloat, and the Westwood+ congestion control algorithm - and not power
> boost.
>
> This surprises me somewhat - Westwood+ is supposed to be deliberately
> tolerant of single packet losses, since it was designed explicitly to get
> around the problem of slight random loss on wireless networks.
>
> I'd be surprised if, in fact, *only* one packet was lost.  The more usual
> case is of "burst loss", where several packets are lost in quick succession,
> and not necessarily consecutive packets.  This tends to happen repeatedly on
> dumb drop-tail queues, unless the buffer is so large that it accommodates
> the entire receive window (which, for modern OSes, is quite impressive in a
> dark sort of way).  Burst loss is characteristic of congestion, whereas
> random loss tends to lose isolated packets, so it would be much less
> surprising for Westwood+ to react to it.
>
> The packets were lost in the first place because the queue became
> chock-full, probably at just about the exact moment when the PowerBoost
> allowance ran out and the bandwidth came down (which tends to cause the
> buffer to fill rapidly), so you get the worst-case scenario: the buffer at
> its fullest, and the bandwidth draining it at its minimum.  This maximises
> the time before your TCP gets to even notice the lost packet's nonexistence,
> during which the sender keeps the buffer full because it still thinks
> everything's fine.
>
> What is probably happening is that the bottleneck queue, being so large,
> delays the retransmission of the lost packet until the Retransmit Timer
> expires.  This will cause Reno-family TCPs to revert to slow-start, assuming
> (rightly in this case) that the characteristics of the channel have changed.
> You can see that it takes most of the first second for the sender to ramp up
> to full speed, and nearly as long to ramp back up to the reduced speed, both
> of which are characteristic of slow-start at WAN latencies.  NB: during
> slow-start, the buffer remains empty as long as the incoming data rate is
> less than the output capacity, so latency is at a minimum.
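
(For concreteness, the standard RTO response per RFC 5681 is roughly
ssthresh = max(FlightSize/2, 2*MSS) and cwnd = 1 MSS, with cwnd then
doubling every RTT until it reaches ssthresh - at WAN RTTs that is the
better part of a second of ramp, which matches what the graph shows.)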
>
> Do you have TCP SACK and timestamps turned on?  Those usually allow minor
> losses like that to be handled more gracefully - the sending TCP gets a
> better idea of the RTT (allowing it to set the Retransmit Timer more
> intelligently), and would be able to see that progress is still being made
> with the backlog of buffered packets, even though the core TCP ACK is not
> advancing.  In the event of burst loss, it would also be able to retransmit
> the correct set of packets straight away.
>
> What AQM would do for you here - if your ISP implemented it properly - is to
> eliminate the negative effects of filling that massive buffer at your ISP.
> It would allow the sending TCP to detect and recover from any packet loss
> more quickly, and with ECN turned on you probably wouldn't even get any
> packet loss.
>
> What's also interesting is that, after recovering from the change in
> bandwidth, you get smaller bursts of about 15-40KB arriving at roughly
> half-second intervals, mixed in with the relatively steady 1-, 2- and
> 3-packet stream.  That is characteristic of low-level packet loss with a
> low-latency recovery.
>
> This either implies that your ISP has stuck you on a much shorter buffer for
> the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is
> enforcing a smaller congestion window on you after having suffered a
> slow-start recovery.  The latter restricts your bandwidth to match the
> delay-bandwidth product, but happily the "delay" in that equation is at a
> minimum if it keeps your buffer empty.
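
(Quick arithmetic on that: throughput ~= cwnd / RTT, so at an assumed
20 ms RTT - the capture doesn't tell us the real value - sustaining
45 Mbit/s needs about 45e6 * 0.020 / 8 ~= 112 KB of data in flight.)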
>
> And frankly, you're still getting 45Mbps under those conditions.  Many
> people would kill for that sort of performance - although they'd probably
> then want to kill everyone in the Comcast call centre later on.
>
>  - Jonathan Morton
>
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat



-- 
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article


