* [Bloat] The Dark Problem with AQM in the Internet?
@ 2014-08-23 18:16 Jerry Jongerius
2014-08-23 19:30 ` Jonathan Morton
2014-08-23 20:01 ` Sebastian Moeller
0 siblings, 2 replies; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-23 18:16 UTC (permalink / raw)
To: bloat
Request for comments on: www.duckware.com/darkaqm
The bottom line: How do you know which AQM device in a network intentionally
drops a packet, without cooperation from AQM?
Or is this in AQM somewhere and I just missed it?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-23 18:16 [Bloat] The Dark Problem with AQM in the Internet? Jerry Jongerius
@ 2014-08-23 19:30 ` Jonathan Morton
2014-08-23 20:01 ` Sebastian Moeller
1 sibling, 0 replies; 38+ messages in thread
From: Jonathan Morton @ 2014-08-23 19:30 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 681 bytes --]
There is no such indication, unless you examine the packets before and
after each potential point. But you don't generally need one. It is enough
to know that congestion exists somewhere on the path.
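For what it's worth, a minimal sketch (Python, with hypothetical file names)
of what "examining the packets before and after each potential point" amounts
to, assuming you can capture at two vantage points and have already dumped the
TCP sequence numbers seen at each point into a text file:

    # Sketch: find segments seen upstream of a hop but never seen downstream.
    # Assumes one sequence number per line, extracted beforehand from the two
    # captures (the file names here are hypothetical).
    def load_seqs(path):
        with open(path) as f:
            return {int(line) for line in f if line.strip()}

    upstream = load_seqs("before_hop.txt")    # capture closer to the sender
    downstream = load_seqs("after_hop.txt")   # capture closer to the receiver

    lost_between = sorted(upstream - downstream)
    print("segments dropped between the two capture points:", lost_between[:10])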
- Jonathan Morton
On 23 Aug 2014 21:17, "Jerry Jongerius" <jerryj@duckware.com> wrote:
> Request for comments on: www.duckware.com/darkaqm
>
> The bottom line: How do you know which AQM device in a network
> intentionally
> drops a packet, without cooperation from AQM?
>
> Or is this in AQM somewhere and I just missed it?
>
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>
[-- Attachment #2: Type: text/html, Size: 1176 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-23 18:16 [Bloat] The Dark Problem with AQM in the Internet? Jerry Jongerius
2014-08-23 19:30 ` Jonathan Morton
@ 2014-08-23 20:01 ` Sebastian Moeller
2014-08-25 17:13 ` Greg White
1 sibling, 1 reply; 38+ messages in thread
From: Sebastian Moeller @ 2014-08-23 20:01 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
Hi Jerry,
On Aug 23, 2014, at 20:16 , Jerry Jongerius <jerryj@duckware.com> wrote:
> Request for comments on: www.duckware.com/darkaqm
>
> The bottom line: How do you know which AQM device in a network intentionally
> drops a packet, without cooperation from AQM?
>
> Or is this in AQM somewhere and I just missed it?
I am sure you will get more expert responses later, but let me try to comment.
Paragraph 1:
I think you hit the nail on the head with your observation:
The average user can not figure out what AQM device intentionally dropped packets
Only, I might add, this does not depend on AQM: the user cannot figure out where packets were dropped whenever not all of the involved network hops are under said user’s control ;) So move on, nothing to see here ;)
Paragraph 2:
There is no guarantee that any network equipment responds to ICMP requests at all (for example my DSLAM does not). What about pinging a host further away and looking at that host's RTT development over time? (Minor clarification: it's the load-dependent increase of ping RTT to the CMTS that would be diagnostic of a queue, not the RTT per se.) No increase of ICMP RTT could also mean there is no AQM involved ;)
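As a crude illustration of what I mean (not a proper tool; the host and port
are just placeholders), something like the following, run while a saturating
download is going on, would show the TCP handshake RTT climbing if a queue is
building in front of you:

    import socket, time

    HOST, PORT = "192.0.2.1", 80   # placeholder: a host just beyond the suspected queue

    for _ in range(20):
        t0 = time.time()
        try:
            s = socket.create_connection((HOST, PORT), timeout=2)
            s.close()
            print("handshake RTT: %.1f ms" % ((time.time() - t0) * 1000))
        except OSError as e:
            print("probe failed:", e)
        time.sleep(1)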
I used to think along similar lines, but reading https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf made me realize that my assumptions about ping and traceroute were not really backed up by reality. Notably, traceroute will not necessarily show the real data path, its latencies, or its drop probability.
Paragraph 3
What is the advertised bandwidth of your link? To my naive eye this looks a bit like power boosting (the cable company allowing you higher-than-advertised bandwidth for a short time before the rate is reduced to the advertised speed). Your plot needs a better legend, BTW; what is the blue line showing? When you say that neither ping nor traceroute showed anything, I assumed that you measured concurrently with your download. It would be really great if you could use netperf-wrapper to get comparable data (see the link on http://www.bufferbloat.net/projects/cerowrt/wiki/Quick_Test_for_Bufferbloat ). There the latency is not only assessed by ICMP echo requests but also by UDP packets, and it is very unlikely that your ISP can special-case these in any tricky way, short of giving priority to sparse flows (which is pretty much what you would like your ISP to do in the first place ;) )
Here is where I reveal that I am just a layman, but you complain about the loss of one packet; how do you think a TCP flow settles on its transfer speed? Exactly: it keeps increasing until it loses a packet, then reduces its speed to 50% or so and slowly ramps up again until the next packet loss. So unless your test traffic is not TCP, I see no way to avoid packet loss (and no reason why it is harmful). Now if my power boost intuition should prove right, I can explain the massive drop quite well: TCP had ramped up to above the long-term stable rate and suffers several packet losses in a short time, basically resetting it to 0 or so, so the new ramping to 40Mbps looks pretty similar to the initial ramping to 110Mbps...
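To illustrate the sawtooth I am describing, here is a toy additive-increase /
multiplicative-decrease loop; the numbers are made up and this is not a model
of your actual link:

    # Toy TCP-like sawtooth: grow the window until a "loss", then halve it.
    cwnd, limit = 10.0, 100.0    # arbitrary units; "limit" plays the bottleneck
    for rtt in range(40):
        cwnd += 1.0              # additive increase, once per RTT
        if cwnd > limit:         # queue overflows, a packet is lost
            print("RTT %2d: loss at cwnd %.0f" % (rtt, cwnd))
            cwnd /= 2.0          # multiplicative decrease
        else:
            print("RTT %2d: cwnd %.0f" % (rtt, cwnd))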
Paragraph 4:
I guess ECN (explicit congestion notification) is the best you can expect: routers initially set a mark on a packet to notify the TCP endpoints that they need to throttle their speed unless they want to risk packet loss. But not all routers are configured to use it (plus you need to configure your endpoints correctly, see: http://www.bufferbloat.net/projects/cerowrt/wiki/Enable_ECN ). But this will not tell you where along the path congestion occurred, only that it occurred (and if push comes to shove your packets still get dropped.)
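Roughly, the per-packet choice an ECN-capable AQM makes once it decides to
signal congestion looks like this sketch (ignoring the real marking
probabilities and algorithms):

    def signal_congestion(pkt):
        """Sketch: mark if ECN was negotiated, otherwise the only signal is a drop."""
        if pkt.get("ect"):       # sender and receiver negotiated ECN
            pkt["ce"] = True     # set Congestion Experienced instead of dropping
            return pkt           # the packet is still delivered
        return None              # no ECN: drop it

    print(signal_congestion({"ect": True}))
    print(signal_congestion({"ect": False}))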
Also, I believe, a congested router is going to drop packets to be able to “survive” the current load; it is not going to send additional packets to inform you that it is overloaded...
Best Regards
Sebastian
>
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-23 20:01 ` Sebastian Moeller
@ 2014-08-25 17:13 ` Greg White
2014-08-25 18:09 ` Jim Gettys
2014-08-28 13:19 ` Jerry Jongerius
0 siblings, 2 replies; 38+ messages in thread
From: Greg White @ 2014-08-25 17:13 UTC (permalink / raw)
To: Sebastian Moeller, Jerry Jongerius; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 5628 bytes --]
As far as I know there are no deployments of AQM in DOCSIS networks yet.
So, the effect you are seeing is unlikely to be due to AQM.
As Sebastian indicated, it looks like an interaction between power boost,
a drop tail buffer and the tcp congestion window getting reset to
slow-start.
I ran a quick simulation of a simple network with power boost and basic
(bloated) drop tail buffer (no AQM) this morning in an attempt to
understand what is going on here. You didn't give me a lot to go on in the
text of your blog post, but nonetheless after playing around with
parameters a bit, I was able to get a result that was close to what you
are seeing (attached). Let me know if you disagree.
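For anyone who wants to get a feel for that interaction, a much-simplified
sketch is below; the rates, boost window and buffer size are invented for
illustration and are not the parameters I actually simulated:

    # Toy drop-tail queue: constant offered load, drained quickly during a
    # "power boost" window and slowly afterwards. All numbers are invented.
    ARRIVAL = 10.0              # Mbit offered per 100 ms tick
    BOOST, STEADY = 11.0, 4.0   # drain per tick during / after the boost window
    BUFFER = 20.0               # drop-tail buffer size, Mbit
    queue = 0.0
    for tick in range(50):                   # five seconds in 100 ms ticks
        drain = BOOST if tick < 30 else STEADY
        queue += ARRIVAL
        dropped = max(0.0, queue - BUFFER)   # tail drop once the buffer is full
        queue = min(queue, BUFFER)           # what the buffer can actually hold
        queue -= min(queue, drain)           # drain at the current service rate
        if dropped:
            print("t=%3.1fs dropped %.1f Mbit, queue %.1f Mbit" % (tick / 10, dropped, queue))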
I'm a bit concerned with the tone of your article, making AQM out to be
the bad guy here ("weapon against end users", etc.). The folks on this
list and those who participate in the IETF AQM WG are working on AQM and
packet scheduling algorithms in an attempt to "fix the Internet". At this
point AQM/PS is the best known solution, let's not create negative
perceptions unnecessarily.
-Greg
On 8/23/14, 2:01 PM, "Sebastian Moeller" <moeller0@gmx.de> wrote:
>Hi Jerry,
>
>On Aug 23, 2014, at 20:16 , Jerry Jongerius <jerryj@duckware.com> wrote:
>
>> Request for comments on: www.duckware.com/darkaqm
>>
>> The bottom line: How do you know which AQM device in a network
>>intentionally
>> drops a packet, without cooperation from AQM?
>>
>> Or is this in AQM somewhere and I just missed it?
>
>
>I am sure you will get more expert responses later, but let me try to
>comment.
>
>Paragraph 1:
>
>I think you hit the nail on the head with your observation:
>
>The average user can not figure out what AQM device intentionally dropped
>packets
>
>Only, I might add, this does not depend on AQM, the user can not figure
>out where packets where dropped in the case that not all involved network
>hops are under said user's control ;) So move on, nothing to see here ;)
>
>Paragraph 2:
>
>There is no guarantee that any network equipment responds to ICMP
>requests at all (for example my DSLAM does not). What about pinging a
>host further away and look at that hosts RTT development over time?
>(Minor clarification: its the load dependent increase of ping RTT to the
>CMTS that would be diagnostic of a queue, not the RTT per se). No
>increase of ICMP RTT could also mean there is no AQM involved ;)
>
> I used to think along similar lines, but reading
>https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute
>_N47_Sun.pdf made me realize that my assumptions about ping and trace
>route were not really backed up by reality. Notably traceroute will not
>necessarily show the real data's path and latencies or drop probability.
>
>Paragraph 3
>
>What is the advertised bandwidth of your link? To my naive eye this looks
>a bit like power boosting (the cable company allowing you higher than
>advertised bandwidth for a short time that is later reduced to the
>advertised speed). Your plot needs a better legend, BTW, what is the blue
>line showing? When you say that neither ping nor trace route showed
>anything, I assumed that you measured concurrently to your download. It
>would be really great if you could netperf-wrapper to get comparable data
>(see the link on
>http://www.bufferbloat.net/projects/cerowrt/wiki/Quick_Test_for_Bufferbloa
>t ) There the latency is not only assessed by ICMP echo requests but also
>by UDP packets, and it is very unlikely that your ISP can special case
>these in any tricky way, short of giving priority to sparse flows (which
>is pretty much what you would like your ISP to do in the first place ;) )
>
> Here is where I reveal that I am just a layman, but you complain about
>the loss of one packet, but how do you assume does a (TCP) settle on its
>transfer speed? Exactly it keeps increasing until it looses a packet,
>then reduces its speed to 50% or so and slowly ramps up again until the
>next packet loss. So unless your test data is not TCP I see no way to
>avoid packet loss (and no reason why it is harmful). Now if my power
>boost intuition should prove right I can explain the massive drop quite
>well, TCP had ramped up to above the long-term stable and suffers several
>packet losses in a short time, basically resetting it to 0 or so,
>therefore the new ramping to 40Mbps looks pretty similar to the initial
>ramping to 110Mbps...
>
>Paragraph 4:
>
>I guess, ECN, explicit congestion notification is the best you can
>expect, or routers will initially set a mark on a packet to notify the
>TCP endpoints that they need to throttle the speed unless that want to
>risk packet loss. But not all routers are configured to use it (plus you
>need to configure your endpoints correctly, see:
>http://www.bufferbloat.net/projects/cerowrt/wiki/Enable_ECN ). But this
>will not tell you where along the path congestion occurred, only that it
>occurred (and if push comes to shove your packets still get dropped.)
> Also, I believe, a congested router is going to drop packets to be able
>to "survive" the current load, it is not going to send additional packets
>to inform you that it is overloaded...
>
>
>Best Regards
> Sebastian
>
>
>>
>>
>> _______________________________________________
>> Bloat mailing list
>> Bloat@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/bloat
>
>_______________________________________________
>Bloat mailing list
>Bloat@lists.bufferbloat.net
>https://lists.bufferbloat.net/listinfo/bloat
[-- Attachment #2: thruput.pdf --]
[-- Type: application/pdf, Size: 25625 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-25 17:13 ` Greg White
@ 2014-08-25 18:09 ` Jim Gettys
2014-08-25 19:12 ` Sebastian Moeller
2014-08-28 13:19 ` Jerry Jongerius
1 sibling, 1 reply; 38+ messages in thread
From: Jim Gettys @ 2014-08-25 18:09 UTC (permalink / raw)
To: Greg White; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 759 bytes --]
Note that I worked with Folkert Van Heusden to get some options added to
his httping program to allow "ping" style testing against any HTTP server
out there using HTTP/TCP.
See:
http://www.vanheusden.com/httping/
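For anyone who just wants the flavour of it, the core idea is easy to
approximate by timing a small HTTP request in a loop; a rough stand-in,
not httping itself, and the URL is only a placeholder:

    import time
    import urllib.request

    URL = "http://example.com/"   # placeholder target

    for _ in range(5):
        t0 = time.time()
        with urllib.request.urlopen(URL, timeout=5) as r:
            r.read(1)             # just touch the response
        print("HTTP 'ping': %.1f ms" % ((time.time() - t0) * 1000))
        time.sleep(1)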
I find it slightly ironic that people are now concerned about ICMP ping no
longer returning queuing information given that when I started working on
bufferbloat, a number of people claimed that ICMP Ping could not be relied
upon to report reliable information, as it may be prioritized differently
by routers. This "urban legend" may or may not be true; I never observed it
in my explorations.
In any case, you all may find it useful, and my thanks to Folkert for a
very useful tool.
- Jim
[-- Attachment #2: Type: text/html, Size: 1394 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-25 18:09 ` Jim Gettys
@ 2014-08-25 19:12 ` Sebastian Moeller
2014-08-25 21:17 ` Bill Ver Steeg (versteb)
0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Moeller @ 2014-08-25 19:12 UTC (permalink / raw)
To: Jim Gettys; +Cc: bloat
Hi Jim,
On Aug 25, 2014, at 20:09 , Jim Gettys <jg@freedesktop.org> wrote:
> Note that I worked with Folkert Van Heusden to get some options added to his httping program to allow "ping" style testing against any HTTP server out there using HTTP/TCP.
>
> See:
>
> http://www.vanheusden.com/httping/
That is quite cool!
>
> I find it slightly ironic that people are now concerned about ICMP ping no longer returning queuing information given that when I started working on bufferbloat, a number of people claimed that ICMP Ping could not be relied upon to report reliable information, as it may be prioritized differently by routers.
Just to add what I learned: some routers seem to have rate limiting for ICMP processing and process these on a slow path (see https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf ). Mind you, this applies if the router processes the ICMP packet, not if it simply passes it along. So as long as the host responding to the pings is not a router with interesting limitations, this should not affect the suitability of ICMP to detect and measure bufferbloat (heck, this is what netperf-wrapper’s RRUL test automates). But since Jerry wants to pinpoint the exact location of his assumed single packet drop, he wants to use ping/traceroute to actually probe the routers on the way, so all these urban legends about ICMP processing on routers will actually affect him. But then what do I know...
Best Regards
Sebastian
> This "urban legend" may or may not be true; I never observed it in my explorations.
>
> In any case, you all may find it useful, and my thanks to Folkert for a very useful tool.
>
> - Jim
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-25 19:12 ` Sebastian Moeller
@ 2014-08-25 21:17 ` Bill Ver Steeg (versteb)
2014-08-25 21:20 ` Bill Ver Steeg (versteb)
0 siblings, 1 reply; 38+ messages in thread
From: Bill Ver Steeg (versteb) @ 2014-08-25 21:17 UTC (permalink / raw)
To: Sebastian Moeller, Jim Gettys; +Cc: bloat
Just a cautionary tale: there was a fairly well-publicized DoS attack that involved TCP SYN packets with a zero TTL (if I recall correctly), so be careful running that tool. Be particularly careful if you run it in bulk, as you may end up on a blacklist on a firewall somewhere......
Bill Ver Steeg
Distinguished Engineer
Cisco Systems
-----Original Message-----
From: bloat-bounces@lists.bufferbloat.net [mailto:bloat-bounces@lists.bufferbloat.net] On Behalf Of Sebastian Moeller
Sent: Monday, August 25, 2014 3:13 PM
To: Jim Gettys
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jim,
On Aug 25, 2014, at 20:09 , Jim Gettys <jg@freedesktop.org> wrote:
> Note that I worked with Folkert Van Heusden to get some options added to his httping program to allow "ping" style testing against any HTTP server out there using HTTP/TCP.
>
> See:
>
> http://www.vanheusden.com/httping/
That is quite cool!
>
> I find it slightly ironic that people are now concerned about ICMP ping no longer returning queuing information given that when I started working on bufferbloat, a number of people claimed that ICMP Ping could not be relied upon to report reliable information, as it may be prioritized differently by routers.
Just to add what I learned: some routers seem to have rate limiting for ICMP processing and process these on a slow-path (see https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf ). Mind you this applies if the router processes the ICMP packet, not if it simply passes it along. So as long as the host responding to the pings is not a router with interesting limitations, this should not affect the suitability of ICMP to detect and measure buffer bloat (heck this is what netperf-wrapper's RRUL test automated). But since Jerry wants to pinpoint the exact location of his assumed single packet drop he wants to use ping/traceroute to actually probe routers on the way, so all this urban legends about ICMP processing on routers will actually affect him. But then what do I know...
Best Regards
Sebastian
> This "urban legend" may or may not be true; I never observed it in my explorations.
>
> In any case, you all may find it useful, and my thanks to Folkert for a very useful tool.
>
> - Jim
>
_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-25 21:17 ` Bill Ver Steeg (versteb)
@ 2014-08-25 21:20 ` Bill Ver Steeg (versteb)
0 siblings, 0 replies; 38+ messages in thread
From: Bill Ver Steeg (versteb) @ 2014-08-25 21:20 UTC (permalink / raw)
To: Sebastian Moeller, Jim Gettys; +Cc: bloat
Oops - never mind. I thought the tool was doing traceroute-like things with varying TTLs in order to get per-hop data.
Go back to whatever you were doing......
Bill Ver Steeg
Distinguished Engineer
Cisco Systems
-----Original Message-----
From: bloat-bounces@lists.bufferbloat.net [mailto:bloat-bounces@lists.bufferbloat.net] On Behalf Of Bill Ver Steeg (versteb)
Sent: Monday, August 25, 2014 5:17 PM
To: Sebastian Moeller; Jim Gettys
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Just a cautionary tale- There was a fairly well publicized DOS attack that involved TCP SYN packets with a zero TTL (If I recall correctly), so be careful running that tool. Be particularly careful if you run it in bulk, as you may end up in a black list on a firewall somewhere......
Bill Ver Steeg
Distinguished Engineer
Cisco Systems
-----Original Message-----
From: bloat-bounces@lists.bufferbloat.net [mailto:bloat-bounces@lists.bufferbloat.net] On Behalf Of Sebastian Moeller
Sent: Monday, August 25, 2014 3:13 PM
To: Jim Gettys
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jim,
On Aug 25, 2014, at 20:09 , Jim Gettys <jg@freedesktop.org> wrote:
> Note that I worked with Folkert Van Heusden to get some options added to his httping program to allow "ping" style testing against any HTTP server out there using HTTP/TCP.
>
> See:
>
> http://www.vanheusden.com/httping/
That is quite cool!
>
> I find it slightly ironic that people are now concerned about ICMP ping no longer returning queuing information given that when I started working on bufferbloat, a number of people claimed that ICMP Ping could not be relied upon to report reliable information, as it may be prioritized differently by routers.
Just to add what I learned: some routers seem to have rate limiting for ICMP processing and process these on a slow-path (see https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf ). Mind you this applies if the router processes the ICMP packet, not if it simply passes it along. So as long as the host responding to the pings is not a router with interesting limitations, this should not affect the suitability of ICMP to detect and measure buffer bloat (heck this is what netperf-wrapper's RRUL test automated). But since Jerry wants to pinpoint the exact location of his assumed single packet drop he wants to use ping/traceroute to actually probe routers on the way, so all this urban legends about ICMP processing on routers will actually affect him. But then what do I know...
Best Regards
Sebastian
> This "urban legend" may or may not be true; I never observed it in my explorations.
>
> In any case, you all may find it useful, and my thanks to Folkert for a very useful tool.
>
> - Jim
>
_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-25 17:13 ` Greg White
2014-08-25 18:09 ` Jim Gettys
@ 2014-08-28 13:19 ` Jerry Jongerius
2014-08-28 14:07 ` Jonathan Morton
2014-08-28 14:39 ` Rich Brown
1 sibling, 2 replies; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-28 13:19 UTC (permalink / raw)
To: 'Greg White', 'Sebastian Moeller'; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 7239 bytes --]
Mr. White,
AQM is a great solution for bufferbloat. End of story. But if you want to
track down which device in the network intentionally dropped a packet (when
many devices in the network path will be running AQM), how are you going to
do that? Or how do you propose to do that?
The graph presented is caused by the interaction of a single dropped packet,
bufferbloat, and the Westwood+ congestion control algorithm, and not power
boost.
- Jerry
-----Original Message-----
From: Greg White [mailto:g.white@CableLabs.com]
Sent: Monday, August 25, 2014 1:14 PM
To: Sebastian Moeller; Jerry Jongerius
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
As far as I know there are no deployments of AQM in DOCSIS networks yet.
So, the effect you are seeing is unlikely to be due to AQM.
As Sebastian indicated, it looks like an interaction between power boost, a
drop tail buffer and the tcp congestion window getting reset to slow-start.
I ran a quick simulation of a simple network with power boost and basic
(bloated) drop tail buffer (no AQM) this morning in an attempt to understand
what is going on here. You didn't give me a lot to go on in the text of your
blog post, but nonetheless after playing around with parameters a bit, I was
able to get a result that was close to what you are seeing (attached). Let
me know if you disagree.
I'm a bit concerned with the tone of your article, making AQM out to be the
bad guy here ("weapon against end users", etc.). The folks on this list and
those who participate in the IETF AQM WG are working on AQM and packet
scheduling algorithms in an attempt to "fix the Internet". At this point
AQM/PS is the best known solution, let's not create negative perceptions
unnecessarily.
-Greg
On 8/23/14, 2:01 PM, "Sebastian Moeller" <moeller0@gmx.de> wrote:
>Hi Jerry,
>
>On Aug 23, 2014, at 20:16 , Jerry Jongerius <jerryj@duckware.com> wrote:
>
>> Request for comments on: www.duckware.com/darkaqm
>>
>> The bottom line: How do you know which AQM device in a network
>>intentionally drops a packet, without cooperation from AQM?
>>
>> Or is this in AQM somewhere and I just missed it?
>
>
>I am sure you will get more expert responses later, but let me try to
>comment.
>
>Paragraph 1:
>
>I think you hit the nail on the head with your observation:
>
>The average user can not figure out what AQM device intentionally
>dropped packets
>
>Only, I might add, this does not depend on AQM, the user can not figure
>out where packets where dropped in the case that not all involved
>network hops are under said user¹s control ;) So move on, nothing to
>see here ;)
>
>Paragraph 2:
>
>There is no guarantee that any network equipment responds to ICMP
>requests at all (for example my DSLAM does not). What about pinging a
>host further away and look at that hosts RTT development over time?
>(Minor clarification: its the load dependent increase of ping RTT to
>the CMTS that would be diagnostic of a queue, not the RTT per se). No
>increase of ICMP RTT could also mean there is no AQM involved ;)
>
> I used to think along similar lines, but reading
>https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf
>made me realize that my assumptions about ping and
>trace route were not really backed up by reality. Notably traceroute
>will not necessarily show the real data's path and latencies or drop
>probability.
>
>Paragraph 3
>
>What is the advertised bandwidth of your link? To my naive eye this
>looks a bit like power boosting (the cable company allowing you higher
>than advertised bandwidth for a short time that is later reduced to the
>advertised speed). Your plot needs a better legend, BTW, what is the
>blue line showing? When you say that neither ping nor trace route
>showed anything, I assumed that you measured concurrently to your
>download. It would be really great if you could netperf-wrapper to get
>comparable data (see the link on
>http://www.bufferbloat.net/projects/cerowrt/wiki/Quick_Test_for_Bufferbloat )
>There the latency is not only assessed by ICMP echo requests
>but also by UDP packets, and it is very unlikely that your ISP can
>special case these in any tricky way, short of giving priority to
>sparse flows (which is pretty much what you would like your ISP to do
>in the first place ;) )
>
> Here is where I reveal that I am just a layman, but you complain about
>the loss of one packet, but how do you assume does a (TCP) settle on
>its transfer speed? Exactly it keeps increasing until it looses a
>packet, then reduces its speed to 50% or so and slowly ramps up again
>until the next packet loss. So unless your test data is not TCP I see
>no way to avoid packet loss (and no reason why it is harmful). Now if
>my power boost intuition should prove right I can explain the massive
>drop quite well, TCP had ramped up to above the long-term stable and
>suffers several packet losses in a short time, basically resetting it
>to 0 or so, therefore the new ramping to 40Mbps looks pretty similar to
>the initial ramping to 110Mbps...
>
>Paragraph 4:
>
>I guess, ECN, explicit congestion notification is the best you can
>expect, or routers will initially set a mark on a packet to notify the
>TCP endpoints that they need to throttle the speed unless that want to
>risk packet loss. But not all routers are configured to use it (plus
>you need to configure your endpoints correctly, see:
>http://www.bufferbloat.net/projects/cerowrt/wiki/Enable_ECN ). But this
>will not tell you where along the path congestion occurred, only that
>it occurred (and if push comes to shove your packets still get dropped.)
> Also, I believe, a congested router is going to drop packets to be
>able to "survive" the current load, it is not going to send additional
>packets to inform you that it is overloaded...
>
>
>Best Regards
> Sebastian
>
>
>>
>>
>> _______________________________________________
>> Bloat mailing list
>> Bloat@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/bloat
>
>_______________________________________________
>Bloat mailing list
>Bloat@lists.bufferbloat.net
>https://lists.bufferbloat.net/listinfo/bloat
[-- Attachment #2: Type: text/html, Size: 16559 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 13:19 ` Jerry Jongerius
@ 2014-08-28 14:07 ` Jonathan Morton
2014-08-28 17:20 ` Jerry Jongerius
2014-08-28 14:39 ` Rich Brown
1 sibling, 1 reply; 38+ messages in thread
From: Jonathan Morton @ 2014-08-28 14:07 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:
> AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?
We don't plan to do that. Not from the outside. Frankly, we can't reliably tell which routers drop packets today, when AQM is not at all widely deployed, so that's no great loss.
But if ECN finally gets deployed, AQM can set the Congestion Experienced flag instead of dropping packets, most of the time. You still don't get to see which router did it, but the packet still gets through and the TCP session knows what to do about it.
> The graph presented is caused the interaction of a single dropped packet, bufferbloat, and the Westwood+ congestion control algorithm – and not power boost.
This surprises me somewhat - Westwood+ is supposed to be deliberately tolerant of single packet losses, since it was designed explicitly to get around the problem of slight random loss on wireless networks.
I'd be surprised if, in fact, *only* one packet was lost. The more usual case is of "burst loss", where several packets are lost in quick succession, and not necessarily consecutive packets. This tends to happen repeatedly on dumb drop-tail queues, unless the buffer is so large that it accommodates the entire receive window (which, for modern OSes, is quite impressive in a dark sort of way). Burst loss is characteristic of congestion, whereas random loss tends to lose isolated packets, so it would be much less surprising for Westwood+ to react to it.
The packets were lost in the first place because the queue became chock-full, probably at just about the exact moment when the PowerBoost allowance ran out and the bandwidth came down (which tends to cause the buffer to fill rapidly), so you get the worst-case scenario: the buffer at its fullest, and the bandwidth draining it at its minimum. This maximises the time before your TCP gets to even notice the lost packet's nonexistence, during which the sender keeps the buffer full because it still thinks everything's fine.
What is probably happening is that the bottleneck queue, being so large, delays the retransmission of the lost packet until the Retransmit Timer expires. This will cause Reno-family TCPs to revert to slow-start, assuming (rightly in this case) that the characteristics of the channel have changed. You can see that it takes most of the first second for the sender to ramp up to full speed, and nearly as long to ramp back up to the reduced speed, both of which are characteristic of slow-start at WAN latencies. NB: during slow-start, the buffer remains empty as long as the incoming data rate is less than the output capacity, so latency is at a minimum.
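For reference, the Retransmit Timer in question is derived from smoothed RTT
samples roughly as in RFC 6298; a small sketch (Python 3, made-up samples)
shows how a buffer that inflates the RTT also inflates the RTO:

    # RFC 6298-style RTO from RTT samples, in seconds. Sample values are made up.
    K, ALPHA, BETA = 4, 1 / 8, 1 / 4
    samples = [0.040, 0.045, 0.250, 0.400]    # RTT inflating as the buffer fills

    srtt, rttvar = samples[0], samples[0] / 2
    for r in samples[1:]:
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
        srtt = (1 - ALPHA) * srtt + ALPHA * r
        rto = max(1.0, srtt + K * rttvar)     # 1 s floor, per the RFC
        print("RTT %.3f s -> SRTT %.3f s, RTO %.3f s" % (r, srtt, rto))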
Do you have TCP SACK and timestamps turned on? Those usually allow minor losses like that to be handled more gracefully - the sending TCP gets a better idea of the RTT (allowing it to set the Retransmit Timer more intelligently), and would be able to see that progress is still being made with the backlog of buffered packets, even though the core TCP ACK is not advancing. In the event of burst loss, it would also be able to retransmit the correct set of packets straight away.
What AQM would do for you here - if your ISP implemented it properly - is to eliminate the negative effects of filling that massive buffer at your ISP. It would allow the sending TCP to detect and recover from any packet loss more quickly, and with ECN turned on you probably wouldn't even get any packet loss.
What's also interesting is that, after recovering from the change in bandwidth, you get smaller bursts of about 15-40KB arriving at roughly half-second intervals, mixed in with the relatively steady 1-, 2- and 3-packet stream. That is characteristic of low-level packet loss with a low-latency recovery.
This either implies that your ISP has stuck you on a much shorter buffer for the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is enforcing a smaller congestion window on you after having suffered a slow-start recovery. The latter restricts your bandwidth to match the delay-bandwidth product, but happily the "delay" in that equation is at a minimum if it keeps your buffer empty.
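To put a number on that delay-bandwidth product (the 20 ms RTT here is
hypothetical; the real value will differ):

    # Delay-bandwidth product: bytes that must be in flight to sustain a rate.
    rate_bps = 45e6     # 45 Mbit/s
    rtt_s = 0.020       # hypothetical 20 ms round trip
    bdp_bytes = rate_bps * rtt_s / 8
    print("window needed: %.0f KB (about %.0f 1448-byte segments)"
          % (bdp_bytes / 1e3, bdp_bytes / 1448))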
And frankly, you're still getting 45Mbps under those conditions. Many people would kill for that sort of performance - although they'd probably then want to kill everyone in the Comcast call centre later on.
- Jonathan Morton
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 13:19 ` Jerry Jongerius
2014-08-28 14:07 ` Jonathan Morton
@ 2014-08-28 14:39 ` Rich Brown
2014-08-28 16:20 ` Jerry Jongerius
1 sibling, 1 reply; 38+ messages in thread
From: Rich Brown @ 2014-08-28 14:39 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
[-- Attachment #1.1: Type: text/plain, Size: 1890 bytes --]
Hi Jerry,
> AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?
Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information?
The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue.
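A rough sketch of that mechanism (not the real fq_codel code, and with CoDel's
control law reduced to a single persistence check):

    import time

    TARGET = 0.005      # 5 ms of standing queue delay is acceptable
    INTERVAL = 0.100    # the excess delay must persist this long before dropping

    def enqueue(queue, pkt):
        queue.append((time.monotonic(), pkt))        # remember the arrival time

    def dequeue(queue, state):
        arrived, pkt = queue.pop(0)
        sojourn = time.monotonic() - arrived         # *measured* time in the queue
        if sojourn < TARGET:
            state["above_since"] = None              # queue is draining fine
            return pkt
        if state["above_since"] is None:
            state["above_since"] = time.monotonic()  # start of the bad period
        if time.monotonic() - state["above_since"] > INTERVAL:
            return None                              # dwelt too long, persistently: drop
        return pkt

    q, st = [], {"above_since": None}
    enqueue(q, b"payload")
    print(dequeue(q, st))                            # delivered; the sojourn here is tiny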
If the drops happen in your local gateway/home router, then it's interesting to you as the "operator" of that device. If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions.
The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network. It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.
Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in traceroute was dropping...
Best regards,
Rich
[-- Attachment #1.2: Type: text/html, Size: 2966 bytes --]
[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 496 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 14:39 ` Rich Brown
@ 2014-08-28 16:20 ` Jerry Jongerius
2014-08-28 16:35 ` Fred Baker (fred)
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-28 16:20 UTC (permalink / raw)
To: 'Rich Brown'; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 2638 bytes --]
It adds accountability. Everyone in the path right now denies that they
could possibly be the one dropping the packet.
If I want (or need!) to address the problem, I can't now. I would have to
make a change and just hope that it fixed the problem.
With accountability, I can address the problem. I then have a choice. If
the problem is the ISP, I can switch ISP's. If the problem is the mid-level
peer or the hosting provider, I can test out new hosting providers.
- Jerry
From: Rich Brown [mailto:richb.hanover@gmail.com]
Sent: Thursday, August 28, 2014 10:39 AM
To: Jerry Jongerius
Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry,
AQM is a great solution for bufferbloat. End of story. But if you want to
track down which device in the network intentionally dropped a packet (when
many devices in the network path will be running AQM), how are you going to
do that? Or how do you propose to do that?
Yes, but... I want to understand why you are looking to know which device
dropped the packet. What would you do with the information?
The great beauty of fq_codel is that it discards packets that have dwelt too
long in a queue by actually *measuring* how long they've been in the queue.
If the drops happen in your local gateway/home router, then it's interesting
to you as the "operator" of that device. If the drops happen elsewhere
(perhaps some enlightened ISP has installed fq_codel, PIE, or some other
zoomy queue discipline) then they're doing the right thing as well - they're
managing their traffic as well as they can. But once the data leaves your
gateway router, you can't make any further predictions.
The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal
performance of the *local* gateway, to make it adapt to the remainder of the
(black box) network. It might make sense to instrument the CeroWrt/OpenWrt
code to track the number of fq_codel drops to come up with a sense of what's
'normal'. And if you need to know exactly what's happening, then
tcpdump/wireshark are your friends.
Maybe I'm missing the point of your note, but I'm not sure there's anything
you can do beyond your gateway. In the broader network, operators are
continually watching their traffic and drop rates, and
adjusting/reconfiguring their networks to adapt. But in general, it's
impossible for you to have any sway/influence on their operations, so I'm
not sure what you would do if you could know that the third router in
traceroute was dropping...
Best regards,
Rich
[-- Attachment #2: Type: text/html, Size: 7086 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 16:20 ` Jerry Jongerius
@ 2014-08-28 16:35 ` Fred Baker (fred)
2014-08-28 18:00 ` Jan Ceuleers
2014-08-28 16:36 ` Greg White
2014-09-01 11:47 ` Richard Scheffenegger
2 siblings, 1 reply; 38+ messages in thread
From: Fred Baker (fred) @ 2014-08-28 16:35 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
[-- Attachment #1.1: Type: text/plain, Size: 3904 bytes --]
On Aug 28, 2014, at 9:20 AM, Jerry Jongerius <jerryj@duckware.com> wrote:
> It add accountability. Everyone in the path right now denies that they could possibly be the one dropping the packet.
>
> If I want (or need!) to address the problem, I can’t now. I would have to make a change and just hope that it fixed the problem.
>
> With accountability, I can address the problem. I then have a choice. If the problem is the ISP, I can switch ISP’s. If the problem is the mid-level peer or the hosting provider, I can test out new hosting providers.
May I ask what may be a dumb question?
All communication has some probability of error. That’s the reason we have CRCs on link layer frames: to detect and discard errored packets. The probability of such an error varies by media type; it’s relatively uncommon (O(10^-11)) on fiber, a little more common (perhaps O(10^-9)) on wired Ethernet, likely on Wifi (O(10^-7) or so, which is why Wifi incorporates local retransmission), and very likely (O(10^-4)) on satellite links, which is why they use forward error correction.
Errors are not usually single bit errors. They are far more commonly block errors, especially if trellis coding is in use, as once there is an error the entire link goes screwy until it works out where the data is going. Such block errors might consume entire messages, or sets of messages, including not only the messages but the gaps between them.
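To put rough numbers on those probabilities: assuming independent bit errors
(which, as noted, understates real block errors), the chance of losing a
1500-byte frame is 1 - (1 - BER)^bits, for example:

    # Rough per-frame loss probability for a 1500-byte frame at various BERs,
    # assuming independent bit errors (real errors are burstier than this).
    bits = 1500 * 8
    for name, ber in [("fiber", 1e-11), ("wired Ethernet", 1e-9),
                      ("WiFi", 1e-7), ("satellite", 1e-4)]:
        p_loss = 1 - (1 - ber) ** bits
        print("%-15s BER %g -> frame loss %.3g" % (name, ber, p_loss))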
When a message is lost due to an error, how do you determine whose fault it is?
> - Jerry
>
>
>
> From: Rich Brown [mailto:richb.hanover@gmail.com]
> Sent: Thursday, August 28, 2014 10:39 AM
> To: Jerry Jongerius
> Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net
> Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
>
> Hi Jerry,
>
> AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?
>
> Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information?
>
> The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue.
>
> If the drops happen in your local gateway/home router, then it's interesting to you as the "operator" of that device. If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions.
>
> The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network. It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.
>
> Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in traceroute was dropping...
>
> Best regards,
>
> Rich
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
[-- Attachment #1.2: Type: text/html, Size: 12938 bytes --]
[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 16:20 ` Jerry Jongerius
2014-08-28 16:35 ` Fred Baker (fred)
@ 2014-08-28 16:36 ` Greg White
2014-08-28 16:52 ` Bill Ver Steeg (versteb)
2014-09-01 11:47 ` Richard Scheffenegger
2 siblings, 1 reply; 38+ messages in thread
From: Greg White @ 2014-08-28 16:36 UTC (permalink / raw)
To: Jerry Jongerius, 'Rich Brown'; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 3723 bytes --]
And again, AQM is not causing the problem that you observed. As Jonathan indicated, it would almost certainly make your performance better. I can't speak for Comcast, but AFAIK they are on a path to deploy AQM. If their customers start raising FUD that could change.
TCP requires congestion signals. In the vast majority of cases today (and for the foreseeable future) those signals are dropped packets. Going on a witch hunt to find the evildoer that dropped your packet is counterproductive. I think you should instead be asking "why didn't you drop my packet earlier, before the buffer got so bloated and power boost cut the BDP by 60%?"
-Greg
From: Jerry Jongerius <jerryj@duckware.com>
Date: Thursday, August 28, 2014 at 10:20 AM
To: 'Rich Brown' <richb.hanover@gmail.com>
Cc: "bloat@lists.bufferbloat.net" <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
It add accountability. Everyone in the path right now denies that they could possibly be the one dropping the packet.
If I want (or need!) to address the problem, I can’t now. I would have to make a change and just hope that it fixed the problem.
With accountability, I can address the problem. I then have a choice. If the problem is the ISP, I can switch ISP’s. If the problem is the mid-level peer or the hosting provider, I can test out new hosting providers.
- Jerry
From: Rich Brown [mailto:richb.hanover@gmail.com]
Sent: Thursday, August 28, 2014 10:39 AM
To: Jerry Jongerius
Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry,
AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?
Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information?
The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue.
If the drops happen in your local gateway/home router, then it's interesting to you as the "operator" of that device. If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions.
The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network. It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.
Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in traceroute was dropping...
Best regards,
Rich
[-- Attachment #2: Type: text/html, Size: 9559 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 16:36 ` Greg White
@ 2014-08-28 16:52 ` Bill Ver Steeg (versteb)
0 siblings, 0 replies; 38+ messages in thread
From: Bill Ver Steeg (versteb) @ 2014-08-28 16:52 UTC (permalink / raw)
To: Greg White, Jerry Jongerius, 'Rich Brown'; +Cc: bloat
[-- Attachment #1.1: Type: text/plain, Size: 4749 bytes --]
Regarding AQM in North American HFC deployments-
I also can't speak for individual Service Providers, but Greg was being modest and the following may be interesting.
The most recent DOCSIS 3.1 spec calls for AQM in the CMTS. It specifically calls for a variant of PIE designed with the DOCSIS MAC layer in mind. The DOCSIS 3.0 spec is also being amended to require AQM. Both specs also recommend including AQM in the cable modems, which can be turned on in the HFC network.
See http://tools.ietf.org/html/draft-white-aqm-docsis-pie-00 for more details.
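The heart of PIE is a periodic update of a drop probability driven by the
estimated queueing delay; a simplified sketch follows, with illustrative
constants rather than the DOCSIS-PIE values from the draft:

    # Simplified PIE-style update: steer the queueing delay toward a target by
    # adjusting a drop probability. Constants and samples are illustrative only.
    TARGET = 0.015                  # target queueing delay, seconds
    ALPHA, BETA = 0.125, 1.25       # controller gains (not the spec values)

    def update_drop_prob(p, qdelay, qdelay_old):
        p += ALPHA * (qdelay - TARGET) + BETA * (qdelay - qdelay_old)
        return min(max(p, 0.0), 1.0)

    p, old = 0.0, 0.0
    for qdelay in [0.005, 0.020, 0.060, 0.080, 0.050, 0.020]:
        p = update_drop_prob(p, qdelay, old)
        old = qdelay
        print("queue delay %3.0f ms -> drop probability %.3f" % (qdelay * 1000, p))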
bvs
Bill Ver Steeg
Distinguished Engineer
Cisco Systems
From: bloat-bounces@lists.bufferbloat.net [mailto:bloat-bounces@lists.bufferbloat.net] On Behalf Of Greg White
Sent: Thursday, August 28, 2014 12:36 PM
To: Jerry Jongerius; 'Rich Brown'
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
And again, AQM is not causing the problem that you observed. As Jonathan indicated, it would almost certainly make your performance better. I can't speak for Comcast, but AFAIK they are on a path to deploy AQM. If their customers start raising FUD that could change.
TCP requires congestion signals. In the vast majority of cases today (and for the foreseeable future) those signals are dropped packets. Going on a witch hunt to find the evildoer that dropped your packet is counter productive. I think you should instead be asking "why didn't you drop my packet earlier, before the buffer got so bloated and power boost cut the BDP by 60%?"
-Greg
From: Jerry Jongerius <jerryj@duckware.com>
Date: Thursday, August 28, 2014 at 10:20 AM
To: 'Rich Brown' <richb.hanover@gmail.com>
Cc: "bloat@lists.bufferbloat.net" <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
It add accountability. Everyone in the path right now denies that they could possibly be the one dropping the packet.
If I want (or need!) to address the problem, I can't now. I would have to make a change and just hope that it fixed the problem.
With accountability, I can address the problem. I then have a choice. If the problem is the ISP, I can switch ISP's. If the problem is the mid-level peer or the hosting provider, I can test out new hosting providers.
- Jerry
From: Rich Brown [mailto:richb.hanover@gmail.com]
Sent: Thursday, August 28, 2014 10:39 AM
To: Jerry Jongerius
Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry,
AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?
Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information?
The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue.
If the drops happen in your local gateway/home router, then it's interesting to you as the "operator" of that device. If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions.
The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network. It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.
Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in traceroute was dropping...
Best regards,
Rich
[-- Attachment #1.2: Type: text/html, Size: 16518 bytes --]
[-- Attachment #2: image001.jpg --]
[-- Type: image/jpeg, Size: 5673 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 14:07 ` Jonathan Morton
@ 2014-08-28 17:20 ` Jerry Jongerius
2014-08-28 17:41 ` Dave Taht
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-28 17:20 UTC (permalink / raw)
To: 'Jonathan Morton'; +Cc: bloat
Jonathan,
Yes, WireShark shows that *only* one packet gets lost. Regardless of RWIN
size. The RWIN size can be below the BDP (no measurable queuing within the
CMTS). Or, the RWIN size can be very large, causing significant queuing
within the CMTS. With a larger RWIN value, the single dropped packet
typically happens sooner in the download, rather than later. The fact that
there is no "burst loss" is a significant clue.
The graph is fully explained by the Westwood+ algorithm that the server is
using. If you input the data observed into the Westwood+ bandwidth
estimator, you end up with the rate seen in the graph after the packet loss
event. The reason the rate gets limited (no ramp up) is due to Westwood+
behavior on an RTO. And the reason there is an RTO is due to the bufferbloat,
and the timing of the lost packet in relation to when the bufferbloat
starts. When there is no RTO, I see the expected drop (to the Westwood+
bandwidth estimate) and ramp back up. On an RTO, Westwood+ sets both
ssthresh and cwnd to its bandwidth estimate.
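For context, the Westwood+ estimator I am referring to works roughly like this
(a sketch of the published algorithm with invented sample numbers, not the
server's actual code):

    # Westwood+ in rough form: low-pass filter the per-RTT delivery rate, and on
    # an RTO set both ssthresh and cwnd from bwe * RTTmin. All numbers invented.
    MSS = 1448
    samples = [(250000, 0.020), (260000, 0.022), (300000, 0.060)]  # (acked bytes, RTT s)

    rates = [acked / rtt for acked, rtt in samples]   # instantaneous delivery rates
    bwe = rates[0]
    for rate in rates[1:]:
        bwe = 0.9 * bwe + 0.1 * rate                  # EWMA bandwidth estimate

    rtt_min = min(rtt for _, rtt in samples)
    cwnd_after_rto = bwe * rtt_min / MSS              # segments Westwood+ falls back to
    print("BWE %.1f Mbit/s -> cwnd/ssthresh of about %.0f segments"
          % (bwe * 8 / 1e6, cwnd_after_rto))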
The PC does SACK, the server does not, so not used. Timestamps off.
- Jerry
-----Original Message-----
From: Jonathan Morton [mailto:chromatix99@gmail.com]
Sent: Thursday, August 28, 2014 10:08 AM
To: Jerry Jongerius
Cc: 'Greg White'; 'Sebastian Moeller'; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:
> AQM is a great solution for bufferbloat. End of story. But if you want
to track down which device in the network intentionally dropped a packet
(when many devices in the network path will be running AQM), how are you
going to do that? Or how do you propose to do that?
We don't plan to do that. Not from the outside. Frankly, we can't reliably
tell which routers drop packets today, when AQM is not at all widely
deployed, so that's no great loss.
But if ECN finally gets deployed, AQM can set the Congestion Experienced
flag instead of dropping packets, most of the time. You still don't get to
see which router did it, but the packet still gets through and the TCP
session knows what to do about it.
> The graph presented is caused the interaction of a single dropped packet,
bufferbloat, and the Westwood+ congestion control algorithm - and not power
boost.
This surprises me somewhat - Westwood+ is supposed to be deliberately
tolerant of single packet losses, since it was designed explicitly to get
around the problem of slight random loss on wireless networks.
I'd be surprised if, in fact, *only* one packet was lost. The more usual
case is of "burst loss", where several packets are lost in quick succession,
and not necessarily consecutive packets. This tends to happen repeatedly on
dump drop-tail queues, unless the buffer is so large that it accommodates
the entire receive window (which, for modern OSes, is quite impressive in a
dark sort of way). Burst loss is characteristic of congestion, whereas
random loss tends to lose isolated packets, so it would be much less
surprising for Westwood+ to react to it.
The packets were lost in the first place because the queue became
chock-full, probably at just about the exact moment when the PowerBoost
allowance ran out and the bandwidth came down (which tends to cause the
buffer to fill rapidly), so you get the worst-case scenario: the buffer at
its fullest, and the bandwidth draining it at its minimum. This maximises
the time before your TCP gets to even notice the lost packet's nonexistence,
during which the sender keeps the buffer full because it still thinks
everything's fine.
What is probably happening is that the bottleneck queue, being so large,
delays the retransmission of the lost packet until the Retransmit Timer
expires. This will cause Reno-family TCPs to revert to slow-start, assuming
(rightly in this case) that the characteristics of the channel have changed.
You can see that it takes most of the first second for the sender to ramp up
to full speed, and nearly as long to ramp back up to the reduced speed, both
of which are characteristic of slow-start at WAN latencies. NB: during
slow-start, the buffer remains empty as long as the incoming data rate is
less than the output capacity, so latency is at a minimum.
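For context, the standard retransmit-timer computation (RFC 6298) looks
roughly like the sketch below; the RTT samples are assumptions chosen to
mimic a filling queue, and real stacks differ in detail (Linux, for
example, clamps the minimum RTO):

# RTO per RFC 6298: smoothed RTT plus four times the RTT variance.
ALPHA, BETA, K = 1/8, 1/4, 4

def update_rto(srtt, rttvar, sample):
    """Feed one RTT sample (seconds); return (srtt, rttvar, rto)."""
    if srtt is None:                      # first measurement
        srtt, rttvar = sample, sample / 2
    else:
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
        srtt = (1 - ALPHA) * srtt + ALPHA * sample
    return srtt, rttvar, srtt + K * rttvar

srtt = rttvar = None
for r in (0.020, 0.020, 0.150, 0.400, 0.800):   # RTT grows as the queue fills
    srtt, rttvar, rto = update_rto(srtt, rttvar, r)
print(f"RTO after the bloated samples: {rto:.2f} s")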
Do you have TCP SACK and timestamps turned on? Those usually allow minor
losses like that to be handled more gracefully - the sending TCP gets a
better idea of the RTT (allowing it to set the Retransmit Timer more
intelligently), and would be able to see that progress is still being made
with the backlog of buffered packets, even though the core TCP ACK is not
advancing. In the event of burst loss, it would also be able to retransmit
the correct set of packets straight away.
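On a Linux sender those two options can be checked (and, with root,
changed) via the ipv4 sysctls; a minimal sketch, assuming a Linux host:

# Report whether SACK and timestamps are enabled on this host.
def tcp_option_enabled(name):
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        return f.read().strip() != "0"

for opt in ("tcp_sack", "tcp_timestamps"):
    print(opt, "enabled" if tcp_option_enabled(opt) else "disabled")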
What AQM would do for you here - if your ISP implemented it properly - is to
eliminate the negative effects of filling that massive buffer at your ISP.
It would allow the sending TCP to detect and recover from any packet loss
more quickly, and with ECN turned on you probably wouldn't even get any
packet loss.
What's also interesting is that, after recovering from the change in
bandwidth, you get smaller bursts of about 15-40KB arriving at roughly
half-second intervals, mixed in with the relatively steady 1-, 2- and
3-packet stream. That is characteristic of low-level packet loss with a
low-latency recovery.
This either implies that your ISP has stuck you on a much shorter buffer for
the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is
enforcing a smaller congestion window on you after having suffered a
slow-start recovery. The latter restricts your bandwidth to match the
delay-bandwidth product, but happily the "delay" in that equation is at a
minimum if it keeps your buffer empty.
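As a rough worked example (the window and RTT values below are assumptions,
not taken from the trace):

# If the congestion window is pinned after recovery, throughput is cwnd/RTT.
cwnd_bytes = 112_000    # ~77 segments of 1448 bytes (assumed)
rtt = 0.020             # ~20 ms base RTT with the buffer kept empty (assumed)
print(f"{cwnd_bytes * 8 / rtt / 1e6:.1f} Mbit/s")   # ~44.8 Mbit/s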
And frankly, you're still getting 45Mbps under those conditions. Many
people would kill for that sort of performance - although they'd probably
then want to kill everyone in the Comcast call centre later on.
- Jonathan Morton
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 17:20 ` Jerry Jongerius
@ 2014-08-28 17:41 ` Dave Taht
2014-08-28 18:15 ` Jonathan Morton
` (2 subsequent siblings)
3 siblings, 0 replies; 38+ messages in thread
From: Dave Taht @ 2014-08-28 17:41 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
On Thu, Aug 28, 2014 at 10:20 AM, Jerry Jongerius <jerryj@duckware.com> wrote:
> Jonathan,
>
> Yes, WireShark shows that *only* one packet gets lost. Regardless of RWIN
> size. The RWIN size can be below the BDP (no measurable queuing within the
> CMTS). Or, the RWIN size can be very large, causing significant queuing
> within the CMTS. With a larger RWIN value, the single dropped packet
> typically happens sooner in the download, rather than later. The fact there
> is no "burst loss" is a significant clue.
>
> The graph is fully explained by the Westwood+ algorithm that the server is
> using. If you input the data observed into the Westwood+ bandwidth
> estimator, you end up with the rate seen in the graph after the packet loss
> event. The reason the rate gets limited (no ramp up) is due to Westwood+
> behavior on a RTO. And the reason there is the RTO is due the bufferbloat,
> and the timing of the lost packet in relation to when the bufferbloat
> starts. When there is no RTO, I see the expected drop (to the Westwood+
> bandwidth estimate) and ramp back up. On a RTO, Westwood+ sets both
> ssthresh and cwnd to its bandwidth estimate.
On the same network, what does cubic do?
> The PC does SACK, the server does not, so not used. Timestamps off.
Timestamps are *critical* for good TCP performance above 5-10 Mbit/s on
most congestion control algorithms.
I note that the netperf-wrapper test can exercise multiple TCP variants,
if they are enabled on the server: basically you need to modprobe the
needed algorithms, enable them in
/proc/sys/net/ipv4/tcp_allowed_congestion_control, and select them in the
test tool (both iperf and netperf have support).
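For per-connection selection on such a server, a rough sketch (the host
and port are placeholders, and the module must already be loaded and
allowed as above):

# Open a TCP connection using a specific congestion control algorithm.
import socket

def connect_with_cc(host, port, cc=b"westwood"):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_CONGESTION is Linux-specific (exposed in Python 3.6+).
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, cc)
    s.connect((host, port))
    return s

# e.g. connect_with_cc("netperf.example.org", 12865, b"cubic")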
Everyone here has installed netperf-wrapper already, yes?
Very fast to generate a good test and a variety of plots like those shown
here: http://burntchrome.blogspot.com/2014_05_01_archive.html
(in reading that over, does anyone have any news on CMTS aqm or packet
scheduling systems? It's the bulk of the problem there...)
netperf-wrapper is easy to bring up on Linux; on OS X it needs MacPorts,
and the only way I've come up with to test Windows behavior is to use
Windows as a netperf client rather than a server.
I haven't looked into Westwood+'s behavior much of late; I will try to add
it and a few other TCPs to some future tests. I do have some old plots
showing it misbehaving relative to other TCPs, but that was before many
fixes landed in the kernel.
Note: I keep hoping to find a correctly working LEDBAT module; the one I
have doesn't look correct (and needs to be updated for Linux 3.15's change
to microsecond-based timestamping).
>
> - Jerry
>
>
> -----Original Message-----
> From: Jonathan Morton [mailto:chromatix99@gmail.com]
> Sent: Thursday, August 28, 2014 10:08 AM
> To: Jerry Jongerius
> Cc: 'Greg White'; 'Sebastian Moeller'; bloat@lists.bufferbloat.net
> Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
>
>
> On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:
>
>> AQM is a great solution for bufferbloat. End of story. But if you want
> to track down which device in the network intentionally dropped a packet
> (when many devices in the network path will be running AQM), how are you
> going to do that? Or how do you propose to do that?
>
> We don't plan to do that. Not from the outside. Frankly, we can't reliably
> tell which routers drop packets today, when AQM is not at all widely
> deployed, so that's no great loss.
>
> But if ECN finally gets deployed, AQM can set the Congestion Experienced
> flag instead of dropping packets, most of the time. You still don't get to
> see which router did it, but the packet still gets through and the TCP
> session knows what to do about it.
>
>> The graph presented is caused the interaction of a single dropped packet,
> bufferbloat, and the Westwood+ congestion control algorithm - and not power
> boost.
>
> This surprises me somewhat - Westwood+ is supposed to be deliberately
> tolerant of single packet losses, since it was designed explicitly to get
> around the problem of slight random loss on wireless networks.
>
> I'd be surprised if, in fact, *only* one packet was lost. The more usual
> case is of "burst loss", where several packets are lost in quick succession,
> and not necessarily consecutive packets. This tends to happen repeatedly on
> dump drop-tail queues, unless the buffer is so large that it accommodates
> the entire receive window (which, for modern OSes, is quite impressive in a
> dark sort of way). Burst loss is characteristic of congestion, whereas
> random loss tends to lose isolated packets, so it would be much less
> surprising for Westwood+ to react to it.
>
> The packets were lost in the first place because the queue became
> chock-full, probably at just about the exact moment when the PowerBoost
> allowance ran out and the bandwidth came down (which tends to cause the
> buffer to fill rapidly), so you get the worst-case scenario: the buffer at
> its fullest, and the bandwidth draining it at its minimum. This maximises
> the time before your TCP gets to even notice the lost packet's nonexistence,
> during which the sender keeps the buffer full because it still thinks
> everything's fine.
>
> What is probably happening is that the bottleneck queue, being so large,
> delays the retransmission of the lost packet until the Retransmit Timer
> expires. This will cause Reno-family TCPs to revert to slow-start, assuming
> (rightly in this case) that the characteristics of the channel have changed.
> You can see that it takes most of the first second for the sender to ramp up
> to full speed, and nearly as long to ramp back up to the reduced speed, both
> of which are characteristic of slow-start at WAN latencies. NB: during
> slow-start, the buffer remains empty as long as the incoming data rate is
> less than the output capacity, so latency is at a minimum.
>
> Do you have TCP SACK and timestamps turned on? Those usually allow minor
> losses like that to be handled more gracefully - the sending TCP gets a
> better idea of the RTT (allowing it to set the Retransmit Timer more
> intelligently), and would be able to see that progress is still being made
> with the backlog of buffered packets, even though the core TCP ACK is not
> advancing. In the event of burst loss, it would also be able to retransmit
> the correct set of packets straight away.
>
> What AQM would do for you here - if your ISP implemented it properly - is to
> eliminate the negative effects of filling that massive buffer at your ISP.
> It would allow the sending TCP to detect and recover from any packet loss
> more quickly, and with ECN turned on you probably wouldn't even get any
> packet loss.
>
> What's also interesting is that, after recovering from the change in
> bandwidth, you get smaller bursts of about 15-40KB arriving at roughly
> half-second intervals, mixed in with the relatively steady 1-, 2- and
> 3-packet stream. That is characteristic of low-level packet loss with a
> low-latency recovery.
>
> This either implies that your ISP has stuck you on a much shorter buffer for
> the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is
> enforcing a smaller congestion window on you after having suffered a
> slow-start recovery. The latter restricts your bandwidth to match the
> delay-bandwidth product, but happily the "delay" in that equation is at a
> minimum if it keeps your buffer empty.
>
> And frankly, you're still getting 45Mbps under those conditions. Many
> people would kill for that sort of performance - although they'd probably
> then want to kill everyone in the Comcast call centre later on.
>
> - Jonathan Morton
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 16:35 ` Fred Baker (fred)
@ 2014-08-28 18:00 ` Jan Ceuleers
2014-08-28 18:13 ` Dave Taht
2014-08-28 18:41 ` Kenyon Ralph
0 siblings, 2 replies; 38+ messages in thread
From: Jan Ceuleers @ 2014-08-28 18:00 UTC (permalink / raw)
To: bloat
On 08/28/2014 06:35 PM, Fred Baker (fred) wrote:
> When a message is lost due to an error, how do you determine whose fault
> it is?
Links need to be engineered for the optimum combination of power,
bandwidth, overhead and residual error that meets requirements. I agree
with your implied point that a single error is unlikely to be indicative
of a real problem, but a link not meeting requirements is someone's fault.
So, like Jerry, I'd be interested in endpoints being able to collect
statistics on per-hop loss probabilities, so that admins can hold their
providers accountable.
Jan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 18:00 ` Jan Ceuleers
@ 2014-08-28 18:13 ` Dave Taht
2014-08-29 1:57 ` David Lang
2014-08-28 18:41 ` Kenyon Ralph
1 sibling, 1 reply; 38+ messages in thread
From: Dave Taht @ 2014-08-28 18:13 UTC (permalink / raw)
To: Jan Ceuleers; +Cc: bloat
On Thu, Aug 28, 2014 at 11:00 AM, Jan Ceuleers <jan.ceuleers@gmail.com> wrote:
> On 08/28/2014 06:35 PM, Fred Baker (fred) wrote:
>> When a message is lost due to an error, how do you determine whose fault
>> it is?
>
> Links need to be engineered for the optimum combination of power,
> bandwidth, overhead and residual error that meets requirements. I agree
> with your implied point that a single error is unlikely to be indicative
> of a real problem, but a link not meeting requirements is someone's fault.
>
> So like Jerry I'd be interested in an ability for endpoints to be able
> to collect statistics on per-hop loss probabilities so that admins can
> hold their providers accountable.
I will argue that a provider demonstrating 3% packet loss and low
latency is "better" than a provider showing .03% packet loss and
exorbitant latency. So I'd rather be measuring latency AND loss.
One very cool thing that went by at sigcomm last week was the concept
of "active networking" revived in the form of "Tiny Packet Programs":
see:
http://arxiv.org/pdf/1405.7143v3.pdf
Which has a core concept of a protocol and virtual machine that can
actively gather data from the path itself about buffering, loss, etc.
No implementation was presented, but I could see a way to do it easily in
Linux via iptables. Regrettably, elsewhere in the real world, we have to
infer these statistics via various means.
> Jan
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 17:20 ` Jerry Jongerius
2014-08-28 17:41 ` Dave Taht
@ 2014-08-28 18:15 ` Jonathan Morton
2014-08-29 14:21 ` Jerry Jongerius
2014-08-28 18:59 ` Sebastian Moeller
2014-08-29 1:59 ` David Lang
3 siblings, 1 reply; 38+ messages in thread
From: Jonathan Morton @ 2014-08-28 18:15 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 569 bytes --]
If it is genuinely a single packet, then I have an alternate theory.
I note from http://www.dslreports.com/faq/14520 that PowerBoost works on
the first 20MB of a download. At 100Mbps or so, that's about 2 seconds.
So that's quite convincing evidence that your packet loss is happening at
the moment PowerBoost switches off.
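A quick check of that arithmetic (treating the FAQ's 20MB as 20 MiB):

# PowerBoost credit drained at ~100 Mbit/s.
print(20 * 1024 * 1024 * 8 / 100e6)   # ~1.7 seconds, i.e. "about 2 seconds"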
It might be that the switching process takes long enough to drop one
packet. Or it might be that Comcast deliberately drop one packet in order
to signal the change in bandwidth to the sender. Clever, if mildly
distasteful.
- Jonathan Morton
[-- Attachment #2: Type: text/html, Size: 710 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 18:00 ` Jan Ceuleers
2014-08-28 18:13 ` Dave Taht
@ 2014-08-28 18:41 ` Kenyon Ralph
2014-08-28 19:04 ` Dave Taht
1 sibling, 1 reply; 38+ messages in thread
From: Kenyon Ralph @ 2014-08-28 18:41 UTC (permalink / raw)
To: bloat
[-- Attachment #1: Type: text/plain, Size: 891 bytes --]
On 2014-08-28T20:00:54+0200, Jan Ceuleers <jan.ceuleers@gmail.com> wrote:
> On 08/28/2014 06:35 PM, Fred Baker (fred) wrote:
> > When a message is lost due to an error, how do you determine whose fault
> > it is?
>
> Links need to be engineered for the optimum combination of power,
> bandwidth, overhead and residual error that meets requirements. I agree
> with your implied point that a single error is unlikely to be indicative
> of a real problem, but a link not meeting requirements is someone's fault.
>
> So like Jerry I'd be interested in an ability for endpoints to be able
> to collect statistics on per-hop loss probabilities so that admins can
> hold their providers accountable.
Here is some relevant work:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2417573
"Measurement and Analysis of Internet Interconnection and Congestion"
--
Kenyon Ralph
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 17:20 ` Jerry Jongerius
2014-08-28 17:41 ` Dave Taht
2014-08-28 18:15 ` Jonathan Morton
@ 2014-08-28 18:59 ` Sebastian Moeller
2014-08-29 11:33 ` Jerry Jongerius
2014-08-29 1:59 ` David Lang
3 siblings, 1 reply; 38+ messages in thread
From: Sebastian Moeller @ 2014-08-28 18:59 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
Hi Jerry,
On Aug 28, 2014, at 19:20 , Jerry Jongerius <jerryj@duckware.com> wrote:
> Jonathan,
>
> Yes, WireShark shows that *only* one packet gets lost. Regardless of RWIN
> size. The RWIN size can be below the BDP (no measurable queuing within the
> CMTS). Or, the RWIN size can be very large, causing significant queuing
> within the CMTS. With a larger RWIN value, the single dropped packet
> typically happens sooner in the download, rather than later. The fact there
> is no "burst loss" is a significant clue.
>
> The graph is fully explained by the Westwood+ algorithm that the server is
> using. If you input the data observed into the Westwood+ bandwidth
> estimator, you end up with the rate seen in the graph after the packet loss
> event. The reason the rate gets limited (no ramp up) is due to Westwood+
> behavior on a RTO. And the reason there is the RTO is due the bufferbloat,
> and the timing of the lost packet in relation to when the bufferbloat
> starts. When there is no RTO, I see the expected drop (to the Westwood+
> bandwidth estimate) and ramp back up. On a RTO, Westwood+ sets both
> ssthresh and cwnd to its bandwidth estimate.
>
> The PC does SACK, the server does not, so not used. Timestamps off.
Okay, that is interesting. Could I convince you to try to enable SACK on the server and test whether you still see the catastrophic results? And/or to try another TCP variant instead of Westwood+, like the default CUBIC?
Best Regards
Sebastian
>
> - Jerry
>
>
> -----Original Message-----
> From: Jonathan Morton [mailto:chromatix99@gmail.com]
> Sent: Thursday, August 28, 2014 10:08 AM
> To: Jerry Jongerius
> Cc: 'Greg White'; 'Sebastian Moeller'; bloat@lists.bufferbloat.net
> Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
>
>
> On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:
>
>> AQM is a great solution for bufferbloat. End of story. But if you want
> to track down which device in the network intentionally dropped a packet
> (when many devices in the network path will be running AQM), how are you
> going to do that? Or how do you propose to do that?
>
> We don't plan to do that. Not from the outside. Frankly, we can't reliably
> tell which routers drop packets today, when AQM is not at all widely
> deployed, so that's no great loss.
>
> But if ECN finally gets deployed, AQM can set the Congestion Experienced
> flag instead of dropping packets, most of the time. You still don't get to
> see which router did it, but the packet still gets through and the TCP
> session knows what to do about it.
>
>> The graph presented is caused the interaction of a single dropped packet,
> bufferbloat, and the Westwood+ congestion control algorithm - and not power
> boost.
>
> This surprises me somewhat - Westwood+ is supposed to be deliberately
> tolerant of single packet losses, since it was designed explicitly to get
> around the problem of slight random loss on wireless networks.
>
> I'd be surprised if, in fact, *only* one packet was lost. The more usual
> case is of "burst loss", where several packets are lost in quick succession,
> and not necessarily consecutive packets. This tends to happen repeatedly on
> dump drop-tail queues, unless the buffer is so large that it accommodates
> the entire receive window (which, for modern OSes, is quite impressive in a
> dark sort of way). Burst loss is characteristic of congestion, whereas
> random loss tends to lose isolated packets, so it would be much less
> surprising for Westwood+ to react to it.
>
> The packets were lost in the first place because the queue became
> chock-full, probably at just about the exact moment when the PowerBoost
> allowance ran out and the bandwidth came down (which tends to cause the
> buffer to fill rapidly), so you get the worst-case scenario: the buffer at
> its fullest, and the bandwidth draining it at its minimum. This maximises
> the time before your TCP gets to even notice the lost packet's nonexistence,
> during which the sender keeps the buffer full because it still thinks
> everything's fine.
>
> What is probably happening is that the bottleneck queue, being so large,
> delays the retransmission of the lost packet until the Retransmit Timer
> expires. This will cause Reno-family TCPs to revert to slow-start, assuming
> (rightly in this case) that the characteristics of the channel have changed.
> You can see that it takes most of the first second for the sender to ramp up
> to full speed, and nearly as long to ramp back up to the reduced speed, both
> of which are characteristic of slow-start at WAN latencies. NB: during
> slow-start, the buffer remains empty as long as the incoming data rate is
> less than the output capacity, so latency is at a minimum.
>
> Do you have TCP SACK and timestamps turned on? Those usually allow minor
> losses like that to be handled more gracefully - the sending TCP gets a
> better idea of the RTT (allowing it to set the Retransmit Timer more
> intelligently), and would be able to see that progress is still being made
> with the backlog of buffered packets, even though the core TCP ACK is not
> advancing. In the event of burst loss, it would also be able to retransmit
> the correct set of packets straight away.
>
> What AQM would do for you here - if your ISP implemented it properly - is to
> eliminate the negative effects of filling that massive buffer at your ISP.
> It would allow the sending TCP to detect and recover from any packet loss
> more quickly, and with ECN turned on you probably wouldn't even get any
> packet loss.
>
> What's also interesting is that, after recovering from the change in
> bandwidth, you get smaller bursts of about 15-40KB arriving at roughly
> half-second intervals, mixed in with the relatively steady 1-, 2- and
> 3-packet stream. That is characteristic of low-level packet loss with a
> low-latency recovery.
>
> This either implies that your ISP has stuck you on a much shorter buffer for
> the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is
> enforcing a smaller congestion window on you after having suffered a
> slow-start recovery. The latter restricts your bandwidth to match the
> delay-bandwidth product, but happily the "delay" in that equation is at a
> minimum if it keeps your buffer empty.
>
> And frankly, you're still getting 45Mbps under those conditions. Many
> people would kill for that sort of performance - although they'd probably
> then want to kill everyone in the Comcast call centre later on.
>
> - Jonathan Morton
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 18:41 ` Kenyon Ralph
@ 2014-08-28 19:04 ` Dave Taht
0 siblings, 0 replies; 38+ messages in thread
From: Dave Taht @ 2014-08-28 19:04 UTC (permalink / raw)
To: bloat
On Thu, Aug 28, 2014 at 11:41 AM, Kenyon Ralph <kenyon@kenyonralph.com> wrote:
> On 2014-08-28T20:00:54+0200, Jan Ceuleers <jan.ceuleers@gmail.com> wrote:
>> On 08/28/2014 06:35 PM, Fred Baker (fred) wrote:
>> > When a message is lost due to an error, how do you determine whose fault
>> > it is?
>>
>> Links need to be engineered for the optimum combination of power,
>> bandwidth, overhead and residual error that meets requirements. I agree
>> with your implied point that a single error is unlikely to be indicative
>> of a real problem, but a link not meeting requirements is someone's fault.
>>
>> So like Jerry I'd be interested in an ability for endpoints to be able
>> to collect statistics on per-hop loss probabilities so that admins can
>> hold their providers accountable.
>
> Here is some relevant work:
> http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2417573
> "Measurement and Analysis of Internet Interconnection and Congestion"
Wow. That gets the "paper of the month" award from me.
> --
> Kenyon Ralph
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>
--
Dave Täht
NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 18:13 ` Dave Taht
@ 2014-08-29 1:57 ` David Lang
0 siblings, 0 replies; 38+ messages in thread
From: David Lang @ 2014-08-29 1:57 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2036 bytes --]
On Thu, 28 Aug 2014, Dave Taht wrote:
> On Thu, Aug 28, 2014 at 11:00 AM, Jan Ceuleers <jan.ceuleers@gmail.com> wrote:
>> On 08/28/2014 06:35 PM, Fred Baker (fred) wrote:
>>> When a message is lost due to an error, how do you determine whose fault
>>> it is?
>>
>> Links need to be engineered for the optimum combination of power,
>> bandwidth, overhead and residual error that meets requirements. I agree
>> with your implied point that a single error is unlikely to be indicative
>> of a real problem, but a link not meeting requirements is someone's fault.
>>
>> So like Jerry I'd be interested in an ability for endpoints to be able
>> to collect statistics on per-hop loss probabilities so that admins can
>> hold their providers accountable.
>
> I will argue that a provider demonstrating 3% packet loss and low
> latency is "better" than a provider showing .03% packet loss and
> exorbitant latency. So I'd rather be measuring latency AND loss.
Yep, the drive to never lose a packet is what caused buffer sizes to grow
to such silly extremes.
David Lang
> One very cool thing that went by at sigcomm last week was the concept
> of "active networking" revived in the form of "Tiny Packet Programs":
> see:
>
> http://arxiv.org/pdf/1405.7143v3.pdf
>
> Which has a core concept of a protocol and virtual machine that can
> actively gather data from the path itself about buffering, loss, etc.
>
> No implementation was presented, but I could see a way to easily do it
> in linux via iptables. Regrettably, elsewhere in the real world, we
> have to infer these statistics via various means.
>
>
>
>> Jan
>>
>> _______________________________________________
>> Bloat mailing list
>> Bloat@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/bloat
>
>
>
> --
> Dave Täht
>
> NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 17:20 ` Jerry Jongerius
` (2 preceding siblings ...)
2014-08-28 18:59 ` Sebastian Moeller
@ 2014-08-29 1:59 ` David Lang
2014-08-29 14:37 ` Jerry Jongerius
3 siblings, 1 reply; 38+ messages in thread
From: David Lang @ 2014-08-29 1:59 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
On Thu, 28 Aug 2014, Jerry Jongerius wrote:
> Yes, WireShark shows that *only* one packet gets lost. Regardless of RWIN
> size. The RWIN size can be below the BDP (no measurable queuing within the
> CMTS). Or, the RWIN size can be very large, causing significant queuing
> within the CMTS. With a larger RWIN value, the single dropped packet
> typically happens sooner in the download, rather than later. The fact there
> is no "burst loss" is a significant clue.
did you check to see if packets were re-sent even if they weren't lost? One
of the side effects of excessive buffering is that it's possible for a
packet to be held in the buffer long enough that the sender thinks it has
been lost and retransmits it, so the packet is effectively 'lost' even if it
actually arrives at its destination.
David Lang
> The graph is fully explained by the Westwood+ algorithm that the server is
> using. If you input the data observed into the Westwood+ bandwidth
> estimator, you end up with the rate seen in the graph after the packet loss
> event. The reason the rate gets limited (no ramp up) is due to Westwood+
> behavior on a RTO. And the reason there is the RTO is due the bufferbloat,
> and the timing of the lost packet in relation to when the bufferbloat
> starts. When there is no RTO, I see the expected drop (to the Westwood+
> bandwidth estimate) and ramp back up. On a RTO, Westwood+ sets both
> ssthresh and cwnd to its bandwidth estimate.
>
> The PC does SACK, the server does not, so not used. Timestamps off.
>
> - Jerry
>
>
> -----Original Message-----
> From: Jonathan Morton [mailto:chromatix99@gmail.com]
> Sent: Thursday, August 28, 2014 10:08 AM
> To: Jerry Jongerius
> Cc: 'Greg White'; 'Sebastian Moeller'; bloat@lists.bufferbloat.net
> Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
>
>
> On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:
>
>> AQM is a great solution for bufferbloat. End of story. But if you want
> to track down which device in the network intentionally dropped a packet
> (when many devices in the network path will be running AQM), how are you
> going to do that? Or how do you propose to do that?
>
> We don't plan to do that. Not from the outside. Frankly, we can't reliably
> tell which routers drop packets today, when AQM is not at all widely
> deployed, so that's no great loss.
>
> But if ECN finally gets deployed, AQM can set the Congestion Experienced
> flag instead of dropping packets, most of the time. You still don't get to
> see which router did it, but the packet still gets through and the TCP
> session knows what to do about it.
>
>> The graph presented is caused the interaction of a single dropped packet,
> bufferbloat, and the Westwood+ congestion control algorithm - and not power
> boost.
>
> This surprises me somewhat - Westwood+ is supposed to be deliberately
> tolerant of single packet losses, since it was designed explicitly to get
> around the problem of slight random loss on wireless networks.
>
> I'd be surprised if, in fact, *only* one packet was lost. The more usual
> case is of "burst loss", where several packets are lost in quick succession,
> and not necessarily consecutive packets. This tends to happen repeatedly on
> dump drop-tail queues, unless the buffer is so large that it accommodates
> the entire receive window (which, for modern OSes, is quite impressive in a
> dark sort of way). Burst loss is characteristic of congestion, whereas
> random loss tends to lose isolated packets, so it would be much less
> surprising for Westwood+ to react to it.
>
> The packets were lost in the first place because the queue became
> chock-full, probably at just about the exact moment when the PowerBoost
> allowance ran out and the bandwidth came down (which tends to cause the
> buffer to fill rapidly), so you get the worst-case scenario: the buffer at
> its fullest, and the bandwidth draining it at its minimum. This maximises
> the time before your TCP gets to even notice the lost packet's nonexistence,
> during which the sender keeps the buffer full because it still thinks
> everything's fine.
>
> What is probably happening is that the bottleneck queue, being so large,
> delays the retransmission of the lost packet until the Retransmit Timer
> expires. This will cause Reno-family TCPs to revert to slow-start, assuming
> (rightly in this case) that the characteristics of the channel have changed.
> You can see that it takes most of the first second for the sender to ramp up
> to full speed, and nearly as long to ramp back up to the reduced speed, both
> of which are characteristic of slow-start at WAN latencies. NB: during
> slow-start, the buffer remains empty as long as the incoming data rate is
> less than the output capacity, so latency is at a minimum.
>
> Do you have TCP SACK and timestamps turned on? Those usually allow minor
> losses like that to be handled more gracefully - the sending TCP gets a
> better idea of the RTT (allowing it to set the Retransmit Timer more
> intelligently), and would be able to see that progress is still being made
> with the backlog of buffered packets, even though the core TCP ACK is not
> advancing. In the event of burst loss, it would also be able to retransmit
> the correct set of packets straight away.
>
> What AQM would do for you here - if your ISP implemented it properly - is to
> eliminate the negative effects of filling that massive buffer at your ISP.
> It would allow the sending TCP to detect and recover from any packet loss
> more quickly, and with ECN turned on you probably wouldn't even get any
> packet loss.
>
> What's also interesting is that, after recovering from the change in
> bandwidth, you get smaller bursts of about 15-40KB arriving at roughly
> half-second intervals, mixed in with the relatively steady 1-, 2- and
> 3-packet stream. That is characteristic of low-level packet loss with a
> low-latency recovery.
>
> This either implies that your ISP has stuck you on a much shorter buffer for
> the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is
> enforcing a smaller congestion window on you after having suffered a
> slow-start recovery. The latter restricts your bandwidth to match the
> delay-bandwidth product, but happily the "delay" in that equation is at a
> minimum if it keeps your buffer empty.
>
> And frankly, you're still getting 45Mbps under those conditions. Many
> people would kill for that sort of performance - although they'd probably
> then want to kill everyone in the Comcast call centre later on.
>
> - Jonathan Morton
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 18:59 ` Sebastian Moeller
@ 2014-08-29 11:33 ` Jerry Jongerius
2014-08-29 12:18 ` Sebastian Moeller
2014-08-29 14:42 ` Dave Taht
0 siblings, 2 replies; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-29 11:33 UTC (permalink / raw)
To: 'Sebastian Moeller'; +Cc: bloat
> Okay that is interesting, Could I convince you to try to enable SACK
> on the server and test whether you still see the catastrophic results?
> And/or try another tcp variant instead of westwood+, like the default
cubic.
Would love to, but cannot. I have read-only access to the settings on that
server.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-29 11:33 ` Jerry Jongerius
@ 2014-08-29 12:18 ` Sebastian Moeller
2014-08-29 14:42 ` Dave Taht
1 sibling, 0 replies; 38+ messages in thread
From: Sebastian Moeller @ 2014-08-29 12:18 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
Hi Jerry,
On Aug 29, 2014, at 13:33 , Jerry Jongerius <jerryj@duckware.com> wrote:
>> Okay that is interesting, Could I convince you to try to enable SACK
>> on the server and test whether you still see the catastrophic results?
>> And/or try another tcp variant instead of westwood+, like the default
> cubic.
>
> Would love to, but can not. I have read only access to settings on that
> server.
Ah, too bad, it would have been nice to be able to pinpoint this more closely (is this effect a quirk/bug in Westwood+, or is it caused by the “archaic” lack of SACK?). But this list contains vast knowledge about networking, so I hope that someone has an idea of how to get closer to the root cause even without root access on the server. Oh, maybe you can ask the hosting company / owner of the server to switch the TCP for you?
Best Regards
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 18:15 ` Jonathan Morton
@ 2014-08-29 14:21 ` Jerry Jongerius
2014-08-29 16:31 ` Jonathan Morton
0 siblings, 1 reply; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-29 14:21 UTC (permalink / raw)
To: 'Jonathan Morton'; +Cc: bloat
[-- Attachment #1.1: Type: text/plain, Size: 895 bytes --]
A ‘boost’ has never been seen. Bandwidth graphs where there is no packet loss look like:
From: Jonathan Morton [mailto:chromatix99@gmail.com]
Sent: Thursday, August 28, 2014 2:15 PM
To: Jerry Jongerius
Cc: bloat
Subject: RE: [Bloat] The Dark Problem with AQM in the Internet?
If it is genuinely a single packet, then I have an alternate theory.
I note from http://www.dslreports.com/faq/14520 that PowerBoost works on the first 20MB of a download. At 100Mbps or so, that's about 2 seconds. So that's quite convincing evidence that your packet loss is happening at the moment PowerBoost switches off.
It might be that the switching process takes long enough to drop one packet. Or it might be that Comcast deliberately drop one packet in order to signal the change in bandwidth to the sender. Clever, if mildly distasteful.
- Jonathan Morton
[-- Attachment #1.2: Type: text/html, Size: 4279 bytes --]
[-- Attachment #2: image001.jpg --]
[-- Type: application/octet-stream, Size: 13900 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-29 1:59 ` David Lang
@ 2014-08-29 14:37 ` Jerry Jongerius
2014-08-30 6:05 ` Jonathan Morton
0 siblings, 1 reply; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-29 14:37 UTC (permalink / raw)
To: 'David Lang'; +Cc: bloat
> did you check to see if packets were re-sent even if they weren't lost? on
of
> the side effects of excessive buffering is that it's possible for a packet
to
> be held in the buffer long enough that the sender thinks that it's been
> lost and retransmits it, so the packet is effectivly 'lost' even if it
actually
> arrives at it's destination.
Yes. A duplicate packet for the missing packet is not seen.
The receiver 'misses' a packet and starts sending out tons of dup ACKs (for
all the packets in flight and queued up due to bufferbloat), and then, way
later, the packet does come in (after the RTT inflated by bufferbloat,
indicating that it is the 'resent' packet).
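One way to sanity-check that pattern from the client-side capture is to
count pure ACKs per acknowledgment number; a rough sketch (scapy is assumed
to be installed, and the capture file name is a placeholder):

# The lost packet shows up as one ACK value repeated a very large number
# of times in the receiver's data-less segments.
from collections import Counter
from scapy.all import rdpcap, TCP

dup_counts = Counter()
for pkt in rdpcap("download.pcap"):              # placeholder file name
    if TCP in pkt and len(pkt[TCP].payload) == 0:
        dup_counts[pkt[TCP].ack] += 1

ack, n = dup_counts.most_common(1)[0]
print(f"ack {ack} repeated {n} times")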
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-29 11:33 ` Jerry Jongerius
2014-08-29 12:18 ` Sebastian Moeller
@ 2014-08-29 14:42 ` Dave Taht
1 sibling, 0 replies; 38+ messages in thread
From: Dave Taht @ 2014-08-29 14:42 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
On Fri, Aug 29, 2014 at 4:33 AM, Jerry Jongerius <jerryj@duckware.com> wrote:
>> Okay that is interesting, Could I convince you to try to enable SACK
>> on the server and test whether you still see the catastrophic results?
>> And/or try another tcp variant instead of westwood+, like the default
> cubic.
>
> Would love to, but can not. I have read only access to settings on that
> server.
I have servers set up across the planet (Linode and Google cloud services)
at this point that I can configure with any TCP/SACK/timestamp combination
desired. Give me a location where you want to repeat your tests.
(West coast US, Colorado, east coast, Japan, England are the ones I can
easily muck with.)
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-29 14:21 ` Jerry Jongerius
@ 2014-08-29 16:31 ` Jonathan Morton
2014-08-29 16:54 ` Jerry Jongerius
0 siblings, 1 reply; 38+ messages in thread
From: Jonathan Morton @ 2014-08-29 16:31 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 614 bytes --]
> A ‘boost’ has never been seen. Bandwidth graphs where there is no packet
loss look like:
That's very odd, if true. Westwood+ should still be increasing the
congestion window additively after recovering, so even if it got the
bandwidth or latency estimates wrong, it should still recover full
performance. Not necessarily very quickly, but it should still be visible
on a timescale of several seconds.
More likely is that you're conflating cause and effect. The packet is only
lost when the boost ends, so if for some reason the boost never ends, the
packet is never lost.
- Jonathan Morton
[-- Attachment #2: Type: text/html, Size: 684 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-29 16:31 ` Jonathan Morton
@ 2014-08-29 16:54 ` Jerry Jongerius
0 siblings, 0 replies; 38+ messages in thread
From: Jerry Jongerius @ 2014-08-29 16:54 UTC (permalink / raw)
To: 'Jonathan Morton'; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 864 bytes --]
The additive increase is there in the raw data.
From: Jonathan Morton [mailto:chromatix99@gmail.com]
Sent: Friday, August 29, 2014 12:31 PM
To: Jerry Jongerius
Cc: bloat
Subject: RE: [Bloat] The Dark Problem with AQM in the Internet?
> A ‘boost’ has never been seen. Bandwidth graphs where there is no packet loss look like:
That's very odd, if true. Westwood+ should still be increasing the congestion window additively after recovering, so even if it got the bandwidth or latency estimates wrong, it should still recover full performance. Not necessarily very quickly, but it should still be visible on a timescale of several seconds.
More likely is that you're conflating cause and effect. The packet is only lost when the boost ends, so if for some reason the boost never ends, the packet is never lost.
- Jonathan Morton
[-- Attachment #2: Type: text/html, Size: 3172 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-29 14:37 ` Jerry Jongerius
@ 2014-08-30 6:05 ` Jonathan Morton
2014-08-30 6:28 ` Stephen Hemminger
0 siblings, 1 reply; 38+ messages in thread
From: Jonathan Morton @ 2014-08-30 6:05 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
On 29 Aug, 2014, at 5:37 pm, Jerry Jongerius wrote:
>> did you check to see if packets were re-sent even if they weren't lost? on of
>> the side effects of excessive buffering is that it's possible for a packet to
>> be held in the buffer long enough that the sender thinks that it's been
>> lost and retransmits it, so the packet is effectivly 'lost' even if it actually
>> arrives at it's destination.
>
> Yes. A duplicate packet for the missing packet is not seen.
>
> The receiver 'misses' a packet; starts sending out tons of dup acks (for all
> packets in flight and queued up due to bufferbloat), and then way later, the
> packet does come in (after the RTT caused by bufferbloat; indicating it is
> the 'resent' packet).
I think I've cracked this one - the cause, if not the solution.
Let's assume, for the moment, that Jerry is correct and PowerBoost plays no part in this. That implies that the flow is not using the full bandwidth after the loss, *and* that the additive increase of cwnd isn't sufficient to recover to that point within the test period.
There *is* a sequence of events that can lead to that happening:
1) Packet is lost, at the tail end of the bottleneck queue.
2) Eventually, receiver sees the loss and starts sending duplicate acks (each triggering CA_EVENT_SLOW_ACK path in the sender). Sender (running Westwood+) assumes that each of these represents a received, full-size packet, for bandwidth estimation purposes.
3) The receiver doesn't send, or the sender doesn't receive, a duplicate ack for every packet actually received. Maybe some firewall sees a large number of identical packets arriving - without SACK or timestamps, they *would* be identical - and filters some of them. The bandwidth estimate therefore becomes significantly lower than the true value, and additionally the RTO fires and causes the sender to reset cwnd to 1 (CA_EVENT_LOSS).
4) The retransmitted packet finally reaches the receiver, and the ack it sends includes all the data received in the meantime (about 3.5MB). This is not sufficient to immediately reset the bandwidth estimate to the true value, because the BWE is sampled at RTT intervals, and also includes low-pass filtering.
5) This ends the recovery phase (CA_EVENT_CWR_COMPLETE), and the sender resets the slow-start threshold to correspond to the estimated delay-bandwidth product (MinRTT * BWE) at that moment.
6) This estimated DBP is lower than the true value, so the subsequent slow-start phase ends with the cwnd inadequately sized. Additive increase would eventually correct that - but the key word is *eventually*.
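To see why that matters quantitatively, here is a toy calculation (all of
the numbers below are assumptions, not values from Jerry's trace):

# Toy illustration of steps 2-6: if the sender only sees a fraction of the
# duplicate ACKs, an estimator that credits one segment per ACK under-counts
# delivered data, and ssthresh = BWE * MinRTT comes out too small.
MSS = 1448
true_rate = 100e6 / 8          # ~100 Mbit/s in bytes/s
min_rtt = 0.020                # 20 ms base RTT
acks_per_rtt = true_rate * min_rtt / MSS

for seen in (1.0, 0.5, 0.25):  # fraction of dup ACKs that actually arrive
    bwe = seen * acks_per_rtt * MSS / min_rtt
    print(f"{seen:4.0%} of ACKs seen -> ssthresh ~ {bwe * min_rtt / MSS:.0f} segments")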
- Jonathan Morton
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-30 6:05 ` Jonathan Morton
@ 2014-08-30 6:28 ` Stephen Hemminger
2014-08-30 6:45 ` Jonathan Morton
0 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2014-08-30 6:28 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Sat, 30 Aug 2014 09:05:58 +0300
Jonathan Morton <chromatix99@gmail.com> wrote:
>
> On 29 Aug, 2014, at 5:37 pm, Jerry Jongerius wrote:
>
> >> did you check to see if packets were re-sent even if they weren't lost? on of
> >> the side effects of excessive buffering is that it's possible for a packet to
> >> be held in the buffer long enough that the sender thinks that it's been
> >> lost and retransmits it, so the packet is effectivly 'lost' even if it actually
> >> arrives at it's destination.
> >
> > Yes. A duplicate packet for the missing packet is not seen.
> >
> > The receiver 'misses' a packet; starts sending out tons of dup acks (for all
> > packets in flight and queued up due to bufferbloat), and then way later, the
> > packet does come in (after the RTT caused by bufferbloat; indicating it is
> > the 'resent' packet).
>
> I think I've cracked this one - the cause, if not the solution.
>
> Let's assume, for the moment, that Jerry is correct and PowerBoost plays no part in this. That implies that the flow is not using the full bandwidth after the loss, *and* that the additive increase of cwnd isn't sufficient to recover to that point within the test period.
>
> There *is* a sequence of events that can lead to that happening:
>
> 1) Packet is lost, at the tail end of the bottleneck queue.
>
> 2) Eventually, receiver sees the loss and starts sending duplicate acks (each triggering CA_EVENT_SLOW_ACK path in the sender). Sender (running Westwood+) assumes that each of these represents a received, full-size packet, for bandwidth estimation purposes.
>
> 3) The receiver doesn't send, or the sender doesn't receive, a duplicate ack for every packet actually received. Maybe some firewall sees a large number of identical packets arriving - without SACK or timestamps, they *would* be identical - and filters some of them. The bandwidth estimate therefore becomes significantly lower than the true value, and additionally the RTO fires and causes the sender to reset cwnd to 1 (CA_EVENT_LOSS).
>
> 4) The retransmitted packet finally reaches the receiver, and the ack it sends includes all the data received in the meantime (about 3.5MB). This is not sufficient to immediately reset the bandwidth estimate to the true value, because the BWE is sampled at RTT intervals, and also includes low-pass filtering.
>
> 5) This ends the recovery phase (CA_EVENT_CWR_COMPLETE), and the sender resets the slow-start threshold to correspond to the estimated delay-bandwidth product (MinRTT * BWE) at that moment.
>
> 6) This estimated DBP is lower than the true value, so the subsequent slow-start phase ends with the cwnd inadequately sized. Additive increase would eventually correct that - but the key word is *eventually*.
>
> - Jonathan Morton
Bandwidth estimation by ACK RTT is fraught with problems. The returning ACK
can be delayed for any number of reasons, such as other traffic or
aggregation. This kind of delay-based congestion control suffers badly from
any latency induced in the network.
So instead of causing bloat, it gets hit by bloat.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-30 6:28 ` Stephen Hemminger
@ 2014-08-30 6:45 ` Jonathan Morton
2014-09-01 17:30 ` Jerry Jongerius
0 siblings, 1 reply; 38+ messages in thread
From: Jonathan Morton @ 2014-08-30 6:45 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: bloat
On 30 Aug, 2014, at 9:28 am, Stephen Hemminger wrote:
> On Sat, 30 Aug 2014 09:05:58 +0300
> Jonathan Morton <chromatix99@gmail.com> wrote:
>
>>
>> On 29 Aug, 2014, at 5:37 pm, Jerry Jongerius wrote:
>>
>>>> did you check to see if packets were re-sent even if they weren't lost? on of
>>>> the side effects of excessive buffering is that it's possible for a packet to
>>>> be held in the buffer long enough that the sender thinks that it's been
>>>> lost and retransmits it, so the packet is effectivly 'lost' even if it actually
>>>> arrives at it's destination.
>>>
>>> Yes. A duplicate packet for the missing packet is not seen.
>>>
>>> The receiver 'misses' a packet; starts sending out tons of dup acks (for all
>>> packets in flight and queued up due to bufferbloat), and then way later, the
>>> packet does come in (after the RTT caused by bufferbloat; indicating it is
>>> the 'resent' packet).
>>
>> I think I've cracked this one - the cause, if not the solution.
>>
>> Let's assume, for the moment, that Jerry is correct and PowerBoost plays no part in this. That implies that the flow is not using the full bandwidth after the loss, *and* that the additive increase of cwnd isn't sufficient to recover to that point within the test period.
>>
>> There *is* a sequence of events that can lead to that happening:
>>
>> 1) Packet is lost, at the tail end of the bottleneck queue.
>>
>> 2) Eventually, receiver sees the loss and starts sending duplicate acks (each triggering CA_EVENT_SLOW_ACK path in the sender). Sender (running Westwood+) assumes that each of these represents a received, full-size packet, for bandwidth estimation purposes.
>>
>> 3) The receiver doesn't send, or the sender doesn't receive, a duplicate ack for every packet actually received. Maybe some firewall sees a large number of identical packets arriving - without SACK or timestamps, they *would* be identical - and filters some of them. The bandwidth estimate therefore becomes significantly lower than the true value, and additionally the RTO fires and causes the sender to reset cwnd to 1 (CA_EVENT_LOSS).
>>
>> 4) The retransmitted packet finally reaches the receiver, and the ack it sends includes all the data received in the meantime (about 3.5MB). This is not sufficient to immediately reset the bandwidth estimate to the true value, because the BWE is sampled at RTT intervals, and also includes low-pass filtering.
>>
>> 5) This ends the recovery phase (CA_EVENT_CWR_COMPLETE), and the sender resets the slow-start threshold to correspond to the estimated delay-bandwidth product (MinRTT * BWE) at that moment.
>>
>> 6) This estimated DBP is lower than the true value, so the subsequent slow-start phase ends with the cwnd inadequately sized. Additive increase would eventually correct that - but the key word is *eventually*.
>>
>> - Jonathan Morton
>
> Bandwidth estimates by ack RTT is fraught with problems. The returning ACK can be
> delayed for any number of reasons such as other traffic or aggregation. This kind
> of delay based congestion control suffers badly from any latency induced in the network.
> So instead of causing bloat, it gets hit by bloat.
In this case, the TCP is actually tracking RTT surprisingly well, but the bandwidth estimate goes wrong because the duplicate ACKs go missing. Note that if the MinRTT was estimated too high (which is the only direction it could go), this would result in the slow-start threshold being *higher* than required, and the symptoms observed would not occur, since the cwnd would grow to the required value after recovery.
This is the opposite effect from what happens to TCP Vegas in a bloated environment. Vegas stops increasing cwnd when the estimated RTT is noticeably higher than MinRTT, but if the true MinRTT changes (or it has to compete with a non-Vegas TCP flow), it has trouble tracking that fact.
There is another possibility: that the assumption of non-queue RTT being constant against varying bandwidth is incorrect. If that is the case, then the observed behaviour can be explained without recourse to lost duplicate ACKs - so Westwood+ is correctly tracking both MinRTT and BWE - but (MinRTT * BWE) turns out to be a poor estimate of the true BDP. I think this still fails to explain why the cwnd is reset (which should occur only on RTO), but everything else potentially fits.
I think we can distinguish the two theories by running tests against a server that supports SACK and timestamps, and where ideally we can capture packet traces at both ends.
- Jonathan Morton
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-28 16:20 ` Jerry Jongerius
2014-08-28 16:35 ` Fred Baker (fred)
2014-08-28 16:36 ` Greg White
@ 2014-09-01 11:47 ` Richard Scheffenegger
2 siblings, 0 replies; 38+ messages in thread
From: Richard Scheffenegger @ 2014-09-01 11:47 UTC (permalink / raw)
To: Jerry Jongerius, 'Rich Brown'; +Cc: bloat
[-- Attachment #1: Type: text/plain, Size: 3592 bytes --]
Hi Jerry,
Isn't this the problem statement of ConEx?
Again, you at the end host would gain little insight with ConEx, but every intermediate network operator can observe the red/black-marked packets, compare the ratios, and know to what extent (by looking at ingress vs. egress into his network) he is contributing...
Best regards,
Richard
----- Original Message -----
From: Jerry Jongerius
To: 'Rich Brown'
Cc: bloat@lists.bufferbloat.net
Sent: Thursday, August 28, 2014 6:20 PM
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
It adds accountability. Everyone in the path right now denies that they could possibly be the one dropping the packet.
If I want (or need!) to address the problem, I can't now. I would have to make a change and just hope that it fixed the problem.
With accountability, I can address the problem. I then have a choice. If the problem is the ISP, I can switch ISP's. If the problem is the mid-level peer or the hosting provider, I can test out new hosting providers.
- Jerry
From: Rich Brown [mailto:richb.hanover@gmail.com]
Sent: Thursday, August 28, 2014 10:39 AM
To: Jerry Jongerius
Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry,
AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?
Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information?
The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue.
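For anyone unfamiliar with the mechanism, a condensed sketch of that
"measure the dwell time" idea (the 5 ms / 100 ms constants are CoDel's
published defaults; this is an illustration, not the fq_codel source):

# Drop decisions are based on how long a packet actually sat in the queue
# (its sojourn time), not on how many bytes are queued.
TARGET = 0.005      # 5 ms of acceptable standing delay
INTERVAL = 0.100    # sojourn must dip below TARGET within this window

class CodelSketch:
    def __init__(self):
        self.first_above = None    # when sojourn first exceeded TARGET

    def should_drop(self, enqueue_time, now):
        sojourn = now - enqueue_time
        if sojourn < TARGET:
            self.first_above = None
            return False
        if self.first_above is None:
            self.first_above = now
            return False
        # Above target for a whole interval: start dropping. (The real
        # algorithm then spaces further drops by interval/sqrt(count).)
        return now - self.first_above >= INTERVAL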
If the drops happen in your local gateway/home router, then it's interesting to you as the "operator" of that device. If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions.
The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network. It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.
Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in traceroute was dropping...
Best regards,
Rich
------------------------------------------------------------------------------
_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
[-- Attachment #2: Type: text/html, Size: 9814 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-08-30 6:45 ` Jonathan Morton
@ 2014-09-01 17:30 ` Jerry Jongerius
2014-09-01 17:40 ` Dave Taht
0 siblings, 1 reply; 38+ messages in thread
From: Jerry Jongerius @ 2014-09-01 17:30 UTC (permalink / raw)
To: 'Jonathan Morton', 'Stephen Hemminger'; +Cc: bloat
Westwood+, as described in published research papers, does not fully
explain the graph that was seen. However, Westwood+, as implemented in
Linux, DOES fully explain the graph that was seen. One place to review the
source code is here:
http://lxr.free-electrons.com/source/net/ipv4/tcp_westwood.c?v=3.2
Some observations about this code:
1. The bandwidth estimate is run through a (7×prev+new)/8 filter TWICE
[see lines 93-94].
2. The unit of time for all quantities in the code (rtt, bwe, delta, etc.)
is jiffies, not milliseconds or microseconds [see line 108].
3. The bandwidth estimate is updated once per rtt, the test in the code
(line 139) being essentially: delta > rtt. However, rtt is the last
unsmoothed rtt seen on the link (and it increases during bufferbloat).
When rtt increases, the frequency of bandwidth updates drops.
4. The server is Linux 3.2 with HZ=100 (meaning the jiffies counter
advances every 10 ms).
When you graph some of the raw data observed (see
http://www.duckware.com/blog/the-dark-problem-with-aqm-in-the-internet/images/chart.gif),
the Westwood+ bandwidth estimate takes significant time to ramp up.
For the first 0.84 seconds of the download, we expect the Westwood+ code to
update the bandwidth estimate around 14 times, or once every 60 ms or so.
However, after this we know there is a bufferbloat episode, with RTTs
increasing (and therefore the frequency of bandwidth updates decreasing).
The red line in the graph above suggests that Westwood+ might have updated
the bandwidth estimate only around 9-10 more times before using it to set
cwnd/ssthresh.
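Putting the observations above together, here is a simplified sketch of the estimator logic (paraphrased from the tcp_westwood.c referenced above; field and function names are simplified, the first-sample initialisation is omitted, and this is not a drop-in copy of the kernel code):

/* Simplified sketch of the Westwood+ bandwidth estimator described above. */
#include <stdint.h>

struct westwood_state {
    uint32_t bw_ns_est;   /* first-stage (noisy) bandwidth estimate    */
    uint32_t bw_est;      /* second-stage (smoothed) estimate          */
    uint32_t bk;          /* bytes acked since the last window sample  */
    uint32_t rtt;         /* latest unsmoothed RTT, in jiffies         */
    uint32_t rtt_win_sx;  /* start of current sampling window, jiffies */
};

/* The (7*prev + new)/8 low-pass filter, applied twice per sample. */
static uint32_t ww_filter(uint32_t prev, uint32_t sample)
{
    return (7 * prev + sample) / 8;
}

static void ww_update_bwe(struct westwood_state *w, uint32_t now_jiffies)
{
    uint32_t delta = now_jiffies - w->rtt_win_sx;

    /* Samples are taken at most once per (unsmoothed) rtt: with HZ=100 a
     * jiffy is 10 ms, and a bloated rtt stretches the interval between
     * samples, so the estimate adapts more slowly under bufferbloat. */
    if (w->rtt && delta > w->rtt) {
        uint32_t sample = w->bk / delta;                     /* bytes/jiffy */
        w->bw_ns_est = ww_filter(w->bw_ns_est, sample);
        w->bw_est    = ww_filter(w->bw_est, w->bw_ns_est);   /* second pass */
        w->bk = 0;
        w->rtt_win_sx = now_jiffies;
    }
}

With rtt bloated to several hundred milliseconds, each doubly-filtered sample moves the estimate only a fraction of the way towards the true rate, which would match the slow ramp-up visible in the chart.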
- Jerry
-----Original Message-----
From: Jonathan Morton [mailto:chromatix99@gmail.com]
Sent: Saturday, August 30, 2014 2:46 AM
To: Stephen Hemminger
Cc: Jerry Jongerius; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
On 30 Aug, 2014, at 9:28 am, Stephen Hemminger wrote:
> On Sat, 30 Aug 2014 09:05:58 +0300
> Jonathan Morton <chromatix99@gmail.com> wrote:
>
>>
>> On 29 Aug, 2014, at 5:37 pm, Jerry Jongerius wrote:
>>
>>>> did you check to see if packets were re-sent even if they weren't
>>>> lost? on of the side effects of excessive buffering is that it's
>>>> possible for a packet to be held in the buffer long enough that the
>>>> sender thinks that it's been lost and retransmits it, so the packet
>>>> is effectivly 'lost' even if it actually arrives at it's destination.
>>>
>>> Yes. A duplicate packet for the missing packet is not seen.
>>>
>>> The receiver 'misses' a packet; starts sending out tons of dup acks
>>> (for all packets in flight and queued up due to bufferbloat), and
>>> then way later, the packet does come in (after the RTT caused by
>>> bufferbloat; indicating it is the 'resent' packet).
>>
>> I think I've cracked this one - the cause, if not the solution.
>>
>> Let's assume, for the moment, that Jerry is correct and PowerBoost plays
no part in this. That implies that the flow is not using the full bandwidth
after the loss, *and* that the additive increase of cwnd isn't sufficient to
recover to that point within the test period.
>>
>> There *is* a sequence of events that can lead to that happening:
>>
>> 1) Packet is lost, at the tail end of the bottleneck queue.
>>
>> 2) Eventually, receiver sees the loss and starts sending duplicate acks
(each triggering CA_EVENT_SLOW_ACK path in the sender). Sender (running
Westwood+) assumes that each of these represents a received, full-size
packet, for bandwidth estimation purposes.
>>
>> 3) The receiver doesn't send, or the sender doesn't receive, a duplicate
ack for every packet actually received. Maybe some firewall sees a large
number of identical packets arriving - without SACK or timestamps, they
*would* be identical - and filters some of them. The bandwidth estimate
therefore becomes significantly lower than the true value, and additionally
the RTO fires and causes the sender to reset cwnd to 1 (CA_EVENT_LOSS).
>>
>> 4) The retransmitted packet finally reaches the receiver, and the ack it
sends includes all the data received in the meantime (about 3.5MB). This is
not sufficient to immediately reset the bandwidth estimate to the true
value, because the BWE is sampled at RTT intervals, and also includes
low-pass filtering.
>>
>> 5) This ends the recovery phase (CA_EVENT_CWR_COMPLETE), and the sender
resets the slow-start threshold to correspond to the estimated
delay-bandwidth product (MinRTT * BWE) at that moment.
>>
>> 6) This estimated DBP is lower than the true value, so the subsequent
slow-start phase ends with the cwnd inadequately sized. Additive increase
would eventually correct that - but the key word is *eventually*.
>>
>> - Jonathan Morton
>
> Bandwidth estimates by ack RTT is fraught with problems. The returning
> ACK can be delayed for any number of reasons such as other traffic or
> aggregation. This kind of delay based congestion control suffers badly
from any latency induced in the network.
> So instead of causing bloat, it gets hit by bloat.
In this case, the TCP is actually tracking RTT surprisingly well, but the
bandwidth estimate goes wrong because the duplicate ACKs go missing. Note
that if the MinRTT was estimated too high (which is the only direction it
could go), this would result in the slow-start threshold being *higher* than
required, and the symptoms observed would not occur, since the cwnd would
grow to the required value after recovery.
This is the opposite effect from what happens to TCP Vegas in a bloated
environment. Vegas stops increasing cwnd when the estimated RTT is
noticeably higher than MinRTT, but if the true MinRTT changes (or it has to
compete with a non-Vegas TCP flow), it has trouble tracking that fact.
There is another possibility: that the assumption of non-queue RTT being
constant against varying bandwidth is incorrect. If that is the case, then
the observed behaviour can be explained without recourse to lost duplicate
ACKs - so Westwood+ is correctly tracking both MinRTT and BWE - but (MinRTT
* BWE) turns out to be a poor estimate of the true BDP. I think this still
fails to explain why the cwnd is reset (which should occur only on RTO), but
everything else potentially fits.
I think we can distinguish the two theories by running tests against a
server that supports SACK and timestamps, and where ideally we can capture
packet traces at both ends.
- Jonathan Morton
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bloat] The Dark Problem with AQM in the Internet?
2014-09-01 17:30 ` Jerry Jongerius
@ 2014-09-01 17:40 ` Dave Taht
0 siblings, 0 replies; 38+ messages in thread
From: Dave Taht @ 2014-09-01 17:40 UTC (permalink / raw)
To: Jerry Jongerius; +Cc: bloat
On Mon, Sep 1, 2014 at 10:30 AM, Jerry Jongerius <jerryj@duckware.com> wrote:
> Westwood+, as described in published researched papers, does not fully
> explain the graph that was seen. However, Westwood+, as implemented in
> Linux, DOES fully explain the graph that was seen. One place to review the
> source code is here:
>
> http://lxr.free-electrons.com/source/net/ipv4/tcp_westwood.c?v=3.2
>
> Some observations about this code:
>
> 1. The bandwidth estimate is run through a “(7×prev+new)/8” filter TWICE
> [see lines 93-94].
> 2. The units of time for all objects in the code (rtt, bwe, delta, etc) is
> ‘jiffies’, not milliseconds, nor microseconds [see line 108].
> 3. The bandwidth estimate is updated every “rtt” with the test in the code
> (line 139) essentially: delta>rtt. However, “rtt” is the last unsmoothed
> rtt seen on the link (and increasing during bufferbloat). When rtt
> increases, the frequency of bandwidth updates drops.
> 4. The server is Linux 3.2 with HZ=100 (meaning jiffies increases every
> 10ms).
Oy, this also means that there is no BQL on this server, and thus its
TX ring can get quite filled. So I'd like to see what a BQL-enabled server
does to Westwood+ now.
https://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel
I imagine that TCP offloads are enabled, also? So much work went into fixing
things like TCP timestamps, etc., in the face of TSO, after 3.2.
>
> When you graph some of the raw data observed (see
> http://www.duckware.com/blog/the-dark-problem-with-aqm-in-the-internet/image
> s/chart.gif), the Westwood+ bandwidth estimate takes significant time to
> ramp up.
>
> For the first 0.84 seconds of the download, we expect the Westwood+ code to
> update the bandwidth estimate around 14 times, or once every 60ms or so.
> However, after this, we know there is a bufferbloat episode, with RTT times
> increasing (decreasing the frequency of bandwidth updates). The red line in
> the graph above suggests that Westwood might have only updated the bandwidth
> estimate around 9-10 more times, before using it to set cwnd/ssthresh.
>
> - Jerry
>
>
>
>
> -----Original Message-----
> From: Jonathan Morton [mailto:chromatix99@gmail.com]
> Sent: Saturday, August 30, 2014 2:46 AM
> To: Stephen Hemminger
> Cc: Jerry Jongerius; bloat@lists.bufferbloat.net
> Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
>
>
> On 30 Aug, 2014, at 9:28 am, Stephen Hemminger wrote:
>
>> On Sat, 30 Aug 2014 09:05:58 +0300
>> Jonathan Morton <chromatix99@gmail.com> wrote:
>>
>>>
>>> On 29 Aug, 2014, at 5:37 pm, Jerry Jongerius wrote:
>>>
>>>>> did you check to see if packets were re-sent even if they weren't
>>>>> lost? on of the side effects of excessive buffering is that it's
>>>>> possible for a packet to be held in the buffer long enough that the
>>>>> sender thinks that it's been lost and retransmits it, so the packet
>>>>> is effectivly 'lost' even if it actually arrives at it's destination.
>>>>
>>>> Yes. A duplicate packet for the missing packet is not seen.
>>>>
>>>> The receiver 'misses' a packet; starts sending out tons of dup acks
>>>> (for all packets in flight and queued up due to bufferbloat), and
>>>> then way later, the packet does come in (after the RTT caused by
>>>> bufferbloat; indicating it is the 'resent' packet).
>>>
>>> I think I've cracked this one - the cause, if not the solution.
>>>
>>> Let's assume, for the moment, that Jerry is correct and PowerBoost plays
> no part in this. That implies that the flow is not using the full bandwidth
> after the loss, *and* that the additive increase of cwnd isn't sufficient to
> recover to that point within the test period.
>>>
>>> There *is* a sequence of events that can lead to that happening:
>>>
>>> 1) Packet is lost, at the tail end of the bottleneck queue.
>>>
>>> 2) Eventually, receiver sees the loss and starts sending duplicate acks
> (each triggering CA_EVENT_SLOW_ACK path in the sender). Sender (running
> Westwood+) assumes that each of these represents a received, full-size
> packet, for bandwidth estimation purposes.
>>>
>>> 3) The receiver doesn't send, or the sender doesn't receive, a duplicate
> ack for every packet actually received. Maybe some firewall sees a large
> number of identical packets arriving - without SACK or timestamps, they
> *would* be identical - and filters some of them. The bandwidth estimate
> therefore becomes significantly lower than the true value, and additionally
> the RTO fires and causes the sender to reset cwnd to 1 (CA_EVENT_LOSS).
>>>
>>> 4) The retransmitted packet finally reaches the receiver, and the ack it
> sends includes all the data received in the meantime (about 3.5MB). This is
> not sufficient to immediately reset the bandwidth estimate to the true
> value, because the BWE is sampled at RTT intervals, and also includes
> low-pass filtering.
>>>
>>> 5) This ends the recovery phase (CA_EVENT_CWR_COMPLETE), and the sender
> resets the slow-start threshold to correspond to the estimated
> delay-bandwidth product (MinRTT * BWE) at that moment.
>>>
>>> 6) This estimated DBP is lower than the true value, so the subsequent
> slow-start phase ends with the cwnd inadequately sized. Additive increase
> would eventually correct that - but the key word is *eventually*.
>>>
>>> - Jonathan Morton
>>
>> Bandwidth estimates by ack RTT is fraught with problems. The returning
>> ACK can be delayed for any number of reasons such as other traffic or
>> aggregation. This kind of delay based congestion control suffers badly
> from any latency induced in the network.
>> So instead of causing bloat, it gets hit by bloat.
>
> In this case, the TCP is actually tracking RTT surprisingly well, but the
> bandwidth estimate goes wrong because the duplicate ACKs go missing. Note
> that if the MinRTT was estimated too high (which is the only direction it
> could go), this would result in the slow-start threshold being *higher* than
> required, and the symptoms observed would not occur, since the cwnd would
> grow to the required value after recovery.
>
> This is the opposite effect from what happens to TCP Vegas in a bloated
> environment. Vegas stops increasing cwnd when the estimated RTT is
> noticeably higher than MinRTT, but if the true MinRTT changes (or it has to
> compete with a non-Vegas TCP flow), it has trouble tracking that fact.
>
> There is another possibility: that the assumption of non-queue RTT being
> constant against varying bandwidth is incorrect. If that is the case, then
> the observed behaviour can be explained without recourse to lost duplicate
> ACKs - so Westwood+ is correctly tracking both MinRTT and BWE - but (MinRTT
> * BWE) turns out to be a poor estimate of the true BDP. I think this still
> fails to explain why the cwnd is reset (which should occur only on RTO), but
> everything else potentially fits.
>
> I think we can distinguish the two theories by running tests against a
> server that supports SACK and timestamps, and where ideally we can capture
> packet traces at both ends.
>
> - Jonathan Morton
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2014-09-01 17:40 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-23 18:16 [Bloat] The Dark Problem with AQM in the Internet? Jerry Jongerius
2014-08-23 19:30 ` Jonathan Morton
2014-08-23 20:01 ` Sebastian Moeller
2014-08-25 17:13 ` Greg White
2014-08-25 18:09 ` Jim Gettys
2014-08-25 19:12 ` Sebastian Moeller
2014-08-25 21:17 ` Bill Ver Steeg (versteb)
2014-08-25 21:20 ` Bill Ver Steeg (versteb)
2014-08-28 13:19 ` Jerry Jongerius
2014-08-28 14:07 ` Jonathan Morton
2014-08-28 17:20 ` Jerry Jongerius
2014-08-28 17:41 ` Dave Taht
2014-08-28 18:15 ` Jonathan Morton
2014-08-29 14:21 ` Jerry Jongerius
2014-08-29 16:31 ` Jonathan Morton
2014-08-29 16:54 ` Jerry Jongerius
2014-08-28 18:59 ` Sebastian Moeller
2014-08-29 11:33 ` Jerry Jongerius
2014-08-29 12:18 ` Sebastian Moeller
2014-08-29 14:42 ` Dave Taht
2014-08-29 1:59 ` David Lang
2014-08-29 14:37 ` Jerry Jongerius
2014-08-30 6:05 ` Jonathan Morton
2014-08-30 6:28 ` Stephen Hemminger
2014-08-30 6:45 ` Jonathan Morton
2014-09-01 17:30 ` Jerry Jongerius
2014-09-01 17:40 ` Dave Taht
2014-08-28 14:39 ` Rich Brown
2014-08-28 16:20 ` Jerry Jongerius
2014-08-28 16:35 ` Fred Baker (fred)
2014-08-28 18:00 ` Jan Ceuleers
2014-08-28 18:13 ` Dave Taht
2014-08-29 1:57 ` David Lang
2014-08-28 18:41 ` Kenyon Ralph
2014-08-28 19:04 ` Dave Taht
2014-08-28 16:36 ` Greg White
2014-08-28 16:52 ` Bill Ver Steeg (versteb)
2014-09-01 11:47 ` Richard Scheffenegger