* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 10:36 [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps? Jim Gettys
@ 2011-03-15 14:40 ` Jim Gettys
2011-03-15 16:47 ` Jonathan Morton
2011-03-15 16:34 ` Jonathan Morton
` (3 subsequent siblings)
4 siblings, 1 reply; 70+ messages in thread
From: Jim Gettys @ 2011-03-15 14:40 UTC (permalink / raw)
To: bloat
[-- Attachment #1: Type: text/plain, Size: 3785 bytes --]
I've just been re-reading Van's "A Rant on Queues", found at:
http://pollere.net/Pdfdocs/QrantJul06.pdf and am struck by a few
observations in particular there.
Slide 14 is useful to focus the mind on the problem we face with home
routers (or many of these individual boxes in the network): " Three minor
(and completely standard) variations in protocol implementation give three
wildly different average queue lengths. I.e., the average queue
length contains *no* information about demand or load."
This sends me thinking in interesting directions... First, to stop focusing
on the current length of the queues as having any useful information. It
doesn't. Any buffer (or delay) in the system will translate to window
opening, and apps attempting to fill them. We gotta keep the queue length
down.
Slide 32 states:
Suggestions
• Queue length is meaningless (but long term min can be useful).
• Try have at least a bandwidth*delay of buffer.
• Don’t let it stay full.
• ....
Ok, I am struck by the first suggestion: we can in fact monitor the long
term minimum time, whether by using TCP timestamps, or other hackish
measures.
Whenever packets are delayed by significantly more than this minimum time,
we know we are congested, and should be marking. We really want to cause
the buffer to empty.
There is an interesting question about what "long term minimum" means
here...
- Jim
On Tue, Mar 15, 2011 at 6:36 AM, Jim Gettys <jg@freedesktop.org> wrote:
> I've been watching all the discussion of different TCP flavours with a
> certain amount of disquiet; this is not because I think working on
> improvements to TCP are bad; in fact, it is clear for wireless we could do
> with improvements in algorithms. I'm not trying to discourage work on this
> topic.
>
> My disquiet is otherwise; it is:
> 0) the buffers can be filled by any traffic, not necessarily your own
> (in fact, often that of others), so improving your behaviour, while
> admirable, doesn't mean you or others sharing any piece of your won't
> suffer.
> 1) the bloated buffers are already all over, and updating hosts is often
> a very slow process.
> 2) suffering from this bloat is due to the lack of signalling congestion
> to congestion avoiding protocols.
>
> OK, what does this mean? it means not that we should abandon improving
> TCP; but that doing so won't fundamentally eliminate bufferbloat suffering.
> It won't get us to a fundamentally different place, but only to marginally
> better places in terms of bufferbloating.
>
> The fundamental question, therefore, is how we start marking traffic during
> periods when the buffers fill (either by packet drop or by ECN), to provide
> the missing feedback in congestion avoiding protocol's servo system. No
> matter what flavour of protocol involved, they will then back off.
>
> Back last summer, to my surprise, when I asked Van Jacobson about my
> traces, he said all the required proof was already present in my traces,
> since modern Linux (and I presume other) operating systems had time stamps
> in them (the TCP timestamps option).
>
> Here's the off the wall idea. The buffers we observe are often many times
> (orders of magnitude) larger than any rational RTT.
>
> So the question I have is whether there is some technique whereby
> monitoring the timestamps that may already be present in the traffic (and
> knowing what "sane" RTT's are) that we can start marking traffic in time
> prevent the worst effects of bloating buffers?
> - Jim
>
>
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>
[-- Attachment #2: Type: text/html, Size: 4741 bytes --]
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 14:40 ` Jim Gettys
@ 2011-03-15 16:47 ` Jonathan Morton
2011-03-15 17:59 ` Don Marti
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 16:47 UTC (permalink / raw)
To: Jim Gettys; +Cc: bloat
On 15 Mar, 2011, at 4:40 pm, Jim Gettys wrote:
> There is an interesting question about what "long term minimum" means here...
VJ does expand on that in "RED in a different light". He means that the relevant measure of queue length is to take the minimum value over some interval of time, say 100ms or 1-2 RTTs, whichever is longer. The average queue length is irrelevant. The nRED algorithm in that paper proposes a method of doing that.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 16:47 ` Jonathan Morton
@ 2011-03-15 17:59 ` Don Marti
2011-03-15 18:14 ` Rick Jones
0 siblings, 1 reply; 70+ messages in thread
From: Don Marti @ 2011-03-15 17:59 UTC (permalink / raw)
To: bloat
begin Jonathan Morton quotation of Tue, Mar 15, 2011 at 06:47:17PM +0200:
> On 15 Mar, 2011, at 4:40 pm, Jim Gettys wrote:
>
> > There is an interesting question about what "long term minimum" means here...
>
> VJ does expand on that in "RED in a different light". He means that the relevant measure of queue length is to take the minimum value over some interval of time, say 100ms or 1-2 RTTs, whichever is longer. The average queue length is irrelevant. The nRED algorithm in that paper proposes a method of doing that.
It seems like a host ought to be able to track the
dwell time of packets in its own buffer(s), and drop
anything that it held onto too long.
Timestamp every packet going into the buffer, and
independently of any QoS work, check if a packet is
"stale" on its way out, and if so, drop it instead of
sending it. Is this in use anywhere? Haven't seen
it in the literature I've read linked to from Jim's
blog and this list.
--
Don Marti
http://zgp.org/~dmarti/
dmarti@zgp.org
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 17:59 ` Don Marti
@ 2011-03-15 18:14 ` Rick Jones
2011-03-15 18:31 ` John W. Linville
0 siblings, 1 reply; 70+ messages in thread
From: Rick Jones @ 2011-03-15 18:14 UTC (permalink / raw)
To: Don Marti; +Cc: bloat
On Tue, 2011-03-15 at 10:59 -0700, Don Marti wrote:
> begin Jonathan Morton quotation of Tue, Mar 15, 2011 at 06:47:17PM +0200:
> > On 15 Mar, 2011, at 4:40 pm, Jim Gettys wrote:
> >
> > > There is an interesting question about what "long term minimum" means here...
> >
> > VJ does expand on that in "RED in a different light". He means that the relevant measure of queue length is to take the minimum value over some interval of time, say 100ms or 1-2 RTTs, whichever is longer. The average queue length is irrelevant. The nRED algorithm in that paper proposes a method of doing that.
>
> It seems like a host ought to be able to track the
> dwell time of packets in its own buffer(s), and drop
> anything that it held onto too long.
>
> Timestamp every packet going into the buffer, and
> independently of any QoS work, check if a packet is
> "stale" on its way out, and if so, drop it instead of
> sending it. Is this in use anywhere? Haven't seen
> it in the literature I've read linked to from Jim's
> blog and this list.
Are there any NICs setup to allow (efficient) removal of packets from
the transmit queue (the one known to the NIC) once they have become
known to the NIC? I'm not a driver writer (I've only complained to them
that their drivers were using too much CPU :), but what little I've seen
suggests that the programming models of most (all?) NICs are such that
they assume the producer index only ever increases (modulo the queue
size)... Or put another way, the host giveth, but only the NIC taketh
away.
rick jones
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 18:14 ` Rick Jones
@ 2011-03-15 18:31 ` John W. Linville
2011-03-15 19:40 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: John W. Linville @ 2011-03-15 18:31 UTC (permalink / raw)
To: Rick Jones; +Cc: bloat
On Tue, Mar 15, 2011 at 11:14:37AM -0700, Rick Jones wrote:
> On Tue, 2011-03-15 at 10:59 -0700, Don Marti wrote:
> > begin Jonathan Morton quotation of Tue, Mar 15, 2011 at 06:47:17PM +0200:
> > > On 15 Mar, 2011, at 4:40 pm, Jim Gettys wrote:
> > >
> > > > There is an interesting question about what "long term minimum" means here...
> > >
> > > VJ does expand on that in "RED in a different light". He means that the relevant measure of queue length is to take the minimum value over some interval of time, say 100ms or 1-2 RTTs, whichever is longer. The average queue length is irrelevant. The nRED algorithm in that paper proposes a method of doing that.
> >
> > It seems like a host ought to be able to track the
> > dwell time of packets in its own buffer(s), and drop
> > anything that it held onto too long.
> >
> > Timestamp every packet going into the buffer, and
> > independently of any QoS work, check if a packet is
> > "stale" on its way out, and if so, drop it instead of
> > sending it. Is this in use anywhere? Haven't seen
> > it in the literature I've read linked to from Jim's
> > blog and this list.
>
> Are there any NICs setup to allow (efficient) removal of packets from
> the transmit queue (the one known to the NIC) once they have become
> known to the NIC? I'm not a driver writer (I've only complained to them
> that their drivers were using too much CPU :), but what little I've seen
> suggests that the programming models of most (all?) NICs are such that
> they assume the producer index only ever increases (modulo the queue
> size)... Or put another way, the host giveth, but only the NIC taketh
> away.
Right. This is more-or-less what I was driving-at with the various
versions of my eBDP patches. Once you hand the frame to the device,
you have no more control over it (at least for the devices I know
about today). All I can see that you can do is to evaluate how
quickly (i.e. how low latency) the device is moving those frames
it has been given, and then appropriately throttle the amount of
frames you continue to give the device. The question is how to do
that evaluation, which presumably involves some sort of timestamp.
How that evaluation should guide the throttling is also a question.
This is something that I think a lot of the discussion about qdiscs
is missing. Controlling the queue on the host is great, and there
is no point in maintaining long queues on the host. But the queues
on some devices are already huge. If you don't throttle _both_
the _enqueue_ and the _dequeue_, then you could be keeping a nice,
near-empty tx queue on the host and still have a long, bloated queue
building at the device.
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 18:31 ` John W. Linville
@ 2011-03-15 19:40 ` Jonathan Morton
2011-03-15 19:59 ` Rick Jones
2011-03-15 20:51 ` John W. Linville
0 siblings, 2 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 19:40 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat
On 15 Mar, 2011, at 8:31 pm, John W. Linville wrote:
> If you don't throttle _both_
> the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> near-empty tx queue on the host and still have a long, bloated queue
> building at the device.
Don't devices at least let you query how full their queue is?
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 19:40 ` Jonathan Morton
@ 2011-03-15 19:59 ` Rick Jones
2011-03-15 20:51 ` John W. Linville
1 sibling, 0 replies; 70+ messages in thread
From: Rick Jones @ 2011-03-15 19:59 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Tue, 2011-03-15 at 21:40 +0200, Jonathan Morton wrote:
> On 15 Mar, 2011, at 8:31 pm, John W. Linville wrote:
>
> > If you don't throttle _both_
> > the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> > near-empty tx queue on the host and still have a long, bloated queue
> > building at the device.
>
> Don't devices at least let you query how full their queue is?
I believe John is referring to the queue(s) in the intermediate devices.
While it may be possible to query the queue lengths there (at one point
the MIBs had entries for that) it is still impractical - for one thing,
apart from the next-hop device, the host knows nothing about queues out
there. Heck, the host may not even know about the next hop device - it
could be transparent.
rick jones
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 19:40 ` Jonathan Morton
2011-03-15 19:59 ` Rick Jones
@ 2011-03-15 20:51 ` John W. Linville
2011-03-15 21:31 ` Rick Jones
2011-03-15 22:01 ` Jonathan Morton
1 sibling, 2 replies; 70+ messages in thread
From: John W. Linville @ 2011-03-15 20:51 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Tue, Mar 15, 2011 at 09:40:06PM +0200, Jonathan Morton wrote:
>
> On 15 Mar, 2011, at 8:31 pm, John W. Linville wrote:
>
> > If you don't throttle _both_
> > the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> > near-empty tx queue on the host and still have a long, bloated queue
> > building at the device.
>
> Don't devices at least let you query how full their queue is?
I suppose it depends on what you mean? Presumably drivers know that,
or at least can figure it out. The accuracy of that might depend on
the exact mechanism, how often the tx rings are replinished, etc.
However, I'm not aware of any API that would let something in the
stack (e.g. a qdisc) query the device driver for the current device
queue depth. At least, I don't think Linux has one -- do other
kernels/stacks provide that?
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 20:51 ` John W. Linville
@ 2011-03-15 21:31 ` Rick Jones
2011-03-16 0:32 ` John W. Linville
2011-03-15 22:01 ` Jonathan Morton
1 sibling, 1 reply; 70+ messages in thread
From: Rick Jones @ 2011-03-15 21:31 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat
On Tue, 2011-03-15 at 16:51 -0400, John W. Linville wrote:
> On Tue, Mar 15, 2011 at 09:40:06PM +0200, Jonathan Morton wrote:
> >
> > On 15 Mar, 2011, at 8:31 pm, John W. Linville wrote:
> >
> > > If you don't throttle _both_
> > > the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> > > near-empty tx queue on the host and still have a long, bloated queue
> > > building at the device.
> >
> > Don't devices at least let you query how full their queue is?
>
> I suppose it depends on what you mean? Presumably drivers know that,
> or at least can figure it out. The accuracy of that might depend on
> the exact mechanism, how often the tx rings are replinished, etc.
>
> However, I'm not aware of any API that would let something in the
> stack (e.g. a qdisc) query the device driver for the current device
> queue depth. At least, I don't think Linux has one -- do other
> kernels/stacks provide that?
HP-UX's lanadmin (and I presume the nwmgr command in 11.31) command will
display the "classic" interface MIB stats, which includes the outbound
queue length. What it does (or should do) for that statistic in the
face of a multi-queue device I've no idea :)
rick jones
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 21:31 ` Rick Jones
@ 2011-03-16 0:32 ` John W. Linville
2011-03-16 1:02 ` Rick Jones
0 siblings, 1 reply; 70+ messages in thread
From: John W. Linville @ 2011-03-16 0:32 UTC (permalink / raw)
To: Rick Jones; +Cc: bloat
On Tue, Mar 15, 2011 at 02:31:59PM -0700, Rick Jones wrote:
> On Tue, 2011-03-15 at 16:51 -0400, John W. Linville wrote:
> > On Tue, Mar 15, 2011 at 09:40:06PM +0200, Jonathan Morton wrote:
> > >
> > > On 15 Mar, 2011, at 8:31 pm, John W. Linville wrote:
> > >
> > > > If you don't throttle _both_
> > > > the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> > > > near-empty tx queue on the host and still have a long, bloated queue
> > > > building at the device.
> > >
> > > Don't devices at least let you query how full their queue is?
> >
> > I suppose it depends on what you mean? Presumably drivers know that,
> > or at least can figure it out. The accuracy of that might depend on
> > the exact mechanism, how often the tx rings are replinished, etc.
> >
> > However, I'm not aware of any API that would let something in the
> > stack (e.g. a qdisc) query the device driver for the current device
> > queue depth. At least, I don't think Linux has one -- do other
> > kernels/stacks provide that?
>
> HP-UX's lanadmin (and I presume the nwmgr command in 11.31) command will
> display the "classic" interface MIB stats, which includes the outbound
> queue length. What it does (or should do) for that statistic in the
> face of a multi-queue device I've no idea :)
But that is capacity, right? Not current occupancy? I thought that
was the outcome of an earlier thread?
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 0:32 ` John W. Linville
@ 2011-03-16 1:02 ` Rick Jones
0 siblings, 0 replies; 70+ messages in thread
From: Rick Jones @ 2011-03-16 1:02 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat
On Tue, 2011-03-15 at 20:32 -0400, John W. Linville wrote:
> On Tue, Mar 15, 2011 at 02:31:59PM -0700, Rick Jones wrote:
> > On Tue, 2011-03-15 at 16:51 -0400, John W. Linville wrote:
> > > On Tue, Mar 15, 2011 at 09:40:06PM +0200, Jonathan Morton wrote:
> > > >
> > > > On 15 Mar, 2011, at 8:31 pm, John W. Linville wrote:
> > > >
> > > > > If you don't throttle _both_
> > > > > the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> > > > > near-empty tx queue on the host and still have a long, bloated queue
> > > > > building at the device.
> > > >
> > > > Don't devices at least let you query how full their queue is?
> > >
> > > I suppose it depends on what you mean? Presumably drivers know that,
> > > or at least can figure it out. The accuracy of that might depend on
> > > the exact mechanism, how often the tx rings are replinished, etc.
> > >
> > > However, I'm not aware of any API that would let something in the
> > > stack (e.g. a qdisc) query the device driver for the current device
> > > queue depth. At least, I don't think Linux has one -- do other
> > > kernels/stacks provide that?
> >
> > HP-UX's lanadmin (and I presume the nwmgr command in 11.31) command will
> > display the "classic" interface MIB stats, which includes the outbound
> > queue length. What it does (or should do) for that statistic in the
> > face of a multi-queue device I've no idea :)
>
> But that is capacity, right? Not current occupancy? I thought that
> was the outcome of an earlier thread?
No, HP-UX shows current occupancy on its interfaces. I think it is
Cisco which shows capacity - at least that is my recollection of one of
the other discussions.
rick
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 20:51 ` John W. Linville
2011-03-15 21:31 ` Rick Jones
@ 2011-03-15 22:01 ` Jonathan Morton
2011-03-15 22:19 ` Stephen Hemminger
2011-03-16 0:47 ` [Bloat] Random idea in reaction to all the discussion of TCP flavours " John W. Linville
1 sibling, 2 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 22:01 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat
On 15 Mar, 2011, at 10:51 pm, John W. Linville wrote:
>>> If you don't throttle _both_
>>> the _enqueue_ and the _dequeue_, then you could be keeping a nice,
>>> near-empty tx queue on the host and still have a long, bloated queue
>>> building at the device.
>>
>> Don't devices at least let you query how full their queue is?
>
> I suppose it depends on what you mean? Presumably drivers know that,
> or at least can figure it out. The accuracy of that might depend on
> the exact mechanism, how often the tx rings are replinished, etc.
>
> However, I'm not aware of any API that would let something in the
> stack (e.g. a qdisc) query the device driver for the current device
> queue depth. At least, I don't think Linux has one -- do other
> kernels/stacks provide that?
I get the impression that eBDP is supposed to work relatively close to the device driver, rather than in the core network stack. As such it's not a qdisc, but instead manages a parameter used by a well-behaved device driver. (The number of well-behaved device drivers appears to be small at present.)
So there's a queue in the qdisc, and there's a queue in the hardware, and eBDP tries to make the latter smaller when possible, allowing the former (which is potentially much more intelligent) to do more work.
There is a tradeoff with wireless devices: if the buffer is bigger, more packets can be aggregated into a single timeslot and a greater packet loss rate can be hidden by local retransmission, but the latency gets bigger. So bigger buffers are required when the network is running fast, and smaller buffers when it is running slow. Packets which don't fit in the hardware buffer go to the qdisc instead.
Meanwhile the qdisc can re-order packets (eg. SFQ) so that one packet from each of a number of different flows is presented to the device in turn. This tends to increase fairness and smoothness, and makes the delay on interactive traffic much less dependent on the queue length occupied by bulk flows. It can also detect congestion (eg. nRED, SFB) and mark packets to cause TCPs to back off. But the qdisc can only operate effectively, for both of these tasks, if the hardware buffers are as small as possible.
In short:
- Network-stack queues can be large as long as they are smart.
- Hardware buffers can be dumb but should be as small as possible.
Knowing the occupancy of the hardware buffer is useful if the size of the buffer cannot be changed, because it is then possible to simply decline to fill the buffer more than a certain amount. If you can also assume that packets are sent in order of submission, or by some other easy rule, then you can also infer the time that the oldest packet has spent there, and use it to tune the future occupancy limit even if you can't cancel the old packet.
Cancelling old packets is potentially desirable because it allows TCPs and applications to retransmit (which they will do anyway) without fear of exacerbating a wireless congestion collapse. I do appreciate that not all hardware will support this, however, and it should be totally unnecessary for wired links.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:01 ` Jonathan Morton
@ 2011-03-15 22:19 ` Stephen Hemminger
2011-03-15 22:26 ` Jonathan Morton
2011-03-16 0:47 ` [Bloat] Random idea in reaction to all the discussion of TCP flavours " John W. Linville
1 sibling, 1 reply; 70+ messages in thread
From: Stephen Hemminger @ 2011-03-15 22:19 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Wed, 16 Mar 2011 00:01:41 +0200
Jonathan Morton <chromatix99@gmail.com> wrote:
>
> On 15 Mar, 2011, at 10:51 pm, John W. Linville wrote:
>
> >>> If you don't throttle _both_
> >>> the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> >>> near-empty tx queue on the host and still have a long, bloated queue
> >>> building at the device.
> >>
> >> Don't devices at least let you query how full their queue is?
> >
> > I suppose it depends on what you mean? Presumably drivers know that,
> > or at least can figure it out. The accuracy of that might depend on
> > the exact mechanism, how often the tx rings are replinished, etc.
> >
> > However, I'm not aware of any API that would let something in the
> > stack (e.g. a qdisc) query the device driver for the current device
> > queue depth. At least, I don't think Linux has one -- do other
> > kernels/stacks provide that?
>
> I get the impression that eBDP is supposed to work relatively close to the device driver, rather than in the core network stack. As such it's not a qdisc, but instead manages a parameter used by a well-behaved device driver. (The number of well-behaved device drivers appears to be small at present.)
>
> So there's a queue in the qdisc, and there's a queue in the hardware, and eBDP tries to make the latter smaller when possible, allowing the former (which is potentially much more intelligent) to do more work.
>
> There is a tradeoff with wireless devices: if the buffer is bigger, more packets can be aggregated into a single timeslot and a greater packet loss rate can be hidden by local retransmission, but the latency gets bigger. So bigger buffers are required when the network is running fast, and smaller buffers when it is running slow. Packets which don't fit in the hardware buffer go to the qdisc instead.
>
> Meanwhile the qdisc can re-order packets (eg. SFQ) so that one packet from each of a number of different flows is presented to the device in turn. This tends to increase fairness and smoothness, and makes the delay on interactive traffic much less dependent on the queue length occupied by bulk flows. It can also detect congestion (eg. nRED, SFB) and mark packets to cause TCPs to back off. But the qdisc can only operate effectively, for both of these tasks, if the hardware buffers are as small as possible.
>
> In short:
>
> - Network-stack queues can be large as long as they are smart.
>
> - Hardware buffers can be dumb but should be as small as possible.
>
> Knowing the occupancy of the hardware buffer is useful if the size of the buffer cannot be changed, because it is then possible to simply decline to fill the buffer more than a certain amount. If you can also assume that packets are sent in order of submission, or by some other easy rule, then you can also infer the time that the oldest packet has spent there, and use it to tune the future occupancy limit even if you can't cancel the old packet.
>
> Cancelling old packets is potentially desirable because it allows TCPs and applications to retransmit (which they will do anyway) without fear of exacerbating a wireless congestion collapse. I do appreciate that not all hardware will support this, however, and it should be totally unnecessary for wired links.
Have you looked at actual hardware interfaces. They usually are designed to
be "fire and go" with little to no checking by CPU. This is intentional because
of the overhead of bus and CPU access. Once packets go into the tx ring there
is no choice but to send or shutdown the device.
--
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:19 ` Stephen Hemminger
@ 2011-03-15 22:26 ` Jonathan Morton
2011-03-15 22:36 ` Rick Jones
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 22:26 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: bloat
On 16 Mar, 2011, at 12:19 am, Stephen Hemminger wrote:
>> Knowing the occupancy of the hardware buffer is useful if the size of the buffer cannot be changed, because it is then possible to simply decline to fill the buffer more than a certain amount. If you can also assume that packets are sent in order of submission, or by some other easy rule, then you can also infer the time that the oldest packet has spent there, and use it to tune the future occupancy limit even if you can't cancel the old packet.
>>
>> Cancelling old packets is potentially desirable because it allows TCPs and applications to retransmit (which they will do anyway) without fear of exacerbating a wireless congestion collapse. I do appreciate that not all hardware will support this, however, and it should be totally unnecessary for wired links.
>
> Have you looked at actual hardware interfaces. They usually are designed to
> be "fire and go" with little to no checking by CPU. This is intentional because
> of the overhead of bus and CPU access. Once packets go into the tx ring there
> is no choice but to send or shutdown the device.
For a wired device that would certainly make sense. For a wireless device some extra flexibility is plausible, even if it doesn't exist in practice.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:26 ` Jonathan Morton
@ 2011-03-15 22:36 ` Rick Jones
2011-03-15 22:40 ` Jonathan Morton
2011-03-15 22:52 ` Eric Dumazet
0 siblings, 2 replies; 70+ messages in thread
From: Rick Jones @ 2011-03-15 22:36 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Stephen Hemminger, bloat
On Wed, 2011-03-16 at 00:26 +0200, Jonathan Morton wrote:
> On 16 Mar, 2011, at 12:19 am, Stephen Hemminger wrote:
>
> >> Knowing the occupancy of the hardware buffer is useful if the size of the buffer cannot be changed, because it is then possible to simply decline to fill the buffer more than a certain amount. If you can also assume that packets are sent in order of submission, or by some other easy rule, then you can also infer the time that the oldest packet has spent there, and use it to tune the future occupancy limit even if you can't cancel the old packet.
> >>
> >> Cancelling old packets is potentially desirable because it allows TCPs and applications to retransmit (which they will do anyway) without fear of exacerbating a wireless congestion collapse. I do appreciate that not all hardware will support this, however, and it should be totally unnecessary for wired links.
> >
> > Have you looked at actual hardware interfaces. They usually are designed to
> > be "fire and go" with little to no checking by CPU. This is intentional because
> > of the overhead of bus and CPU access. Once packets go into the tx ring there
> > is no choice but to send or shutdown the device.
>
> For a wired device that would certainly make sense. For a wireless
> device some extra flexibility is plausible, even if it doesn't exist
> in practice.
Back and forth synchronization between driver and device is
doubleplusungood. Being able to remove a packet on the tx queue already
made known to the NIC sounds like it could become a rathole. If you are
lucky, you *might* have a "valid/invalid" bit in a packet descriptor
that the driver could hope to set before the NIC had pulled-in a copy
across the I/O bus.
rick jones
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:36 ` Rick Jones
@ 2011-03-15 22:40 ` Jonathan Morton
2011-03-15 22:42 ` Stephen Hemminger
2011-03-15 22:52 ` Eric Dumazet
1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 22:40 UTC (permalink / raw)
To: rick.jones2; +Cc: Stephen Hemminger, bloat
On 16 Mar, 2011, at 12:36 am, Rick Jones wrote:
> Back and forth synchronization between driver and device is
> doubleplusungood. Being able to remove a packet on the tx queue already
> made known to the NIC sounds like it could become a rathole. If you are
> lucky, you *might* have a "valid/invalid" bit in a packet descriptor
> that the driver could hope to set before the NIC had pulled-in a copy
> across the I/O bus.
Since this would be on the order of a second after submission, this seems unlikely then.
The even better solution would be if the hardware timed-out an old packet by itself after about 1 second. Does this happen already? If not, can it?
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:40 ` Jonathan Morton
@ 2011-03-15 22:42 ` Stephen Hemminger
0 siblings, 0 replies; 70+ messages in thread
From: Stephen Hemminger @ 2011-03-15 22:42 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Wed, 16 Mar 2011 00:40:28 +0200
Jonathan Morton <chromatix99@gmail.com> wrote:
>
> On 16 Mar, 2011, at 12:36 am, Rick Jones wrote:
>
> > Back and forth synchronization between driver and device is
> > doubleplusungood. Being able to remove a packet on the tx queue already
> > made known to the NIC sounds like it could become a rathole. If you are
> > lucky, you *might* have a "valid/invalid" bit in a packet descriptor
> > that the driver could hope to set before the NIC had pulled-in a copy
> > across the I/O bus.
>
> Since this would be on the order of a second after submission, this seems unlikely then.
>
> The even better solution would be if the hardware timed-out an old packet by itself after about 1 second. Does this happen already? If not, can it?
>
> - Jonathan
The real problem was the IEEE 802 design that does retransmit at link level and
therefore ends up being outside of control of software.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:36 ` Rick Jones
2011-03-15 22:40 ` Jonathan Morton
@ 2011-03-15 22:52 ` Eric Dumazet
2011-03-15 23:02 ` Rick Jones
` (2 more replies)
1 sibling, 3 replies; 70+ messages in thread
From: Eric Dumazet @ 2011-03-15 22:52 UTC (permalink / raw)
To: rick.jones2; +Cc: Stephen Hemminger, bloat
Le mardi 15 mars 2011 à 15:36 -0700, Rick Jones a écrit :
> Back and forth synchronization between driver and device is
> doubleplusungood. Being able to remove a packet on the tx queue already
> made known to the NIC sounds like it could become a rathole. If you are
> lucky, you *might* have a "valid/invalid" bit in a packet descriptor
> that the driver could hope to set before the NIC had pulled-in a copy
> across the I/O bus.
There are two different use cases :
1) Wired devices, where we want to push more 10+ Gbps, so we can assume
a posted skb is transmitted immediately. Even a basic qdisc can be a
performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
avoid taking too many interrupts.
2) wireless, were typical bandwidth is small enough we can afford a
qdisc with a trafic shaper, good flow classification, whatever limit on
"maximum waiting time in qdisc queue or drop it" and a very small queue
on hardware ?
In both cases, we dont need to "cancel" a packet post to NIC hardware,
or we need special hardware support (some NICS already provide hardware
TX completion times)
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:52 ` Eric Dumazet
@ 2011-03-15 23:02 ` Rick Jones
2011-03-15 23:12 ` Jonathan Morton
2011-03-15 23:46 ` Dave Täht
2 siblings, 0 replies; 70+ messages in thread
From: Rick Jones @ 2011-03-15 23:02 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Stephen Hemminger, bloat
On Tue, 2011-03-15 at 23:52 +0100, Eric Dumazet wrote:
> Le mardi 15 mars 2011 à 15:36 -0700, Rick Jones a écrit :
> > Back and forth synchronization between driver and device is
> > doubleplusungood. Being able to remove a packet on the tx queue already
> > made known to the NIC sounds like it could become a rathole. If you are
> > lucky, you *might* have a "valid/invalid" bit in a packet descriptor
> > that the driver could hope to set before the NIC had pulled-in a copy
> > across the I/O bus.
>
> There are two different use cases :
>
> 1) Wired devices, where we want to push more 10+ Gbps, so we can assume
> a posted skb is transmitted immediately. Even a basic qdisc can be a
> performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
> avoid taking too many interrupts.
>
> 2) wireless, were typical bandwidth is small enough we can afford a
> qdisc with a trafic shaper, good flow classification, whatever limit on
> "maximum waiting time in qdisc queue or drop it" and a very small queue
> on hardware ?
So, I've no worries that my home system has plenty of "oomph" for fancy
things when speaking over wireless, but that is a desktop. How much
"oomph" relative to wireless bandwidth exists in hand-helds? Right now
I think of "wireless" as being, in essence, 100BTto1GbE (wild
handwaving) - do the CPUs in handhelds possess that much more "oomph"
than "regular" systems did when 100BT or 1GbE first appeared?
rick jones
>
> In both cases, we dont need to "cancel" a packet post to NIC hardware,
> or we need special hardware support (some NICS already provide hardware
> TX completion times)
>
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:52 ` Eric Dumazet
2011-03-15 23:02 ` Rick Jones
@ 2011-03-15 23:12 ` Jonathan Morton
2011-03-15 23:25 ` Rick Jones
2011-03-15 23:46 ` Dave Täht
2 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 23:12 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Stephen Hemminger, bloat
On 16 Mar, 2011, at 12:52 am, Eric Dumazet wrote:
>> Back and forth synchronization between driver and device is
>> doubleplusungood. Being able to remove a packet on the tx queue already
>> made known to the NIC sounds like it could become a rathole. If you are
>> lucky, you *might* have a "valid/invalid" bit in a packet descriptor
>> that the driver could hope to set before the NIC had pulled-in a copy
>> across the I/O bus.
>
> There are two different use cases :
>
> 1) Wired devices, where we want to push more 10+ Gbps, so we can assume
> a posted skb is transmitted immediately. Even a basic qdisc can be a
> performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
> avoid taking too many interrupts.
>
> 2) wireless, were typical bandwidth is small enough we can afford a
> qdisc with a trafic shaper, good flow classification, whatever limit on
> "maximum waiting time in qdisc queue or drop it" and a very small queue
> on hardware ?
>
> In both cases, we dont need to "cancel" a packet post to NIC hardware,
> or we need special hardware support (some NICS already provide hardware
> TX completion times)
Right, it is of course the wireless situation that I'm talking about. And I am being a bit provocative about it.
Ultimately, we need to be able to back off the transmit rate so hard that the transmit buffer is *empty* half of the time, in the relatively unusual cases where congestion collapse in the airspace has already occurred. If every node has a packet to transmit, they will carry on trying to get it on the air, and with the noise level already high and the data rate already low, there is no way to recover the network until some radios actually go silent for a while.
The congestion-collapse problem is, I think, not easy to replicate at home. It commonly occurs at conferences with hundreds of nodes in the room.
> Do the CPUs in handhelds possess that much more "oomph"
> than "regular" systems did when 100BT or 1GbE first appeared?
Typical handhelds now have anything from a 600MHz ARM11 (iPhone 3G) up to dual-core 1GHz+ Cortex-A9s (iPad2, Galaxy Tab). The latter are roughly equivalent to decent netbook/nettop hardware. There is typically 512MB RAM in total.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 23:12 ` Jonathan Morton
@ 2011-03-15 23:25 ` Rick Jones
2011-03-15 23:33 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: Rick Jones @ 2011-03-15 23:25 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Stephen Hemminger, bloat
> > Do the CPUs in handhelds possess that much more "oomph"
> > than "regular" systems did when 100BT or 1GbE first appeared?
>
> Typical handhelds now have anything from a 600MHz ARM11 (iPhone 3G) up
> to dual-core 1GHz+ Cortex-A9s (iPad2, Galaxy Tab). The latter are
> roughly equivalent to decent netbook/nettop hardware. There is
> typically 512MB RAM in total.
Pity that isn't enough memory to run SPECcpu2006 to take us out of the
mythical megahurts space :) But it might be enough to run
SPECcpu2000...
rick jones
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 23:25 ` Rick Jones
@ 2011-03-15 23:33 ` Jonathan Morton
0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 23:33 UTC (permalink / raw)
To: rick.jones2; +Cc: Stephen Hemminger, bloat
On 16 Mar, 2011, at 1:25 am, Rick Jones wrote:
>>> Do the CPUs in handhelds possess that much more "oomph"
>>> than "regular" systems did when 100BT or 1GbE first appeared?
>>
>> Typical handhelds now have anything from a 600MHz ARM11 (iPhone 3G) up
>> to dual-core 1GHz+ Cortex-A9s (iPad2, Galaxy Tab). The latter are
>> roughly equivalent to decent netbook/nettop hardware. There is
>> typically 512MB RAM in total.
>
> Pity that isn't enough memory to run SPECcpu2006 to take us out of the
> mythical megahurts space :) But it might be enough to run
> SPECcpu2000...
Nevertheless, I run my Internet through a 400MHz PowerBook G3, which can't be much faster than the iPhone 3G overall. If you can consider running AQM on that class of hardware (which I certainly would), or on a consumer-grade router, then there is little problem in putting it on a handheld these days.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:52 ` Eric Dumazet
2011-03-15 23:02 ` Rick Jones
2011-03-15 23:12 ` Jonathan Morton
@ 2011-03-15 23:46 ` Dave Täht
2011-03-16 0:49 ` Jonathan Morton
2011-03-16 22:07 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
2 siblings, 2 replies; 70+ messages in thread
From: Dave Täht @ 2011-03-15 23:46 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Stephen Hemminger, bloat
Eric Dumazet <eric.dumazet@gmail.com> writes:
> Le mardi 15 mars 2011 à 15:36 -0700, Rick Jones a écrit :
>> Back and forth synchronization between driver and device is
>> doubleplusungood. Being able to remove a packet on the tx queue already
>> made known to the NIC sounds like it could become a rathole. If you are
>> lucky, you *might* have a "valid/invalid" bit in a packet descriptor
>> that the driver could hope to set before the NIC had pulled-in a copy
>> across the I/O bus.
>
> There are two different use cases :
>
> 1) Wired devices, where we want to push more 10+ Gbps, so we can assume
> a posted skb is transmitted immediately. Even a basic qdisc can be a
> performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
> avoid taking too many interrupts.
To talk to this a bit, the huge dynamic range discrepancy between a
10GigE device and what it may be connected to worries me. Some form of
fair queuing should be applied before the data hits the driver.
It would be good to know what 10Gbps hw was capable of pushing more
smarts (such as nRED) further down into the hardware itself, this may
inform future software abstractions and future hardware designs.
>
> 2) wireless, were typical bandwidth is small enough we can afford a
> qdisc with a trafic shaper, good flow classification, whatever limit on
> "maximum waiting time in qdisc queue or drop it" and a very small queue
> on hardware ?
>
> In both cases, we dont need to "cancel" a packet post to NIC hardware,
> or we need special hardware support (some NICS already provide hardware
> TX completion times)
Which NICs? For example, a whole bunch of us (at least 7 so far) have
settled on the wndr3700 hardware as a good base for experimenting with
wireless solutions. Finding out what NICs are smart enough to be managed
this way would be a goodness.
(And ultimately help them compete better in the marketplace)
>
>
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 23:46 ` Dave Täht
@ 2011-03-16 0:49 ` Jonathan Morton
2011-03-16 1:02 ` Dave Täht
2011-03-16 22:07 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-16 0:49 UTC (permalink / raw)
To: Dave Täht; +Cc: Stephen Hemminger, bloat
On 16 Mar, 2011, at 1:46 am, Dave Täht wrote:
>> 1) Wired devices, where we want to push more 10+ Gbps, so we can assume
>> a posted skb is transmitted immediately. Even a basic qdisc can be a
>> performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
>> avoid taking too many interrupts.
>
> To talk to this a bit, the huge dynamic range discrepancy between a
> 10GigE device and what it may be connected to worries me. Some form of
> fair queuing should be applied before the data hits the driver.
You mean plugging a 10GigE card into a 10Base-T hub? :-D
For less ridiculous topologies, the queues would mostly be in other devices.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 0:49 ` Jonathan Morton
@ 2011-03-16 1:02 ` Dave Täht
2011-03-16 1:28 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: Dave Täht @ 2011-03-16 1:02 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Stephen Hemminger, bloat
Jonathan Morton <chromatix99@gmail.com> writes:
> On 16 Mar, 2011, at 1:46 am, Dave Täht wrote:
>
>>> 1) Wired devices, where we want to push more 10+ Gbps, so we can assume
>>> a posted skb is transmitted immediately. Even a basic qdisc can be a
>>> performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
>>> avoid taking too many interrupts.
>>
>> To talk to this a bit, the huge dynamic range discrepancy between a
>> 10GigE device and what it may be connected to worries me. Some form of
>> fair queuing should be applied before the data hits the driver.
>
> You mean plugging a 10GigE card into a 10Base-T hub? :-D
More like 10GigE into a 1Gig switch. Or spewing out the entire contents
of a stream to one destination across the internet.
>
> For less ridiculous topologies, the queues would mostly be in other devices.
But you flood them less with fair queuing, which was my point.
Nagle:
http://en.wikipedia.org/wiki/Fair_queuing
> - Jonathan
>
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 1:02 ` Dave Täht
@ 2011-03-16 1:28 ` Jonathan Morton
2011-03-16 1:59 ` Dave Täht
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-16 1:28 UTC (permalink / raw)
To: Dave Täht; +Cc: Stephen Hemminger, bloat
On 16 Mar, 2011, at 3:02 am, Dave Täht wrote:
>>>> 1) Wired devices, where we want to push more 10+ Gbps, so we can assume
>>>> a posted skb is transmitted immediately. Even a basic qdisc can be a
>>>> performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
>>>> avoid taking too many interrupts.
>>>
>>> To talk to this a bit, the huge dynamic range discrepancy between a
>>> 10GigE device and what it may be connected to worries me. Some form of
>>> fair queuing should be applied before the data hits the driver.
>>
>> You mean plugging a 10GigE card into a 10Base-T hub? :-D
>
> More like 10GigE into a 1Gig switch. Or spewing out the entire contents
> of a stream to one destination across the internet.
Then that's no different to what I have in my apartment right now - a GigE switch connected to a 100base-TX switch, then to a 2Mbps DSL uplink, which could then be routed (after bouncing around backhauls for a bit) through a 500Kbps 3G downlink to a computer I've isolated from the LAN.
If the flow is responsive, as with every sane TCP, the queue will end up in front of the slowest link - at the 3G tower. That's where the AQM would need to be. The GigE adapter in my nettop would be largely idle, as a normal function of the TCP congestion window.
If it isn't, the queue will build up at *every* narrowing of the channel and the packet loss will be astronomical. All AQM could do then is to pick any real traffic out of the din.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 1:28 ` Jonathan Morton
@ 2011-03-16 1:59 ` Dave Täht
2011-03-16 2:23 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: Dave Täht @ 2011-03-16 1:59 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Stephen Hemminger, bloat
Jonathan Morton <chromatix99@gmail.com> writes:
> On 16 Mar, 2011, at 3:02 am, Dave Täht wrote:
>
>>>>> 1) Wired devices, where we want to push more 10+ Gbps, so we can assume
>>>>> a posted skb is transmitted immediately. Even a basic qdisc can be a
>>>>> performance bottleneck. Set TX ring size to 256 or 1024+ buffers to
>>>>> avoid taking too many interrupts.
>>>>
>>>> To talk to this a bit, the huge dynamic range discrepancy between a
>>>> 10GigE device and what it may be connected to worries me. Some form of
>>>> fair queuing should be applied before the data hits the driver.
>>>
>>> You mean plugging a 10GigE card into a 10Base-T hub? :-D
>>
>> More like 10GigE into a 1Gig switch. Or spewing out the entire contents
>> of a stream to one destination across the internet.
>
> Then that's no different to what I have in my apartment right now - a
>GigE switch connected to a 100base-TX switch, then to a 2Mbps DSL
>uplink, which could then be routed (after bouncing around backhauls for
>a bit) through a 500Kbps 3G downlink to a computer I've isolated from
>the LAN.
Having won one battle today (with ecn and pfifo) I'm ill inclined to
start another...
The problem with your analogy is you are starting from the edge in,
rather from the center out. Consider a topology such as
10 Gig E - 10Gig Switch to 1Gig
| | | | | | | | | |
1 2 3 4 5 6 7 8 9 0
And the server connected servicing hundreds of flows. Statistically,
with fair queuing the number of receive buffers required per port will
be close to or equal to 1, where in a primitive FIFO setup, something >
10 are required.
Multiply the number of ports that are effectively strewn across the
eventual path along with the number of streams, and the enormous
disparity between the source data rate and the receive data rate is
lessened.
This is what I mean by talking to fair queuing or any of a zillion
variants thereof.
> If the flow is responsive, as with every sane TCP, the queue will end
But isn't with large buffers that we are dealing with at present with
bufferbloat.
> up in front of the slowest link - at the 3G tower. That's where the
> AQM would need to be. The GigE adapter in my nettop would be largely
> idle, as a normal function of the TCP congestion window.
Yes. But you started from a different place in your analogy than mine.
>
> If it isn't, the queue will build up at *every* narrowing of the
> channel and the packet loss will be astronomical. All AQM could do
> then is to pick any real traffic out of the din.
Never said we had only one problem. :)
>
> - Jonathan
>
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 1:59 ` Dave Täht
@ 2011-03-16 2:23 ` Jonathan Morton
2011-03-16 22:22 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-16 2:23 UTC (permalink / raw)
To: Dave Täht; +Cc: Stephen Hemminger, bloat
On 16 Mar, 2011, at 3:59 am, Dave Täht wrote:
> 10 Gig E - 10Gig Switch to 1Gig
> | | | | | | | | | |
> 1 2 3 4 5 6 7 8 9 0
>
> And the server connected servicing hundreds of flows. Statistically,
> with fair queuing the number of receive buffers required per port will
> be close to or equal to 1, where in a primitive FIFO setup, something >
> 10 are required.
Well, that's a rather different picture than I had before. I'd hazard a guess that most good switches can deal with that, but they are switches, not routers, so latency through them is expected to be even less.
With that said, at 10GE speeds you are approaching a megapacket per second if jumbo frames are not a significant fraction of the traffic. I think something like SFQ can be made to work at those speeds, but simply getting the data through the computer that fast is a fairly tough job. So I agree that if the NIC can do it by itself, so much the better.
On the flip side, at a megapacket per second, a thousand-packet buffer empties in a millisecond. That's less than a disk seek.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-16 2:23 ` Jonathan Morton
@ 2011-03-16 22:22 ` Richard Scheffenegger
2011-03-16 23:38 ` richard
` (2 more replies)
0 siblings, 3 replies; 70+ messages in thread
From: Richard Scheffenegger @ 2011-03-16 22:22 UTC (permalink / raw)
To: Jonathan Morton, "Dave Täht"; +Cc: Stephen Hemminger, bloat
Heretical question: Why must the congestion notification implemented as a
(distributed) function of the network itself, and take the reaction of the
end hosts into consideration. If the signaling would only indicate the local
congestion state, but then move the reaction to that into the end hosts, i
think the design would be much more simple.
If the network would let the (reactive) senders know the extent of the
current congestion, the end hosts can use more smarts and react to it
properly.
However, AQMs are designed with the standard TCP reaction in mind - half the
sending rate at any indication of congestion within one RTT.
(See DCTCP, Conex for additional information).
Furthermore, I learned that a couple of 10G switch vendors are planning to
have up to 4 GB of buffer RAM in their next generation of switches. So we
are not talking about thousands of packets in the buffer, but of millions of
packets (think of up to 400ms buffering if only a single 10G egress port is
being loaded in such a switch). Compared to the base RTT of a 10G network (a
few tens of microseconds, some vendors go even below a microsecond), this is
even more extreme than the home router / DSLAM scenario...
Regards,
Richard
----- Original Message -----
From: "Jonathan Morton" <chromatix99@gmail.com>
With that said, at 10GE speeds you are approaching a megapacket per second
if jumbo frames are not a significant fraction of the traffic. I think
something like SFQ can be made to work at those speeds, but simply getting
the data through the computer that fast is a fairly tough job. So I agree
that if the NIC can do it by itself, so much the better.
On the flip side, at a megapacket per second, a thousand-packet buffer
empties in a millisecond. That's less than a disk seek.
- Jonathan
_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-16 22:22 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
@ 2011-03-16 23:38 ` richard
2011-03-16 23:50 ` Rick Jones
2011-03-17 12:05 ` Fred Baker
2011-03-18 18:27 ` [Bloat] Random idea in reaction to all the discussion ofTCPflavours " Richard Scheffenegger
2 siblings, 1 reply; 70+ messages in thread
From: richard @ 2011-03-16 23:38 UTC (permalink / raw)
To: Richard Scheffenegger; +Cc: bloat
On Wed, 2011-03-16 at 23:22 +0100, Richard Scheffenegger wrote:
> Heretical question: Why must the congestion notification implemented as a
> (distributed) function of the network itself, and take the reaction of the
> end hosts into consideration. If the signaling would only indicate the local
> congestion state, but then move the reaction to that into the end hosts, i
> think the design would be much more simple.
>
I don't think it is heretical - I think it is pragmatic.
I see the problem as being 2 parts:
now - with current hardware in place and whatever (broken) AQM and ECN
might be available to be turned on
future - products not out of development yet, software updates, etc.
> If the network would let the (reactive) senders know the extent of the
> current congestion, the end hosts can use more smarts and react to it
> properly.
>
In other words - let the normal TCP "lose a few packets and fix the
window size" mechanism actually do its work and/or turn on ECN and get
the rest of the world to at least not drop it due to mis-configured
stuff, because that technology is in place now.
This is a "now" thing, achieved by publicity, cajoling, shaming,
education, etc. to the network engineers and management as well as the
general public - "Anything is better than nothing" so turn on your QOS
and set your bandrate limiter and turn on ECN and get some sort of AQM
working and oh, by the way, for this situation this, this, this and that
are suggestions"
> However, AQMs are designed with the standard TCP reaction in mind - half the
> sending rate at any indication of congestion within one RTT.
> (See DCTCP, Conex for additional information).
>
>
> Furthermore, I learned that a couple of 10G switch vendors are planning to
> have up to 4 GB of buffer RAM in their next generation of switches. So we
> are not talking about thousands of packets in the buffer, but of millions of
> packets (think of up to 400ms buffering if only a single 10G egress port is
> being loaded in such a switch). Compared to the base RTT of a 10G network (a
> few tens of microseconds, some vendors go even below a microsecond), this is
> even more extreme than the home router / DSLAM scenario...
>
The stuff for the future involves getting to the designers and getting
them to understand the situation they're putting the rest of the net in
while they play their pissing games with marketing (is there any other
reason for 4 GB of buffer in any switch?), etc. That and getting the
basic research done, measurement facilities in place, and long-term
methodologies to really fix the problem "forever".
So no, it isn't heretical :)
richard
> Regards,
> Richard
>
>
> ----- Original Message -----
> From: "Jonathan Morton" <chromatix99@gmail.com>
>
> With that said, at 10GE speeds you are approaching a megapacket per second
> if jumbo frames are not a significant fraction of the traffic. I think
> something like SFQ can be made to work at those speeds, but simply getting
> the data through the computer that fast is a fairly tough job. So I agree
> that if the NIC can do it by itself, so much the better.
>
> On the flip side, at a megapacket per second, a thousand-packet buffer
> empties in a millisecond. That's less than a disk seek.
>
> - Jonathan
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
--
Richard C. Pitt Pacific Data Capture
rcpitt@pacdat.net 604-644-9265
http://digital-rag.com www.pacdat.net
PGP Fingerprint: FCEF 167D 151B 64C4 3333 57F0 4F18 AF98 9F59 DD73
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-16 23:38 ` richard
@ 2011-03-16 23:50 ` Rick Jones
0 siblings, 0 replies; 70+ messages in thread
From: Rick Jones @ 2011-03-16 23:50 UTC (permalink / raw)
To: richard; +Cc: bloat
> > Furthermore, I learned that a couple of 10G switch vendors are planning to
> > have up to 4 GB of buffer RAM in their next generation of switches. So we
> > are not talking about thousands of packets in the buffer, but of millions of
> > packets (think of up to 400ms buffering if only a single 10G egress port is
> > being loaded in such a switch). Compared to the base RTT of a 10G network (a
> > few tens of microseconds, some vendors go even below a microsecond), this is
> > even more extreme than the home router / DSLAM scenario...
> >
>
> The stuff for the future involves getting to the designers and getting
> them to understand the situation they're putting the rest of the net in
> while they play their pissing games with marketing (is there any other
> reason for 4 GB of buffer in any switch?), etc. That and getting the
> basic research done, measurement facilities in place, and long-term
> methodologies to really fix the problem "forever".
4GB shared across N ports no? And 10GBASE-ER can go 40 km. Where might
pause and FCoE wedge into this?
rick jones
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-16 22:22 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
2011-03-16 23:38 ` richard
@ 2011-03-17 12:05 ` Fred Baker
2011-03-17 12:18 ` Fred Baker
2011-03-18 18:27 ` [Bloat] Random idea in reaction to all the discussion ofTCPflavours " Richard Scheffenegger
2 siblings, 1 reply; 70+ messages in thread
From: Fred Baker @ 2011-03-17 12:05 UTC (permalink / raw)
To: Richard Scheffenegger; +Cc: Stephen Hemminger, bloat
On Mar 16, 2011, at 3:22 PM, Richard Scheffenegger wrote:
> Heretical question: Why must the congestion notification implemented as a (distributed) function of the network itself, and take the reaction of the end hosts into consideration. If the signaling would only indicate the local congestion state, but then move the reaction to that into the end hosts, i think the design would be much more simple.
>
> If the network would let the (reactive) senders know the extent of the current congestion, the end hosts can use more smarts and react to it properly.
That's not heretical at all. You might be interested to look at Dina Katabi's XCP and Nandita Dukipati's RCP. Both work from the assumption that if a smallish information element is added to the network layer header by the transport and updated by routers en route, the end systems can calculate the correct window value and simply set it. Work on these protocols was done at MIT, USC/ISI, and Stanford over the past decade.
The problem is that it might have worked reasonably well in the 1990 Internet (although the "everything runs on IP" model didn't quite work there either), but doesn't reflect today's network. Think about various forms of tunnels (IP/IP, GRE, L2TP, ...), encrypted tunnel-mode VPNs such as the one I use every day to go to work, MPLS LSPs, Ethernet switches and especially Metropolitan Ethernet, and the today's broadband networks. In terms of modeling the network, the "Network of Networks" model is actually a pretty good one: IP tells the network *what* needs to be done, but the network then uses a vast array of underlying technologies to accomplish it. So saying "no problem, let the network update the IP header..." - the places that need to do so often can't see or can't change the IP header.
I'm very much in favor of ECN, which in all of the tests I have done has proven very effective at limiting queues to the knee. I'm also in favor of delay-based TCPs like CalTech FAST and the Hamilton and CAIA models; FAST tunes to having a small amount of data continuously in queue at the bottleneck, and Hamilton/CAIA tunes to a small bottleneck. The problem tends to be that the "TCP Mafia" - poorly named, but a smallish set of people who actually control widely-used TCP implementations - tend to very much believe in the loss-based model, in part because of poor performance from past delay-based implementations like Vegas and in part due to IPR concerns. Also, commercial interests like Google are pushing very hard for fast delivery of content, which is what is behind Linux' recent change to set the initial window segments.
There is also a needed educational effort. My company is unlikely to turn AQM or ECN on by default because we don't have customers asking us to do so, and my company's customers tend to think that loss is an indication of errors in the network.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-17 12:05 ` Fred Baker
@ 2011-03-17 12:18 ` Fred Baker
2011-03-17 17:27 ` Dave Täht
2011-03-18 18:30 ` Richard Scheffenegger
0 siblings, 2 replies; 70+ messages in thread
From: Fred Baker @ 2011-03-17 12:18 UTC (permalink / raw)
To: Richard Scheffenegger; +Cc: Stephen Hemminger, bloat
On Mar 17, 2011, at 5:05 AM, Fred Baker wrote:
> I'm very much in favor of ECN, which in all of the tests I have done has proven very effective at limiting queues to the knee. I'm also in favor of delay-based TCPs like CalTech FAST and the Hamilton and CAIA models; FAST tunes to having a small amount of data continuously in queue at the bottleneck, and Hamilton/CAIA tunes to a small bottleneck. The problem tends to be that the "TCP Mafia" - poorly named, but a smallish set of people who actually control widely-used TCP implementations - tend to very much believe in the loss-based model, in part because of poor performance from past delay-based implementations like Vegas and in part due to IPR concerns. Also, commercial interests like Google are pushing very hard for fast delivery of content, which is what is behind Linux' recent change to set the initial window segments.
I didn't say, and should have said: I'm also in favor of AQM in any form; I prefer marking to dropping, but both are signals to the end system. The issue is that we need the right mark/drop rate, and the algorithms are neither trivial nor (if the fact that after 20+ years Van and Kathy haven't yet published a red-lite paper they're happy with is any indication) well documented in the general case.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-17 12:18 ` Fred Baker
@ 2011-03-17 17:27 ` Dave Täht
2011-03-18 18:30 ` Richard Scheffenegger
1 sibling, 0 replies; 70+ messages in thread
From: Dave Täht @ 2011-03-17 17:27 UTC (permalink / raw)
To: Fred Baker; +Cc: Stephen Hemminger, bloat
Fred Baker <fred@cisco.com> writes:
> On Mar 17, 2011, at 5:05 AM, Fred Baker wrote:
>
>> I'm very much in favor of ECN, which in all of the tests I have done has proven very effective at limiting queues to the knee. I'm also in favor of delay-based TCPs like CalTech FAST and the Hamilton and CAIA models; FAST tunes to having a small amount of data continuously in queue at the bottleneck, and Hamilton/CAIA tunes to a small bottleneck. The problem tends to be that the "TCP Mafia" - poorly named, but a smallish set of people who actually control widely-used TCP implementations - tend to very much believe in the loss-based model, in part because of poor performance from past delay-based implementations like Vegas and in part due to IPR concerns. Also, commercial interests like Google are pushing very hard for fast delivery of content, which is what is behind Linux' recent change to set the initial window segments.
>
> I didn't say, and should have said: I'm also in favor of AQM in any form; I prefer marking to dropping, but both are signals to the end system. The issue is that we need the right mark/drop rate, and the algorithms are neither trivial nor (if the fact that after 20+ years Van and Kathy haven't yet published a red-lite paper they're happy with is any indication) well documented in the general case.
A mea culpa from a former ASIC designer, which discusses the
relationship between propagation delay, burstiness, and the real need
for something like RED, and why ASIC designers didn't make it more of a
priority.
"Our biggest mistake was in making queue management optional, and making it scary."
Really well explained, with good diagrams, too.
http://codingrelic.geekhold.com/2011/03/random-early-mea-culpa.html
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-17 12:18 ` Fred Baker
2011-03-17 17:27 ` Dave Täht
@ 2011-03-18 18:30 ` Richard Scheffenegger
2011-03-18 18:49 ` Fred Baker
1 sibling, 1 reply; 70+ messages in thread
From: Richard Scheffenegger @ 2011-03-18 18:30 UTC (permalink / raw)
To: Fred Baker; +Cc: Stephen Hemminger, bloat
How about trying to push for a default, that the logical egress buffers are
limited to say 90% of the physical capacity, and only ECN-enabled flows may
use the remaining 10% when they get marked...
Someone has to set an incentive for using ECN, unfortunately...
Richard
----- Original Message -----
From: "Fred Baker" <fred@cisco.com>
To: "Richard Scheffenegger" <rscheff@gmx.at>
Cc: "Stephen Hemminger" <shemminger@vyatta.com>; "bloat"
<bloat@lists.bufferbloat.net>
Sent: Thursday, March 17, 2011 1:18 PM
Subject: Re: [Bloat] Random idea in reaction to all the discussion of
TCPflavours - timestamps?
On Mar 17, 2011, at 5:05 AM, Fred Baker wrote:
> I'm very much in favor of ECN, which in all of the tests I have done has
> proven very effective at limiting queues to the knee. I'm also in favor of
> delay-based TCPs like CalTech FAST and the Hamilton and CAIA models; FAST
> tunes to having a small amount of data continuously in queue at the
> bottleneck, and Hamilton/CAIA tunes to a small bottleneck. The problem
> tends to be that the "TCP Mafia" - poorly named, but a smallish set of
> people who actually control widely-used TCP implementations - tend to very
> much believe in the loss-based model, in part because of poor performance
> from past delay-based implementations like Vegas and in part due to IPR
> concerns. Also, commercial interests like Google are pushing very hard for
> fast delivery of content, which is what is behind Linux' recent change to
> set the initial window segments.
I didn't say, and should have said: I'm also in favor of AQM in any form; I
prefer marking to dropping, but both are signals to the end system. The
issue is that we need the right mark/drop rate, and the algorithms are
neither trivial nor (if the fact that after 20+ years Van and Kathy haven't
yet published a red-lite paper they're happy with is any indication) well
documented in the general case.=
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-18 18:30 ` Richard Scheffenegger
@ 2011-03-18 18:49 ` Fred Baker
2011-03-20 11:40 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: Fred Baker @ 2011-03-18 18:49 UTC (permalink / raw)
To: Richard Scheffenegger; +Cc: Stephen Hemminger, bloat
On Mar 18, 2011, at 11:30 AM, Richard Scheffenegger wrote:
>
> How about trying to push for a default, that the logical egress buffers are limited to say 90% of the physical capacity, and only ECN-enabled flows may use the remaining 10% when they get marked...
Lots of questions in that; 90% of the buffers in what? In a host, in a router, on a card in a router, in a queue's configured maximum depth, what? One would need some pedagogic support in the form of simulations - why 90% vs 91% vs 10% vs whatever?
> Someone has to set an incentive for using ECN, unfortunately...
Yes. From my perspective, the right approach is probably more like introducing a mark/drop threshold and a drop threshold. Taking the model that every M time units we "do something" like
if queue depth exceeds <toobig>
reduce M
drop something
else if queue depth exceeds <big>
reduce M
select something
if it is ECN-capable,
mark it congestion-experienced
else
drop it
else is below <hysteresis limit>
increase M
the advantage of ECN traffic is that it is less likely to be dropped. That might be a reasonable approach.
> Richard
>
> ----- Original Message ----- From: "Fred Baker" <fred@cisco.com>
> To: "Richard Scheffenegger" <rscheff@gmx.at>
> Cc: "Stephen Hemminger" <shemminger@vyatta.com>; "bloat" <bloat@lists.bufferbloat.net>
> Sent: Thursday, March 17, 2011 1:18 PM
> Subject: Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
>
>
>
> On Mar 17, 2011, at 5:05 AM, Fred Baker wrote:
>
>> I'm very much in favor of ECN, which in all of the tests I have done has proven very effective at limiting queues to the knee. I'm also in favor of delay-based TCPs like CalTech FAST and the Hamilton and CAIA models; FAST tunes to having a small amount of data continuously in queue at the bottleneck, and Hamilton/CAIA tunes to a small bottleneck. The problem tends to be that the "TCP Mafia" - poorly named, but a smallish set of people who actually control widely-used TCP implementations - tend to very much believe in the loss-based model, in part because of poor performance from past delay-based implementations like Vegas and in part due to IPR concerns. Also, commercial interests like Google are pushing very hard for fast delivery of content, which is what is behind Linux' recent change to set the initial window segments.
>
> I didn't say, and should have said: I'm also in favor of AQM in any form; I prefer marking to dropping, but both are signals to the end system. The issue is that we need the right mark/drop rate, and the algorithms are neither trivial nor (if the fact that after 20+ years Van and Kathy haven't yet published a red-lite paper they're happy with is any indication) well documented in the general case.=
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-18 18:49 ` Fred Baker
@ 2011-03-20 11:40 ` Jonathan Morton
2011-03-20 22:18 ` david
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-20 11:40 UTC (permalink / raw)
To: Fred Baker; +Cc: Stephen Hemminger, bloat
On 18 Mar, 2011, at 8:49 pm, Fred Baker wrote:
>> How about trying to push for a default, that the logical egress buffers are limited to say 90% of the physical capacity, and only ECN-enabled flows may use the remaining 10% when they get marked...
>
> Lots of questions in that; 90% of the buffers in what? In a host, in a router, on a card in a router, in a queue's configured maximum depth, what? One would need some pedagogic support in the form of simulations - why 90% vs 91% vs 10% vs whatever?
>
>> Someone has to set an incentive for using ECN, unfortunately...
>
> Yes. From my perspective, the right approach is probably more like introducing a mark/drop threshold and a drop threshold. Taking the model that every M time units we "do something" like
>
> if queue depth exceeds <toobig>
> reduce M
> drop something
> else if queue depth exceeds <big>
> reduce M
> select something
> if it is ECN-capable,
> mark it congestion-experienced
> else
> drop it
> else is below <hysteresis limit>
> increase M
>
> the advantage of ECN traffic is that it is less likely to be dropped. That might be a reasonable approach.
That does actually seem reasonable. What's the betting that HW vendors still say it's too complicated? :-D
I think we can come up with some simple empirical rules for choosing queue sizes. I may be half-remembering something VJ wrote, but here's a starting point:
0) Buffering more than 1 second of data is always unacceptable.
1) Measure (or estimate) the RTT of a full-sized packet over the exit link and back, then add 100ms for typical Internet latency, calling this total T1. If T1 is more than 500ms, clamp it to 500ms. Calculate T2 to be twice T1; this will be at most 1000ms.
2) Measure (or estimate) the throughput BW of the exit link, in bytes per second.
3) Calculate ideal queue length (in bytes) Q1 as T1 * BW, and the maximal queue length Q2 as T2 * BW. These may optionally be rounded to the nearest multiple of a whole packet size, if that is convenient for the hardware.
4) If the link quality is strongly time-varying, eg. mobile wireless, recalculate Q1 and Q2 as above regularly.
5) If the link speed depends on the type of equipment at the other end, the quality of cabling, or other similar factors, use the actual negotiated link speed when calculating BW. When these factors change, recalculate as above.
I would take the "hysteresis limit" to be an empty queue for the above algorithm.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 11:40 ` Jonathan Morton
@ 2011-03-20 22:18 ` david
2011-03-20 22:45 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: david @ 2011-03-20 22:18 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Stephen Hemminger, bloat
On Sun, 20 Mar 2011, Jonathan Morton wrote:
> I think we can come up with some simple empirical rules for choosing queue sizes. I may be half-remembering something VJ wrote, but here's a starting point:
>
> 0) Buffering more than 1 second of data is always unacceptable.
what about satellite links? my understanding is that the four round trips
to geosync orbit (request up, down, reply up down) result in approximatly
1 sec round trip.
David Lang
> 1) Measure (or estimate) the RTT of a full-sized packet over the exit link and back, then add 100ms for typical Internet latency, calling this total T1. If T1 is more than 500ms, clamp it to 500ms. Calculate T2 to be twice T1; this will be at most 1000ms.
>
> 2) Measure (or estimate) the throughput BW of the exit link, in bytes per second.
>
> 3) Calculate ideal queue length (in bytes) Q1 as T1 * BW, and the maximal queue length Q2 as T2 * BW. These may optionally be rounded to the nearest multiple of a whole packet size, if that is convenient for the hardware.
>
> 4) If the link quality is strongly time-varying, eg. mobile wireless, recalculate Q1 and Q2 as above regularly.
>
> 5) If the link speed depends on the type of equipment at the other end, the quality of cabling, or other similar factors, use the actual negotiated link speed when calculating BW. When these factors change, recalculate as above.
>
> I would take the "hysteresis limit" to be an empty queue for the above algorithm.
>
> - Jonathan
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 22:18 ` david
@ 2011-03-20 22:45 ` Jonathan Morton
2011-03-20 22:50 ` Dave Täht
2011-03-21 1:28 ` david
0 siblings, 2 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-20 22:45 UTC (permalink / raw)
To: david; +Cc: Stephen Hemminger, bloat
On 21 Mar, 2011, at 12:18 am, david@lang.hm wrote:
>> 0) Buffering more than 1 second of data is always unacceptable.
>
> what about satellite links? my understanding is that the four round trips to geosync orbit (request up, down, reply up down) result in approximatly 1 sec round trip.
That is true, but it doesn't require more than a full second of buffering, just lots of FEC to avoid packet loss on the link. At those timescales, you want the flow to look smooth, not bursty. Bursty is normal at 100ms timescales.
What I've heard is that most consumer satellite links use split-TCP anyway (proxy boxes at each end) thus relieving the Internet at large from coping with an unusual problem. However, it also seems likely that backbone satellite links exist which do not use this technique. I heard something about South America, maybe?
Anyway, with a 1-second RTT, the formula comes out to max 1 second of buffering because of the clamping.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 22:45 ` Jonathan Morton
@ 2011-03-20 22:50 ` Dave Täht
2011-03-20 22:55 ` grenville armitage
2011-03-20 22:58 ` Jonathan Morton
2011-03-21 1:28 ` david
1 sibling, 2 replies; 70+ messages in thread
From: Dave Täht @ 2011-03-20 22:50 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Stephen Hemminger, bloat
Jonathan Morton <chromatix99@gmail.com> writes:
> On 21 Mar, 2011, at 12:18 am, david@lang.hm wrote:
>
>>> 0) Buffering more than 1 second of data is always unacceptable.
Well, in the case of the DTN, it's required.
We're not testing interplanetary networks here, (rather, an artificially
induced one extending out well beyond the moon!) but it bears a little
thinking about.
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 22:50 ` Dave Täht
@ 2011-03-20 22:55 ` grenville armitage
2011-03-20 23:04 ` Dave Täht
2011-03-20 22:58 ` Jonathan Morton
1 sibling, 1 reply; 70+ messages in thread
From: grenville armitage @ 2011-03-20 22:55 UTC (permalink / raw)
To: bloat
On 03/21/2011 09:50, Dave Täht wrote:
[..]
> We're not testing interplanetary networks here, (rather, an artificially
> induced one extending out well beyond the moon!) but it bears a little
> thinking about.
Perhaps an idea for presenting bufferbloat visually? Draw a picture of the
space around the Earth, with circles around the earth whose diameters are
proportional to bufferbloat-induced equivalent RTT across different
ISP links, or different consumer hardware, etc. "Bufferbloat puts New York
on the far side of the moon!" might be a tagline to get people's attention ;)
cheers,
gja
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 22:55 ` grenville armitage
@ 2011-03-20 23:04 ` Dave Täht
2011-03-20 23:14 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: Dave Täht @ 2011-03-20 23:04 UTC (permalink / raw)
To: grenville armitage; +Cc: bloat
grenville armitage <garmitage@swin.edu.au> writes:
> On 03/21/2011 09:50, Dave Täht wrote:
> [..]
>
>> We're not testing interplanetary networks here, (rather, an
>> artificially induced one extending out well beyond the moon!) but it
>> bears a little thinking about.
>
> Perhaps an idea for presenting bufferbloat visually? Draw a picture of
> the space around the Earth, with circles around the earth whose
> diameters are proportional to bufferbloat-induced equivalent RTT
> across different ISP links, or different consumer hardware,
> etc. "Bufferbloat puts New York on the far side of the moon!" might be
> a tagline to get people's attention ;)
That is similar to one of the ideas I had while prototyping the cosmic
background bufferbloat detector. (Since stalled out due issues with
mapping ntp data types to postgres and postgis data types)
There are some excellent caida and network geography maps that do something
like this, and it would make the point, thoroughly
See, for example:
http://www.mail-archive.com/bloat@lists.bufferbloat.net/msg00050.html
And the graphic at the middle left down (linked from the above) at:
http://personalpages.manchester.ac.uk/staff/m.dodge/cybergeography//atlas/geographic.html
Several other ideas in the above worth thinking about. I mentally see a
map of the US pulsating like a old winamp graphic eq plugin, with
vertical 3D bars over each location extending into space while the earth
rotates under the terminator....
>
> cheers,
> gja
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 23:04 ` Dave Täht
@ 2011-03-20 23:14 ` Jonathan Morton
2011-03-20 23:19 ` Dave Täht
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-20 23:14 UTC (permalink / raw)
To: Dave Täht; +Cc: bloat
On 21 Mar, 2011, at 1:04 am, Dave Täht wrote:
>>> We're not testing interplanetary networks here, (rather, an
>>> artificially induced one extending out well beyond the moon!) but it
>>> bears a little thinking about.
>>
>> Perhaps an idea for presenting bufferbloat visually? Draw a picture of
>> the space around the Earth, with circles around the earth whose
>> diameters are proportional to bufferbloat-induced equivalent RTT
>> across different ISP links, or different consumer hardware,
>> etc. "Bufferbloat puts New York on the far side of the moon!" might be
>> a tagline to get people's attention ;)
>
> That is similar to one of the ideas I had while prototyping the cosmic
> background bufferbloat detector.
Come with us.
Leave your wheat fields.
Leave your back-street shops, your fishing boats.
Leave your offices in the tall skyscrapers.
Leave all that is routine and commonplace,
And come with us at once.
Come with us,
To where Man has never been,
But where he will go,
As certain as the passing of time.
Come with us to the Moon.
The rocket is waiting...
-- The Orb, "The Passing Of Time", Orblivion
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 23:14 ` Jonathan Morton
@ 2011-03-20 23:19 ` Dave Täht
2011-03-20 23:23 ` Dave Täht
0 siblings, 1 reply; 70+ messages in thread
From: Dave Täht @ 2011-03-20 23:19 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
Jonathan Morton <chromatix99@gmail.com> writes:
> On 21 Mar, 2011, at 1:04 am, Dave Täht wrote:
>
>>>> We're not testing interplanetary networks here, (rather, an
>>>> artificially induced one extending out well beyond the moon!) but it
>>>> bears a little thinking about.
>>>
>>> Perhaps an idea for presenting bufferbloat visually? Draw a picture of
>>> the space around the Earth, with circles around the earth whose
>>> diameters are proportional to bufferbloat-induced equivalent RTT
>>> across different ISP links, or different consumer hardware,
>>> etc. "Bufferbloat puts New York on the far side of the moon!" might be
>>> a tagline to get people's attention ;)
>>
>> That is similar to one of the ideas I had while prototyping the cosmic
>> background bufferbloat detector.
>
> Come with us.
> Leave your wheat fields.
> Leave your back-street shops, your fishing boats.
> Leave your offices in the tall skyscrapers.
> Leave all that is routine and commonplace,
> And come with us at once.
>
> Come with us,
> To where Man has never been,
> But where he will go,
> As certain as the passing of time.
>
> Come with us to the Moon.
> The rocket is waiting...
>
> -- The Orb, "The Passing Of Time", Orblivion
Pass the kool-aid, will ya? I hear owsley left a legacy.
http://articles.latimes.com/2011/mar/15/local/la-me-owsley-stanley-20110315
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 23:19 ` Dave Täht
@ 2011-03-20 23:23 ` Dave Täht
0 siblings, 0 replies; 70+ messages in thread
From: Dave Täht @ 2011-03-20 23:23 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
d@taht.net (Dave Täht) writes:
> Jonathan Morton <chromatix99@gmail.com> writes:
>
>> On 21 Mar, 2011, at 1:04 am, Dave Täht wrote:
>>> That is similar to one of the ideas I had while prototyping the cosmic
>>> background bufferbloat detector.
>>
>> Come with us.
>> Leave your wheat fields.
>> Leave your back-street shops, your fishing boats.
>> Leave your offices in the tall skyscrapers.
>> Leave all that is routine and commonplace,
>> And come with us at once.
>>
>> Come with us,
>> To where Man has never been,
>> But where he will go,
>> As certain as the passing of time.
>>
>> Come with us to the Moon.
>> The rocket is waiting...
>>
>> -- The Orb, "The Passing Of Time", Orblivion
>
> Pass the kool-aid, will ya? I hear owsley left a legacy.
>
> http://articles.latimes.com/2011/mar/15/local/la-me-owsley-stanley-20110315
In so naming the project a cosmic background bufferbloat detector I was
referring to this project:
http://en.wikipedia.org/wiki/Discovery_of_cosmic_microwave_background_radiation
That said, redoing the horn in that in day-glo colors strikes me as a
good logo idea, if I ever get the time to get back on the project...
/me passes back the kool-aid
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 22:50 ` Dave Täht
2011-03-20 22:55 ` grenville armitage
@ 2011-03-20 22:58 ` Jonathan Morton
1 sibling, 0 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-20 22:58 UTC (permalink / raw)
To: Dave Täht; +Cc: Stephen Hemminger, bloat
On 21 Mar, 2011, at 12:50 am, Dave Täht wrote:
>>>> 0) Buffering more than 1 second of data is always unacceptable.
>
> Well, in the case of the DTN, it's required.
>
> We're not testing interplanetary networks here, (rather, an artificially
> induced one extending out well beyond the moon!) but it bears a little
> thinking about.
It appears that most practical solutions for a DTN involve explicit store-and-forward techniques, and these would be irrelevant at the IP routing level.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-20 22:45 ` Jonathan Morton
2011-03-20 22:50 ` Dave Täht
@ 2011-03-21 1:28 ` david
2011-03-21 1:56 ` Wesley Eddy
1 sibling, 1 reply; 70+ messages in thread
From: david @ 2011-03-21 1:28 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Stephen Hemminger, bloat
On Mon, 21 Mar 2011, Jonathan Morton wrote:
> On 21 Mar, 2011, at 12:18 am, david@lang.hm wrote:
>
>>> 0) Buffering more than 1 second of data is always unacceptable.
>>
>> what about satellite links? my understanding is that the four round
>> trips to geosync orbit (request up, down, reply up down) result in
>> approximatly 1 sec round trip.
>
> That is true, but it doesn't require more than a full second of
> buffering, just lots of FEC to avoid packet loss on the link. At those
> timescales, you want the flow to look smooth, not bursty. Bursty is
> normal at 100ms timescales.
>
> What I've heard is that most consumer satellite links use split-TCP
> anyway (proxy boxes at each end) thus relieving the Internet at large
> from coping with an unusual problem. However, it also seems likely that
> backbone satellite links exist which do not use this technique. I heard
> something about South America, maybe?
I've heard that they do proxy boxes at each end for common protocols like
HTTP, but they can't do so for other protocols (think ssh for example)
> Anyway, with a 1-second RTT, the formula comes out to max 1 second of
> buffering because of the clamping.
and what if you have a 1 second satellite link plus 'normal internet
latency', or worse, both ends are on a satellite link, giving you a
2-second+ round trip time?
if you don't have large enough buffers to handle this, what happens?
David Lang
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-21 1:28 ` david
@ 2011-03-21 1:56 ` Wesley Eddy
0 siblings, 0 replies; 70+ messages in thread
From: Wesley Eddy @ 2011-03-21 1:56 UTC (permalink / raw)
To: bloat
On 3/20/2011 9:28 PM, david@lang.hm wrote:
> On Mon, 21 Mar 2011, Jonathan Morton wrote:
>
>> On 21 Mar, 2011, at 12:18 am, david@lang.hm wrote:
>>
>>>> 0) Buffering more than 1 second of data is always unacceptable.
>>>
>>> what about satellite links? my understanding is that the four round
>>> trips to geosync orbit (request up, down, reply up down) result in
>>> approximatly 1 sec round trip.
>>
>> That is true, but it doesn't require more than a full second of
>> buffering, just lots of FEC to avoid packet loss on the link. At those
>> timescales, you want the flow to look smooth, not bursty. Bursty is
>> normal at 100ms timescales.
>>
>> What I've heard is that most consumer satellite links use split-TCP
>> anyway (proxy boxes at each end) thus relieving the Internet at large
>> from coping with an unusual problem. However, it also seems likely
>> that backbone satellite links exist which do not use this technique. I
>> heard something about South America, maybe?
>
> I've heard that they do proxy boxes at each end for common protocols
> like HTTP, but they can't do so for other protocols (think ssh for example)
>
>> Anyway, with a 1-second RTT, the formula comes out to max 1 second of
>> buffering because of the clamping.
>
> and what if you have a 1 second satellite link plus 'normal internet
> latency', or worse, both ends are on a satellite link, giving you a
> 2-second+ round trip time?
>
> if you don't have large enough buffers to handle this, what happens?
>
Propagation delay to a geosynchronous relay is ~125 ms. Round-trip
propagation delay contributes ~500 ms, so it isn't quite as bad as you
think, though still long.
Many tricks are played with accelerator gateways to improve bulk
transfer throughput for TCP users (see e.g. RFC 3135).
There may be a challenge in debloating the devices that support such
links, as their buffers and the functions they serve optimize other
metrics (e.g. low packet loss rate, bulk transfer throughput, etc.).
--
Wes Eddy
MTI Systems
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion ofTCPflavours - timestamps?
2011-03-16 22:22 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
2011-03-16 23:38 ` richard
2011-03-17 12:05 ` Fred Baker
@ 2011-03-18 18:27 ` Richard Scheffenegger
2 siblings, 0 replies; 70+ messages in thread
From: Richard Scheffenegger @ 2011-03-18 18:27 UTC (permalink / raw)
To: bloat
Here are some links to papers the members of this group may be interested:
[PDF] Analysis of DCTCP: Stability, Convergence, and FairnessM Alizadeh, A
Javanmard.
ABSTRACT Cloud computing, social networking and information net- works (for
search, news
feeds, etc) are driving interest in the deployment of large data centers.
TCP is the domi- nant
Layer 3 transport protocol in these networks. How- ever, the operating
conditions-very ...
http://www.stanford.edu/~balaji/papers/11analysisof.pdf
[PDF] An Investigation into Data Center Congestion Control with ECNRR
Stewart, M Tüxen. - 2011
Page 1. An Investigation into Data Center Congestion Control with ECN
Randall R. Stewart ?
Michael Tüxen ? George V. Neville-Neil ? February 24, 2011 Abstract Data
centers pose a unique
set of demands on any transport proto- col being used within them. ...
http://www.bsdcan.org/2011/schedule/attachments/151_dc_cc.pdf
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-15 23:46 ` Dave Täht
2011-03-16 0:49 ` Jonathan Morton
@ 2011-03-16 22:07 ` Richard Scheffenegger
2011-03-17 0:10 ` Jonathan Morton
1 sibling, 1 reply; 70+ messages in thread
From: Richard Scheffenegger @ 2011-03-16 22:07 UTC (permalink / raw)
To: Dave "Täht", Eric Dumazet; +Cc: Stephen Hemminger, bloat
----- Original Message -----
> It would be good to know what 10Gbps hw was capable of pushing more
> smarts (such as nRED) further down into the hardware itself, this may
> inform future software abstractions and future hardware designs.
>
IEEE 802.1Qau is becoming available with CNA (converged network adapter),
10G NIC that also have FCoE hardware on-board.
This throttles individual flows - provided the entire end-to-end network
also supports 802.1Qau (Quantizied Congestion Notification), as a decent
enough flow granularity.
Deploying this in core networks will require a forklift upgrade, as current
widely deployed 10G switches don't support QCN.
On another page, cheap, widely deployed Broadcom L3 ASICs (found in the
low-end hardware of major network vendors) support RED in hardware -
however, OEM firmware typically doesn't allow the configuration of the full
feature set. (DCTCP as a TCP-based QCN-like was demonstrated with custom
broadcom firmware doing the ECN marking based, soley based on current queue
depth).
Regards,
Richard
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-16 22:07 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
@ 2011-03-17 0:10 ` Jonathan Morton
0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-17 0:10 UTC (permalink / raw)
To: Richard Scheffenegger; +Cc: Stephen Hemminger, bloat
On 17 Mar, 2011, at 12:07 am, Richard Scheffenegger wrote:
> IEEE 802.1Qau is becoming available with CNA (converged network adapter), 10G NIC that also have FCoE hardware on-board.
>
> This throttles individual flows - provided the entire end-to-end network also supports 802.1Qau (Quantizied Congestion Notification), as a decent enough flow granularity.
>
> Deploying this in core networks will require a forklift upgrade, as current widely deployed 10G switches don't support QCN.
Well that's an engineering failure - unless, of course, non-TCP/IP traffic is predominant in the environments these are put into. In any case, this is not a solution for the Internet.
> On another page, cheap, widely deployed Broadcom L3 ASICs (found in the low-end hardware of major network vendors) support RED in hardware - however, OEM firmware typically doesn't allow the configuration of the full feature set. (DCTCP as a TCP-based QCN-like was demonstrated with custom broadcom firmware doing the ECN marking based, soley based on current queue depth).
Ah, market segmentation, how do I hate thee? Let me count the ways.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 22:01 ` Jonathan Morton
2011-03-15 22:19 ` Stephen Hemminger
@ 2011-03-16 0:47 ` John W. Linville
2011-03-16 20:07 ` Jim Gettys
1 sibling, 1 reply; 70+ messages in thread
From: John W. Linville @ 2011-03-16 0:47 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Wed, Mar 16, 2011 at 12:01:41AM +0200, Jonathan Morton wrote:
>
> On 15 Mar, 2011, at 10:51 pm, John W. Linville wrote:
>
> >>> If you don't throttle _both_
> >>> the _enqueue_ and the _dequeue_, then you could be keeping a nice,
> >>> near-empty tx queue on the host and still have a long, bloated queue
> >>> building at the device.
> >>
> >> Don't devices at least let you query how full their queue is?
> >
> > I suppose it depends on what you mean? Presumably drivers know that,
> > or at least can figure it out. The accuracy of that might depend on
> > the exact mechanism, how often the tx rings are replinished, etc.
> >
> > However, I'm not aware of any API that would let something in the
> > stack (e.g. a qdisc) query the device driver for the current device
> > queue depth. At least, I don't think Linux has one -- do other
> > kernels/stacks provide that?
>
> I get the impression that eBDP is supposed to work relatively
> close to the device driver, rather than in the core network stack.
> As such it's not a qdisc, but instead manages a parameter used by
> a well-behaved device driver. (The number of well-behaved device
> drivers appears to be small at present.)
If you count mac80211 as part of the "driver", what is between the
qdisc and the "driver"? I wouldn't consider the bottom of the qdisc
as the core of the stack.
I would really like to see eBDP (or ALT or A* or something similar)
implemented in a single place if possible, rather than reimplemented
(perhaps poorly) in a series of drivers. I know Felix and others think
that 802.11n aggregation makes that impossible, but I'm inclined to
think that is still at least partly from a bias towards throughput
at the expense of latency -- I could be wrong! :-)
Someone suggested that perhaps eBDP/ALT/A*/whatever could be
implemented as a library for drivers to call -- that still requires
individual driver hacking, but maybe it is reasonable? I'd have to
see the code.
> So there's a queue in the qdisc, and there's a queue in the hardware,
> and eBDP tries to make the latter smaller when possible, allowing the
> former (which is potentially much more intelligent) to do more work.
So that is a possible implementation -- limit the tx queue length
in the driver, similar to manually doing 'ethtool -G'. But on the
other hand, you can achieve a similar effect by throttling the input
to the driver/hardware tx queue no matter how many hardware tx slots
are physically allowed.
<snip>
> Knowing the occupancy of the hardware buffer is useful if the
> size of the buffer cannot be changed, because it is then possible
> to simply decline to fill the buffer more than a certain amount.
> If you can also assume that packets are sent in order of submission,
> or by some other easy rule, then you can also infer the time that
> the oldest packet has spent there, and use it to tune the future
> occupancy limit even if you can't cancel the old packet.
Yes, I think we agree.
> Cancelling old packets is potentially desirable because it allows
> TCPs and applications to retransmit (which they will do anyway)
> without fear of exacerbating a wireless congestion collapse. I do
> appreciate that not all hardware will support this, however, and it
> should be totally unnecessary for wired links.
As other said later in the queue, I think trying to reach down into the
driver/hardware to cancel an already posted packet would be difficult,
slow, etc.
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 0:47 ` [Bloat] Random idea in reaction to all the discussion of TCP flavours " John W. Linville
@ 2011-03-16 20:07 ` Jim Gettys
2011-03-17 2:26 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: Jim Gettys @ 2011-03-16 20:07 UTC (permalink / raw)
To: John W. Linville, bloat
On Tue, Mar 15, 2011 at 8:47 PM, John W. Linville
<linville@tuxdriver.com> wrote:
>
> If you count mac80211 as part of the "driver", what is between the
> qdisc and the "driver"? I wouldn't consider the bottom of the qdisc
> as the core of the stack.
>
> I would really like to see eBDP (or ALT or A* or something similar)
> implemented in a single place if possible, rather than reimplemented
> (perhaps poorly) in a series of drivers. I know Felix and others think
> that 802.11n aggregation makes that impossible, but I'm inclined to
> think that is still at least partly from a bias towards throughput
> at the expense of latency -- I could be wrong! :-)
I'm told by our cell phone wireless people there are similar concerns
for the wireless cellular technologies; in their case I encouraged
them strongly today to join the mailing list. Let's also distinguish
between "can't do it with today's broken hardware" (of which there is
almost certainly an ample supply), and having a solution that can work
when the hardware is cooperative.
"(2) Less well known to non-cellular folks is the fact that the *full
buffer* packet data cell throughput of EVDO or HSPA is noticeably
higher than that with *bursty* traffic at decent load. Part of it is
due to (1) above but a second part is due to the fact that each
cellular link is made more efficient through multi-user diversity
gain that exploits
fading channel peaks of independent users while still utilizing the
link fully provided there are enough users - my point is, such gains
dilute when a user enjoying a channel peak doesn't have data waiting
in his
buffer at that time... It helps to keep users' buffers non-empty from
this perspective..."
As I pointed out to them, it may (or may not) be that things just work
out well; if the channel is busy, you'll have more time for
aggregation of packets to naturally occur. We don't need to run the
channel efficiently when the channel isn't saturated. Whenever the
air isn't busy, it doesn't matter if we don't bother to aggregate.
Who knows what those driver interfaces look like at this point? Has
anyone tried to grok what's in Android for those drivers (if enough of
the source is available to them to be useful...)
- Jim
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 20:07 ` Jim Gettys
@ 2011-03-17 2:26 ` Jonathan Morton
2011-03-17 18:22 ` Rick Jones
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-17 2:26 UTC (permalink / raw)
To: Jim Gettys; +Cc: bloat
On 16 Mar, 2011, at 10:07 pm, Jim Gettys wrote:
> each cellular link is made more efficient through multi-user diversity gain that exploits fading channel peaks of independent users while still utilizing the link fully provided there are enough users - my point is, such gains dilute when a user enjoying a channel peak doesn't have data waiting in his buffer at that time... It helps to keep users' buffers non-empty from this perspective..."
>
> As I pointed out to them, it may (or may not) be that things just work
> out well; if the channel is busy, you'll have more time for
> aggregation of packets to naturally occur. We don't need to run the
> channel efficiently when the channel isn't saturated. Whenever the
> air isn't busy, it doesn't matter if we don't bother to aggregate.
So, what they're saying is that if I'm on a moving train (with varying relationships between steel cages, 25kV pylons and granite rockfaces - no exaggeration in Helsinki), there are times when my reception drops out, and the tower will use those times to preferentially serve everyone else, returning to serve me when my personal reception conditions improve. I can grok that.
What's not entirely clear is the timescale for this. It's most likely to be seconds or milliseconds, though I'm not sure which. But it could conceivably be sub-millisecond (which would look like random packet loss), and the amount of buffering I've observed for 3G suggests that provision for minutes of reduced connectivity is present - which is far too much.
It is of course worth remembering that a few seconds' worth of buffering at the network's highest theoretical speed rating is most likely several minutes' worth of solid GPRS traffic. If my train goes behind a rock and there's only a 2G signal on the other side of it, that's a serious problem.
There is of course another use-case: people who use "wireless broadband" while sitting still, whether that's in a cafe, a hotel room, or at home. I have been known to use my tethered 3G phone as a primary Internet connection, when for whatever reason a wired link was not available. In that case I could put the phone up on the windowsill, well above street level, and for the most part would expect a clean connection to the tower. Under those circumstances I saw practically no random packet loss, but until I tweaked my setup in some non-standard ways the connection was often very difficult to use because the latency would grow out of control.
It is thus quite understandable why Apple disables updating or downloading particularly large 'apps' over the air.
For the benefit of the 3G folks, here are some helpful axioms to discuss:
1) Buffering more than a couple of seconds of data (without employing AQM) is unhelpful, and will actually increase network load without increasing goodput. Unless there is a compelling reason, you should try to buffer less than a second.
This is because congestion and packet-loss information takes longer to influence existing flows, and new flows are more difficult to start. After about 3 seconds of no information, most TCPs will start retransmission - regardless of whether the packets were physically lost, or are simply languishing in a multi-megabyte buffer somewhere.
2) Buffering less than some threshold of data causes link under-utilisation. The threshold depends on link data rate, the length of pauses due to propagation conditions, and the round-trip delay, and to a lesser extent on the congestion control algorithm employed by the sending host.
2a) On a lightly loaded network link under-utilisation does not matter, as the buffer will often become empty anyway, so you should aim to minimise latency. On a heavily loaded one link utilisation does matter.
3) The number of packets that "a couple of seconds" represents will vary according to the link speed - which in a wireless network is strongly time-varying.
So even if you know the network, you cannot set the buffer length to a fixed value. This is what eBDP is designed to cope with. Because the link bandwidth is also different on a per-user basis, you'll need to vary the buffer size for each user. But see below...
4) AQM can let you use oversized buffers without destroying the user experience.
The specific features required are a) communication of congestion state to the client and server, eg. via packet drop or (preferably) ECN; and b) re-ordering of packets so that one flow does not starve another, eg. DNS packets can overtake HTTP packets, and packets belonging to a light user can bypass those belonging to a heavy user.
RED is the traditional solution to requirement a), but SFB may be a better solution. SFQ is a good way to implement b), and should dovetail nicely with SFB.
Axiom 4 is also particularly helpful to wired broadband providers, who have been known to complain about a relatively small number of heavy users who (allegedly) manage to starve lighter users of bandwidth. Their existing solution seems to be to "charge and discourage" the heavy users. The more intelligent solution is to make the network pay attention to the lighter users, while allowing the heavy users to occupy a fair share of the spare capacity. The intelligent solution makes for better PR.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-17 2:26 ` Jonathan Morton
@ 2011-03-17 18:22 ` Rick Jones
2011-03-17 21:50 ` Jonathan Morton
0 siblings, 1 reply; 70+ messages in thread
From: Rick Jones @ 2011-03-17 18:22 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Thu, 2011-03-17 at 04:26 +0200, Jonathan Morton wrote:
> For the benefit of the 3G folks, here are some helpful axioms to discuss:
>
> 1) Buffering more than a couple of seconds of data (without employing
> AQM) is unhelpful, and will actually increase network load without
> increasing goodput. Unless there is a compelling reason, you should
> try to buffer less than a second.
>
> This is because congestion and packet-loss information takes longer to
> influence existing flows, and new flows are more difficult to start.
> After about 3 seconds of no information, most TCPs will start
> retransmission - regardless of whether the packets were physically
> lost, or are simply languishing in a multi-megabyte buffer somewhere.
So initialRTO is specced currently to be 3 seconds, with a small but
non-trivial effort under way to reduce that, but once established
connections have a minimum RTO of less than or equal to a second don't
they?
rick jones
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-17 18:22 ` Rick Jones
@ 2011-03-17 21:50 ` Jonathan Morton
2011-03-17 22:20 ` Rick Jones
0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-17 21:50 UTC (permalink / raw)
To: rick.jones2; +Cc: bloat
On 17 Mar, 2011, at 8:22 pm, Rick Jones wrote:
>> For the benefit of the 3G folks, here are some helpful axioms to discuss:
>>
>> 1) Buffering more than a couple of seconds of data (without employing
>> AQM) is unhelpful, and will actually increase network load without
>> increasing goodput. Unless there is a compelling reason, you should
>> try to buffer less than a second.
>>
>> This is because congestion and packet-loss information takes longer to
>> influence existing flows, and new flows are more difficult to start.
>> After about 3 seconds of no information, most TCPs will start
>> retransmission - regardless of whether the packets were physically
>> lost, or are simply languishing in a multi-megabyte buffer somewhere.
>
> So initialRTO is specced currently to be 3 seconds, with a small but
> non-trivial effort under way to reduce that, but once established
> connections have a minimum RTO of less than or equal to a second don't
> they?
If the RTT they measure is low enough, then yes. If the queues lengthen, the measured RTT goes up and so does the RTO, once the connection is established.
But the *initial* RTO is the important one for unmanaged queue sizing, because that determines whether a new connection can be started without retransmissions, all else functioning correctly of course. There is no way to auto-tune that.
Note also that with AQM that can re-order packets, the length of the bulk queue starts to matter much less, because the SYN/ACK packets can bypass most of the traffic. In that case the RTT measured by the existing bulk flows will be higher than the latency seen by new and interactive flows.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-17 21:50 ` Jonathan Morton
@ 2011-03-17 22:20 ` Rick Jones
2011-03-17 22:56 ` Jonathan Morton
2011-03-18 5:51 ` Eric Dumazet
0 siblings, 2 replies; 70+ messages in thread
From: Rick Jones @ 2011-03-17 22:20 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Thu, 2011-03-17 at 23:50 +0200, Jonathan Morton wrote:
> On 17 Mar, 2011, at 8:22 pm, Rick Jones wrote:
> > So initialRTO is specced currently to be 3 seconds, with a small but
> > non-trivial effort under way to reduce that, but once established
> > connections have a minimum RTO of less than or equal to a second don't
> > they?
>
> If the RTT they measure is low enough, then yes. If the queues
> lengthen, the measured RTT goes up and so does the RTO, once the
> connection is established.
Right. I should have been more explicit about "You know it won't
retransmit any sooner than N." (for some, changing value of N :)
> But the *initial* RTO is the important one for unmanaged queue sizing,
> because that determines whether a new connection can be started
> without retransmissions, all else functioning correctly of course.
> There is no way to auto-tune that.
>
> Note also that with AQM that can re-order packets, the length of the
> bulk queue starts to matter much less, because the SYN/ACK packets can
> bypass most of the traffic. In that case the RTT measured by the
> existing bulk flows will be higher than the latency seen by new and
> interactive flows.
I would think that unless the rest of the segments of the connection
will also bypass most of the traffic, the SYN or SYN|ACK should not
bypass - to do so will give the TCP connection a low, unrealistic
initial estimate of the RTT. Given the recent change in Linux upstream
to go to cwnd_init of 10 segments, and the prospect of other stacks
following that lead in implementing the draft RFC, if there is a big
slow queue of traffic that the data segments will not bypass, it would
seem better to have the SYN or SYN|ACK get delayed and retransmitted to
get the cwnd down.
SYN and SYN|ACK segments should not receive special treatment beyond
what data segments for the same connection would get.
rick
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-17 22:20 ` Rick Jones
@ 2011-03-17 22:56 ` Jonathan Morton
2011-03-18 1:36 ` Justin McCann
2011-03-18 5:51 ` Eric Dumazet
1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-17 22:56 UTC (permalink / raw)
To: rick.jones2; +Cc: bloat
On 18 Mar, 2011, at 12:20 am, Rick Jones wrote:
>>> So initialRTO is specced currently to be 3 seconds, with a small but
>>> non-trivial effort under way to reduce that, but once established
>>> connections have a minimum RTO of less than or equal to a second don't
>>> they?
>>
>> If the RTT they measure is low enough, then yes. If the queues
>> lengthen, the measured RTT goes up and so does the RTO, once the
>> connection is established.
>
> Right. I should have been more explicit about "You know it won't
> retransmit any sooner than N." (for some, changing value of N :)
I think there is a minimum value, on the order of 100ms - I don't know precisely.
>> But the *initial* RTO is the important one for unmanaged queue sizing,
>> because that determines whether a new connection can be started
>> without retransmissions, all else functioning correctly of course.
>> There is no way to auto-tune that.
>>
>> Note also that with AQM that can re-order packets, the length of the
>> bulk queue starts to matter much less, because the SYN/ACK packets can
>> bypass most of the traffic. In that case the RTT measured by the
>> existing bulk flows will be higher than the latency seen by new and
>> interactive flows.
>
> I would think that unless the rest of the segments of the connection
> will also bypass most of the traffic, the SYN or SYN|ACK should not
> bypass - to do so will give the TCP connection a low, unrealistic
> initial estimate of the RTT.
> SYN and SYN|ACK segments should not receive special treatment beyond
> what data segments for the same connection would get.
I was thinking of SFQ, which doesn't discriminate based on TCP flags, only on addresses and ports.
With that said, while a low initial estimate of RTT is bad for calculating RTO, it is not so bad for the rest of the congestion control system. Remember that a major problem with Vegas is that it can grossly overestimate the optimal RTT if, because the queues are already full, it never sees the real propagation time.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-17 22:56 ` Jonathan Morton
@ 2011-03-18 1:36 ` Justin McCann
0 siblings, 0 replies; 70+ messages in thread
From: Justin McCann @ 2011-03-18 1:36 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Thu, Mar 17, 2011 at 6:56 PM, Jonathan Morton <chromatix99@gmail.com> wrote:
>
> On 18 Mar, 2011, at 12:20 am, Rick Jones wrote:
>
>>>> So initialRTO is specced currently to be 3 seconds, with a small but
>>>> non-trivial effort under way to reduce that, but once established
>>>> connections have a minimum RTO of less than or equal to a second don't
>>>> they?
>>>
>>> If the RTT they measure is low enough, then yes. If the queues
>>> lengthen, the measured RTT goes up and so does the RTO, once the
>>> connection is established.
>>
>> Right. I should have been more explicit about "You know it won't
>> retransmit any sooner than N." (for some, changing value of N :)
>
> I think there is a minimum value, on the order of 100ms - I don't know precisely.
The current TCP_MIN_RTO is 200ms on Linux
(http://lxr.linux.no/linux+v2.6.38/include/net/tcp.h#L124). Apparently
it's 30ms on FreeBSD, if this old thread still holds true
(http://www.postel.org/pipermail/end2end-interest/2004-November/004402.html).
From the DCTCP paper, it looks like Windows has a 300 ms RTO.
There were several papers the last couple of years about the minimum
RTO causing low throughput in data center with very small nominal
RTTs. The DCTCP paper has been mentioned here several times already;
it has a lot of relevant discussion about buffer sizing, latency, AQM,
and the difficulties of tuning RED.
Justin
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-17 22:20 ` Rick Jones
2011-03-17 22:56 ` Jonathan Morton
@ 2011-03-18 5:51 ` Eric Dumazet
1 sibling, 0 replies; 70+ messages in thread
From: Eric Dumazet @ 2011-03-18 5:51 UTC (permalink / raw)
To: rick.jones2; +Cc: bloat
Le jeudi 17 mars 2011 à 15:20 -0700, Rick Jones a écrit :
> I would think that unless the rest of the segments of the connection
> will also bypass most of the traffic, the SYN or SYN|ACK should not
> bypass - to do so will give the TCP connection a low, unrealistic
> initial estimate of the RTT. Given the recent change in Linux upstream
> to go to cwnd_init of 10 segments, and the prospect of other stacks
> following that lead in implementing the draft RFC, if there is a big
> slow queue of traffic that the data segments will not bypass, it would
> seem better to have the SYN or SYN|ACK get delayed and retransmitted to
> get the cwnd down.
>
> SYN and SYN|ACK segments should not receive special treatment beyond
> what data segments for the same connection would get.
Agreed. RFC 3168 (The Addition of Explicit Congestion Notification (ECN)
to IP) discuss of this topic, and goes further : SYN packets, or
retransmitted packets dont have ECT marker : They have higher
probabilities to be dropped by ECN routers than marked and pass the
congestion point.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 10:36 [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps? Jim Gettys
2011-03-15 14:40 ` Jim Gettys
@ 2011-03-15 16:34 ` Jonathan Morton
2011-03-15 18:13 ` Stephen Hemminger
2011-03-16 5:41 ` Fred Baker
` (2 subsequent siblings)
4 siblings, 1 reply; 70+ messages in thread
From: Jonathan Morton @ 2011-03-15 16:34 UTC (permalink / raw)
To: Jim Gettys; +Cc: bloat
On 15 Mar, 2011, at 12:36 pm, Jim Gettys wrote:
> 0) the buffers can be filled by any traffic, not necessarily your own (in fact, often that of others), so improving your behaviour, while admirable, doesn't mean you or others sharing any piece of your won't suffer.
> 1) the bloated buffers are already all over, and updating hosts is often a very slow process.
> 2) suffering from this bloat is due to the lack of signalling congestion to congestion avoiding protocols.
I actually agree. I was looking at the various TCPs to see how well they would respond to ECN, and ECN can only be created by some kind of AQM, even if it's a braindead version of RED.
Previously, I pointed out that there is no point in buffering more than about 1-2 seconds of data, full stop - and that includes wireless and very-thin links (eg. analogue modems). In most cases the appropriate buffering level is less still. The fundamental limit is the initial-RTO at 2.5-3 seconds, which happens to correspond to a notional lunar bounce; once the delay exceeds that, you start to get symptoms of catastrophic runaway, from both automatic and human feedback sources.
The main concern actually seems to be "can we turn on ECN safely" together with the related "how do we convince everyone to use ECN". Once ECN is widely deployed, we can more easily convince the routing people that using it might be a good idea, which is a gateway to AQM for people who are otherwise skeptical about AQM. It is clear that concerns still exist for ECN deployment, but that these might only be noticed and fixed by relevant people once deployment actually happens. Chicken and egg... Anecdotally, I have ECN turned on for most of my traffic, I suspect that not all servers I communicate with respond to it, but I don't see any nasty problems.
The other major concern is consumer CPE. Most of this is made by a relatively small number of manufacturers, who are currently in the process of IPv6 transition. I seriously think we should make recommendations to the IPv6 testing guys. Manufacturers will realise that without a label proclaiming IPv6 support on the box, they will have a hard time selling stuff in the near future. They do seem to have a test which verifies that ECN isn't cut away by a router, but this isn't very precise about ECN behaviour under load.
> So the question I have is whether there is some technique whereby monitoring the timestamps that may already be present in the traffic (and knowing what "sane" RTT's are) that we can start marking traffic in time prevent the worst effects of bloating buffers?
The timestamps are not globally valid. You would need to observe each individual flow and maintain some state about them. What's more, you can't do this using stochastic flow discrimination as some AQMs do to avoid state-related starvation.
Further, I don't think there is even any specification of how quickly the timestamps should advance - they are really there to disambiguate the RTT *to the hosts* in a retransmission-rich environment. One host might advance it every centisecond, another every millisecond, another might increment it using a global packet counter rather than a timer. The only thing the timestamps can't do is go backwards, so that they can serve a secondary function related to LFNs, large windows, and stale packets from previous connections.
So timestamps give the host generating them a (much) better idea of RTT (which is good news for delay-based and hybrid algorithms) though it can still be fooled about minRTT if the queue starts full. They can't realistically be used by routers without making some serious and fragile assumptions.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 16:34 ` Jonathan Morton
@ 2011-03-15 18:13 ` Stephen Hemminger
0 siblings, 0 replies; 70+ messages in thread
From: Stephen Hemminger @ 2011-03-15 18:13 UTC (permalink / raw)
To: Jonathan Morton; +Cc: bloat
On Tue, 15 Mar 2011 18:34:10 +0200
Jonathan Morton <chromatix99@gmail.com> wrote:
>
> On 15 Mar, 2011, at 12:36 pm, Jim Gettys wrote:
>
> > 0) the buffers can be filled by any traffic, not necessarily your own (in fact, often that of others), so improving your behaviour, while admirable, doesn't mean you or others sharing any piece of your won't suffer.
> > 1) the bloated buffers are already all over, and updating hosts is often a very slow process.
> > 2) suffering from this bloat is due to the lack of signalling congestion to congestion avoiding protocols.
>
> I actually agree. I was looking at the various TCPs to see how well they would respond to ECN, and ECN can only be created by some kind of AQM, even if it's a braindead version of RED.
>
> Previously, I pointed out that there is no point in buffering more than about 1-2 seconds of data, full stop - and that includes wireless and very-thin links (eg. analogue modems). In most cases the appropriate buffering level is less still. The fundamental limit is the initial-RTO at 2.5-3 seconds, which happens to correspond to a notional lunar bounce; once the delay exceeds that, you start to get symptoms of catastrophic runaway, from both automatic and human feedback sources.
>
> The main concern actually seems to be "can we turn on ECN safely" together with the related "how do we convince everyone to use ECN". Once ECN is widely deployed, we can more easily convince the routing people that using it might be a good idea, which is a gateway to AQM for people who are otherwise skeptical about AQM. It is clear that concerns still exist for ECN deployment, but that these might only be noticed and fixed by relevant people once deployment actually happens. Chicken and egg... Anecdotally, I have ECN turned on for most of my traffic, I suspect that not all servers I communicate with respond to it, but I don't see any nasty problems.
>
> The other major concern is consumer CPE. Most of this is made by a relatively small number of manufacturers, who are currently in the process of IPv6 transition. I seriously think we should make recommendations to the IPv6 testing guys. Manufacturers will realise that without a label proclaiming IPv6 support on the box, they will have a hard time selling stuff in the near future. They do seem to have a test which verifies that ECN isn't cut away by a router, but this isn't very precise about ECN behaviour under load.
>
> > So the question I have is whether there is some technique whereby monitoring the timestamps that may already be present in the traffic (and knowing what "sane" RTT's are) that we can start marking traffic in time prevent the worst effects of bloating buffers?
>
> The timestamps are not globally valid. You would need to observe each individual flow and maintain some state about them. What's more, you can't do this using stochastic flow discrimination as some AQMs do to avoid state-related starvation.
>
> Further, I don't think there is even any specification of how quickly the timestamps should advance - they are really there to disambiguate the RTT *to the hosts* in a retransmission-rich environment. One host might advance it every centisecond, another every millisecond, another might increment it using a global packet counter rather than a timer. The only thing the timestamps can't do is go backwards, so that they can serve a secondary function related to LFNs, large windows, and stale packets from previous connections.
>
> So timestamps give the host generating them a (much) better idea of RTT (which is good news for delay-based and hybrid algorithms) though it can still be fooled about minRTT if the queue starts full. They can't realistically be used by routers without making some serious and fragile assumptions.
>
> - Jonathan
Although ECN is a good idea, it really doesn't help when the majority of machines have
ECN support disabled. Looks like it is disabled by default in Windows:
http://technet.microsoft.com/en-us/library/bb878127.aspx
--
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 10:36 [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps? Jim Gettys
2011-03-15 14:40 ` Jim Gettys
2011-03-15 16:34 ` Jonathan Morton
@ 2011-03-16 5:41 ` Fred Baker
2011-03-16 6:26 ` Jonathan Morton
2011-03-16 8:55 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
2011-03-16 9:04 ` [Bloat] Random idea in reaction to all the discussion of TCP flavours " BeckW
4 siblings, 1 reply; 70+ messages in thread
From: Fred Baker @ 2011-03-16 5:41 UTC (permalink / raw)
To: Jim Gettys; +Cc: bloat
From my perspective, the way to address this is to go back to first principles.
Van/Sally's congestion control algorithms, both the loss sensitive ones and ECN, depend heavily on Jain's work in the 1980's, his patent dated 1994, and the Frame Relay and CLNS work on FECN and "Congestion Experienced". Jain worked from a basic concept, that of a "knee" and a "cliff":
In concept:
A |
g| | "capacity" or "bottleneck bandwidth"
o| |- - - - - - - - - - - - - - - - - - - - -
o | .' \
d | .' knee \ cliff
p | .' \
u | .' \
t | .' \
| .' \
|.' \
+-----------------------------------------
window -->
More likely reality:
|
| "capacity" or "bottleneck bandwidth"
g |- - - - - - - - - - - - - - - - - - - - -
o | __..,--=
o | _,' `.
d | _,' |<-->| |<-->|,
p | ,' knee cliff \
u | ,' \
t | ,' `.._
| / `'-----'
+`-----------------------------------------
window -->
In short, the "knee" corresponds with the least window that maximizes goodput, and the cliff corresponds with the largest window value that maximizes goodput *plus*one* - the least window that results in a reduction of goodput. In real practice, it's more approximate than the theory might suggest, but the concept measurably works.
The question is how you measure whether you have reached it. Van's question of mean queue depth, from my perspective, is a good approximation but misses the point. You can think about the knee in another way; if it is least window that maximizes goodput, it is a point at which increasing your window is unlikely to increase your goodput. From the network's perspective, that is the point at which the queue always has something in it. There are lots of interesting ways to measure and state that, but it's what it comes down to. There is no more unused bandwidth at the bottleneck that adding another packet to the mix will take advantage of. At that point, it's a zero sum game: if one session manages to increase its share of the available capacity, some other session or set of sessions has to slow down.
From that perspective, one could argue that the simplest approach is to note the wall-clock time whenever a class or queue's depth falls below some threshold. If the class or queue goes longer than <mumble> without doing that, flagging an ECN CE or dropping a packet is probably in order. The thing is - you want <mumble> to be variable, so that your mark/drop rate can track a reasonable number. If you do that, the queue will remain somewhere in the neighborhood of the knee, and all of its sessions will as well. The question isn't "what is the magic mean queue depth for min-threshold to be set to"; it's "what mark/drop rate is sufficient to keep the queue somewhat shallow most of the time".
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 5:41 ` Fred Baker
@ 2011-03-16 6:26 ` Jonathan Morton
0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-16 6:26 UTC (permalink / raw)
To: Fred Baker; +Cc: bloat
On 16 Mar, 2011, at 7:41 am, Fred Baker wrote:
> The question isn't "what is the magic mean queue depth for min-threshold to be set to"; it's "what mark/drop rate is sufficient to keep the queue somewhat shallow most of the time".
And that's what BLUE (and SFB) tries to do. If the queue grows beyond some limit, it increments the marking rate. If it becomes empty, it decrements it. The result is an I-type control loop which has a reasonable chance of finding a steady state, if there is one. SFB does it on a per-flow basis.
I also thought of an analogy just now, as I was playing with my train simulator - many people like car analogies, I prefer railway ones. A router is like a bunch of railways meeting at a grand-union flying junction (typically implemented as a cloverleaf in the real world). The more expensive kinds are built with wider curves that let trains go fast even in the junction.
You can have lightweight, fast passenger trains, running loaded in both directions, and these are your VoIP traffic. Among them you might have heavy, slow freight trains, which just happen to weigh about 1500 tons each, but which run empty from the power station back to the mines. You don't want to be on a passenger train stuck behind a freight, so railways build extra tracks, either at intervals or continuously, to keep freight trains in and allow passenger trains past.
But a railway can only carry one train on each track at a time, and tracks are expensive. So sometimes a train still has to wait for another one. They can simply wait one behind the other at signals, or the railways might decide to put a marshalling yard in at the junction, so that many trains can be stored and rearranged for efficient prioritisation.
Meanwhile, a wireless network is more like a bunch of railways which meet at the cheapest, skinniest single-track junction the builders could devise - only one train at a time can use it, and sometimes they even fall off the rails and have to be crowbarred back on. It's a bit of a mess, but the junction is up in the mountains so it's very difficult to improve it.
The problem is that the railway company doesn't like to admit that the trains are slow and unreliable at this junction, so it employs lots of men with crowbars, and tries to avoid the subject when trains arrive several hours late. Yet people *do* notice, especially during the holiday season when *everyone* is travelling and the freight trains are chock-full of parcels - and the snow is just starting to fall in earnest, which freezes the points.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCPflavours - timestamps?
2011-03-15 10:36 [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps? Jim Gettys
` (2 preceding siblings ...)
2011-03-16 5:41 ` Fred Baker
@ 2011-03-16 8:55 ` Richard Scheffenegger
2011-03-16 9:04 ` [Bloat] Random idea in reaction to all the discussion of TCP flavours " BeckW
4 siblings, 0 replies; 70+ messages in thread
From: Richard Scheffenegger @ 2011-03-16 8:55 UTC (permalink / raw)
To: Jim Gettys, bloat
Unfortunately, timestamps are only reliable as long as there is no
forward-path loss.
Also, in order to use timestamps as input to any congestion control
strategy, you would want to exclude varations in the return channel from
your measurement (currently, timestamps, and the measurement of RTT is not
designed to do that; however, one-way delay variation measurements can be
done by means of timestamps, if the timestamp clock rate of the opposite
host is known locally. See Chirp-TCP and Mijra Kuehlewind/Bob Briscoes work
in that area. Last, there are security / integrity concerns. Some early
versions of BIC / CUBIC used timestamps directly as input signals. A
malicious stack can modify the timestamp, and thereby influence the reaction
of the senders congestion control algorithm (typically, to get an unfair
large share of the bandwidth for this session).
Linux addressed these problems by a number of fixes, which are however,
completely unfeasible in any other stack...
Mirja and I have written a very first sketch of tcp timestamp signalling
improvements, to enable the use of timestamps for such uses (addressing
integrity/security concerns, to make it a mroe reliable input signal into a
congestion control scheme, and exchanging the local tcp timestamp clock
rate, to enable one-way delay variance measurements; last, the improvements
also allow a more reliable signal during loss episodes, so that synergistic
methods to recovery more rapidely from lost retransmissions (1 rtt after the
2nd loss, instead of 2+ rtts as linux currently does; btw, no other stack
performs lost retransmission detection).
http://tools.ietf.org/html/draft-scheffenegger-tcpm-timestamp-negotiation-01
(again, this draft is still in a very early stage, only sketiching the
envisioned modifications and their potential use).
I invite everyone to join the TCPM session at IETF80 in prague, and
participate in the discussion.
Best regards,
Richard
----- Original Message -----
From: "Jim Gettys" <jg@freedesktop.org>
To: <bloat@lists.bufferbloat.net>
Sent: Tuesday, March 15, 2011 11:36 AM
Subject: [Bloat] Random idea in reaction to all the discussion of
TCPflavours - timestamps?
> I've been watching all the discussion of different TCP flavours with a
> certain amount of disquiet; this is not because I think working on
> improvements to TCP are bad; in fact, it is clear for wireless we could do
> with improvements in algorithms. I'm not trying to discourage work on
> this topic.
>
> My disquiet is otherwise; it is:
> 0) the buffers can be filled by any traffic, not necessarily your own
> (in fact, often that of others), so improving your behaviour, while
> admirable, doesn't mean you or others sharing any piece of your won't
> suffer.
> 1) the bloated buffers are already all over, and updating hosts is
> often a very slow process.
> 2) suffering from this bloat is due to the lack of signalling
> congestion to congestion avoiding protocols.
>
> OK, what does this mean? it means not that we should abandon improving
> TCP; but that doing so won't fundamentally eliminate bufferbloat
> suffering. It won't get us to a fundamentally different place, but only
> to marginally better places in terms of bufferbloating.
>
> The fundamental question, therefore, is how we start marking traffic
> during periods when the buffers fill (either by packet drop or by ECN), to
> provide the missing feedback in congestion avoiding protocol's servo
> system. No matter what flavour of protocol involved, they will then back
> off.
>
> Back last summer, to my surprise, when I asked Van Jacobson about my
> traces, he said all the required proof was already present in my traces,
> since modern Linux (and I presume other) operating systems had time stamps
> in them (the TCP timestamps option).
>
> Here's the off the wall idea. The buffers we observe are often many times
> (orders of magnitude) larger than any rational RTT.
>
> So the question I have is whether there is some technique whereby
> monitoring the timestamps that may already be present in the traffic (and
> knowing what "sane" RTT's are) that we can start marking traffic in time
> prevent the worst effects of bloating buffers?
> - Jim
>
>
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-15 10:36 [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps? Jim Gettys
` (3 preceding siblings ...)
2011-03-16 8:55 ` [Bloat] Random idea in reaction to all the discussion of TCPflavours " Richard Scheffenegger
@ 2011-03-16 9:04 ` BeckW
2011-03-16 22:48 ` Fred Baker
4 siblings, 1 reply; 70+ messages in thread
From: BeckW @ 2011-03-16 9:04 UTC (permalink / raw)
To: bloat
Jim Wrote:
> Back last summer, to my surprise, when I asked Van Jacobson about my
> traces, he said all the required proof was already present in my traces,
> since modern Linux (and I presume other) operating systems had time
> stamps in them (the TCP timestamps option).
>
> Here's the off the wall idea. The buffers we observe are often many
> times (orders of magnitude) larger than any rational RTT.
>
> So the question I have is whether there is some technique whereby
> monitoring the timestamps that may already be present in the traffic
> (and knowing what "sane" RTT's are) that we can start marking traffic in
> time prevent the worst effects of bloating buffers?
This reminds me of a related concept, using the TTL really as 'Time To Live' (in today's IP, it's more of a 'Remaining Hop Count). According to RfC 791, a router that buffers a packet by n seconds must decrease its TTL by n. I doubt that many routers implement this properly.
Wolfgang
_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 9:04 ` [Bloat] Random idea in reaction to all the discussion of TCP flavours " BeckW
@ 2011-03-16 22:48 ` Fred Baker
2011-03-16 23:23 ` Jonathan Morton
2011-03-17 8:34 ` BeckW
0 siblings, 2 replies; 70+ messages in thread
From: Fred Baker @ 2011-03-16 22:48 UTC (permalink / raw)
To: <BeckW@telekom.de>; +Cc: bloat
On Mar 16, 2011, at 2:04 AM, <BeckW@telekom.de> <BeckW@telekom.de> wrote:
> This reminds me of a related concept, using the TTL really as 'Time To Live' (in today's IP, it's more of a 'Remaining Hop Count). According to RfC 791, a router that buffers a packet by n seconds must decrease its TTL by n. I doubt that many routers implement this properly.
There is, of course, a fundamental bug in that, noted in RFC 970.
RFC 1812, which I edited, contains this text (that I didn't write):
In this specification, we have reluctantly decided to follow the
strong belief among the router vendors that the time limit
function should be optional. They argued that implementation of
the time limit function is difficult enough that it is currently
not generally done. They further pointed to the lack of
documented cases where this shortcut has caused TCP to corrupt
data (of course, we would expect the problems created to be rare
and difficult to reproduce, so the lack of documented cases
provides little reassurance that there haven't been a number of
undocumented cases).
The corresponding field in IPv6 (RFC 2460) picks up this bit of wisdom:
Hop Limit 8-bit unsigned integer. Decremented by 1 by
each node that forwards the packet. The packet
is discarded if Hop Limit is decremented to
zero.
I think that you will find that every router implements the hop limit properly, and implements the TTL as modified by RFC 1812 properly.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 22:48 ` Fred Baker
@ 2011-03-16 23:23 ` Jonathan Morton
2011-03-17 8:34 ` BeckW
1 sibling, 0 replies; 70+ messages in thread
From: Jonathan Morton @ 2011-03-16 23:23 UTC (permalink / raw)
To: Fred Baker; +Cc: bloat
On 17 Mar, 2011, at 12:48 am, Fred Baker wrote:
>> This reminds me of a related concept, using the TTL really as 'Time To Live' (in today's IP, it's more of a 'Remaining Hop Count). According to RfC 791, a router that buffers a packet by n seconds must decrease its TTL by n. I doubt that many routers implement this properly.
>
> There is, of course, a fundamental bug in that, noted in RFC 970.
>
> RFC 1812, which I edited, contains this text (that I didn't write):
>
> In this specification, we have reluctantly decided to follow the
> strong belief among the router vendors that the time limit
> function should be optional.
The major problem with the original TTL spec was that a router generally doesn't keep a packet for an integer number of seconds (at least, not in anything but the most ancient of hardware). If three separate routers each buffer for 350ms, that's about 1 second elapsed, but there is no way for the routers to indicate this to each other. Yet the TTL field was an integer number of seconds.
The later hop-count spec is much saner.
- Jonathan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Bloat] Random idea in reaction to all the discussion of TCP flavours - timestamps?
2011-03-16 22:48 ` Fred Baker
2011-03-16 23:23 ` Jonathan Morton
@ 2011-03-17 8:34 ` BeckW
1 sibling, 0 replies; 70+ messages in thread
From: BeckW @ 2011-03-17 8:34 UTC (permalink / raw)
To: fred; +Cc: bloat
> [RfC 791 vs RfC 1812 notion of TTL]
I'm aware that it is impossible to implement without complex hardware support but didn't know that the simpler interpretation got the IETF's blessing. So the question how a TCP implementations could profit from observing RfC 791-style TTLs remains hypothetical.
Wolfgang
^ permalink raw reply [flat|nested] 70+ messages in thread