CoDel AQM discussions
* [Codel] fq_codel : interval servo
@ 2012-08-31  6:55 Eric Dumazet
  2012-08-31 13:41 ` Jim Gettys
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Eric Dumazet @ 2012-08-31  6:55 UTC (permalink / raw)
  To: codel

For locally generated TCP traffic (on the host), we can override the 100 ms
interval value with the more accurate RTT estimation maintained by the TCP
stack (tp->srtt).

Datacenter workloads benefit from the shorter feedback loop (say, if the RTT
is below 1 ms, we can react 100 times faster to congestion).

Idea from Yuchung Cheng.
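
For a rough sense of the factor involved (a sketch assuming the CoDel
defaults of target = 5 ms and interval = 100 ms):

/* CoDel only starts dropping once the sojourn time has stayed above
 * target for one full interval.  With the fixed 100 ms interval, a flow
 * whose real RTT is ~1 ms waits ~100 of its own RTTs before the first
 * drop; seeding the interval from tp->srtt instead cuts that first
 * reaction down to roughly one RTT, hence the factor of 100 above.
 */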




* Re: [Codel] fq_codel : interval servo
  2012-08-31  6:55 [Codel] fq_codel : interval servo Eric Dumazet
@ 2012-08-31 13:41 ` Jim Gettys
  2012-08-31 13:50 ` [Codel] [RFC] fq_codel : interval servo on hosts Eric Dumazet
  2012-08-31 15:53 ` [Codel] fq_codel : interval servo Rick Jones
  2 siblings, 0 replies; 27+ messages in thread
From: Jim Gettys @ 2012-08-31 13:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: codel

On 08/30/2012 11:55 PM, Eric Dumazet wrote:
> On locally generated TCP traffic (host), we can override the 100 ms
> interval value using the more accurate RTT estimation maintained by TCP
> stack (tp->srtt)
>
> Datacenter workload benefits using shorter feedback (say if RTT is below
> 1 ms, we can react 100 times faster to a congestion)

Interesting.  Ethernet flow control might finally have some benefit then
in those environments...
                        - Jim

>
> Idea from Yuchung Cheng.
>
>
> _______________________________________________
> Codel mailing list
> Codel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/codel



* [Codel] [RFC] fq_codel : interval servo on hosts
  2012-08-31  6:55 [Codel] fq_codel : interval servo Eric Dumazet
  2012-08-31 13:41 ` Jim Gettys
@ 2012-08-31 13:50 ` Eric Dumazet
  2012-08-31 13:57   ` [Codel] [RFC v2] " Eric Dumazet
  2012-08-31 15:53 ` [Codel] fq_codel : interval servo Rick Jones
  2 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-08-31 13:50 UTC (permalink / raw)
  To: codel; +Cc: Tomas Hruby, Nandita Dukkipati, netdev

On Thu, 2012-08-30 at 23:55 -0700, Eric Dumazet wrote:
> On locally generated TCP traffic (host), we can override the 100 ms
> interval value using the more accurate RTT estimation maintained by TCP
> stack (tp->srtt)
> 
> Datacenter workload benefits using shorter feedback (say if RTT is below
> 1 ms, we can react 100 times faster to a congestion)
> 
> Idea from Yuchung Cheng.
> 

The Linux patch would be the following:

I'll do tests next week, but I am sending a raw patch right now if
anybody wants to try it.

Presumably we want to adjust target as well.

To get more precise srtt values in the datacenter, we might want to avoid
the 'one jiffy slack' on small values in tcp_rtt_estimator(), where we
force m to be 1 before the scaling by 8:

if (m == 0)
	m = 1;

We only need to force the least significant bit of srtt to be set.
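
For reference, a back-of-the-envelope check of the conversion done by
tcp_srtt_to_codel() in the patch below, assuming the common HZ = 1000 and
CODEL_SHIFT = 10 (one codel time unit = 1024 ns):

/*
 * factor   = (NSEC_PER_SEC >> (CODEL_SHIFT + 3)) / HZ
 *          = (1000000000 >> 13) / 1000 = 122070 / 1000 = 122
 *
 * srtt = 8   (one jiffy, i.e. 1 ms, scaled by 8)
 *   -> interval = 8 * 122   =   976 units ~=   976 * 1024 ns ~=  1.0 ms
 * srtt = 800 (100 ms, scaled by 8)
 *   -> interval = 800 * 122 = 97600 units ~= 97600 * 1024 ns ~= 99.9 ms
 *
 * The small undershoot comes from the integer divisions and is well
 * within the accuracy of srtt itself.
 */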


 net/sched/sch_fq_codel.c |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 9fc1c62..7d2fe35 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -25,6 +25,7 @@
 #include <net/pkt_sched.h>
 #include <net/flow_keys.h>
 #include <net/codel.h>
+#include <linux/tcp.h>
 
 /*	Fair Queue CoDel.
  *
@@ -59,6 +60,7 @@ struct fq_codel_sched_data {
 	u32		perturbation;	/* hash perturbation */
 	u32		quantum;	/* psched_mtu(qdisc_dev(sch)); */
 	struct codel_params cparams;
+	codel_time_t	default_interval;
 	struct codel_stats cstats;
 	u32		drop_overlimit;
 	u32		new_flow_count;
@@ -211,6 +213,14 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return NET_XMIT_SUCCESS;
 }
 
+/* Given TCP srtt evaluation, return codel interval.
+ * srtt is given in jiffies, scaled by 8.
+ */
+static codel_time_t tcp_srtt_to_codel(unsigned int srtt)
+{
+	return srtt * ((NSEC_PER_SEC >> (CODEL_SHIFT + 3)) / HZ);
+}
+
 /* This is the specific function called from codel_dequeue()
  * to dequeue a packet from queue. Note: backlog is handled in
  * codel, we dont need to reduce it here.
@@ -220,12 +230,21 @@ static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct fq_codel_flow *flow;
 	struct sk_buff *skb = NULL;
+	struct sock *sk;
 
 	flow = container_of(vars, struct fq_codel_flow, cvars);
 	if (flow->head) {
 		skb = dequeue_head(flow);
 		q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
 		sch->q.qlen--;
+		sk = skb->sk;
+		q->cparams.interval = q->default_interval;
+		if (sk && sk->sk_protocol == IPPROTO_TCP) {
+			u32 srtt = tcp_sk(sk)->srtt;
+
+			if (srtt)
+				q->cparams.interval = tcp_srtt_to_codel(srtt);
+		}
 	}
 	return skb;
 }
@@ -330,7 +349,7 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
 	if (tb[TCA_FQ_CODEL_INTERVAL]) {
 		u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
 
-		q->cparams.interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
+		q->default_interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
 	}
 
 	if (tb[TCA_FQ_CODEL_LIMIT])
@@ -441,7 +460,7 @@ static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_CODEL_LIMIT,
 			sch->limit) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_INTERVAL,
-			codel_time_to_us(q->cparams.interval)) ||
+			codel_time_to_us(q->default_interval)) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_ECN,
 			q->cparams.ecn) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_QUANTUM,




* [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-08-31 13:50 ` [Codel] [RFC] fq_codel : interval servo on hosts Eric Dumazet
@ 2012-08-31 13:57   ` Eric Dumazet
  2012-09-01  1:37     ` Yuchung Cheng
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-08-31 13:57 UTC (permalink / raw)
  To: codel; +Cc: Tomas Hruby, Nandita Dukkipati, netdev

On Fri, 2012-08-31 at 06:50 -0700, Eric Dumazet wrote:
> On Thu, 2012-08-30 at 23:55 -0700, Eric Dumazet wrote:
> > On locally generated TCP traffic (host), we can override the 100 ms
> > interval value using the more accurate RTT estimation maintained by TCP
> > stack (tp->srtt)
> > 
> > Datacenter workload benefits using shorter feedback (say if RTT is below
> > 1 ms, we can react 100 times faster to a congestion)
> > 
> > Idea from Yuchung Cheng.
> > 
> 
> Linux patch would be the following :
> 
> I'll do tests next week, but I am sending a raw patch right now if
> anybody wants to try it.
> 
> Presumably we also want to adjust target as well.
> 
> To get more precise srtt values in the datacenter, we might avoid the
> 'one jiffie slack' on small values in tcp_rtt_estimator(), as we force
> m to be 1 before the scaling by 8 :
> 
> if (m == 0)
> 	m = 1;
> 
> We only need to force the least significant bit of srtt to be set.
> 

Hmm, I also need to properly init default_interval after
codel_params_init(&q->cparams):

 net/sched/sch_fq_codel.c |   24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 9fc1c62..f04ff6a 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -25,6 +25,7 @@
 #include <net/pkt_sched.h>
 #include <net/flow_keys.h>
 #include <net/codel.h>
+#include <linux/tcp.h>
 
 /*	Fair Queue CoDel.
  *
@@ -59,6 +60,7 @@ struct fq_codel_sched_data {
 	u32		perturbation;	/* hash perturbation */
 	u32		quantum;	/* psched_mtu(qdisc_dev(sch)); */
 	struct codel_params cparams;
+	codel_time_t	default_interval;
 	struct codel_stats cstats;
 	u32		drop_overlimit;
 	u32		new_flow_count;
@@ -211,6 +213,14 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return NET_XMIT_SUCCESS;
 }
 
+/* Given TCP srtt evaluation, return codel interval.
+ * srtt is given in jiffies, scaled by 8.
+ */
+static codel_time_t tcp_srtt_to_codel(unsigned int srtt)
+{
+	return srtt * ((NSEC_PER_SEC >> (CODEL_SHIFT + 3)) / HZ);
+}
+
 /* This is the specific function called from codel_dequeue()
  * to dequeue a packet from queue. Note: backlog is handled in
  * codel, we dont need to reduce it here.
@@ -220,12 +230,21 @@ static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct fq_codel_flow *flow;
 	struct sk_buff *skb = NULL;
+	struct sock *sk;
 
 	flow = container_of(vars, struct fq_codel_flow, cvars);
 	if (flow->head) {
 		skb = dequeue_head(flow);
 		q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
 		sch->q.qlen--;
+		sk = skb->sk;
+		q->cparams.interval = q->default_interval;
+		if (sk && sk->sk_protocol == IPPROTO_TCP) {
+			u32 srtt = tcp_sk(sk)->srtt;
+
+			if (srtt)
+				q->cparams.interval = tcp_srtt_to_codel(srtt);
+		}
 	}
 	return skb;
 }
@@ -330,7 +349,7 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
 	if (tb[TCA_FQ_CODEL_INTERVAL]) {
 		u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
 
-		q->cparams.interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
+		q->default_interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
 	}
 
 	if (tb[TCA_FQ_CODEL_LIMIT])
@@ -395,6 +414,7 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
 	INIT_LIST_HEAD(&q->new_flows);
 	INIT_LIST_HEAD(&q->old_flows);
 	codel_params_init(&q->cparams);
+	q->default_interval = q->cparams.interval;
 	codel_stats_init(&q->cstats);
 	q->cparams.ecn = true;
 
@@ -441,7 +461,7 @@ static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_CODEL_LIMIT,
 			sch->limit) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_INTERVAL,
-			codel_time_to_us(q->cparams.interval)) ||
+			codel_time_to_us(q->default_interval)) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_ECN,
 			q->cparams.ecn) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_QUANTUM,




* Re: [Codel] fq_codel : interval servo
  2012-08-31  6:55 [Codel] fq_codel : interval servo Eric Dumazet
  2012-08-31 13:41 ` Jim Gettys
  2012-08-31 13:50 ` [Codel] [RFC] fq_codel : interval servo on hosts Eric Dumazet
@ 2012-08-31 15:53 ` Rick Jones
  2012-08-31 16:23   ` Eric Dumazet
  2012-08-31 16:40   ` Jim Gettys
  2 siblings, 2 replies; 27+ messages in thread
From: Rick Jones @ 2012-08-31 15:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: codel

On 08/30/2012 11:55 PM, Eric Dumazet wrote:
> On locally generated TCP traffic (host), we can override the 100 ms
> interval value using the more accurate RTT estimation maintained by TCP
> stack (tp->srtt)
>
> Datacenter workload benefits using shorter feedback (say if RTT is below
> 1 ms, we can react 100 times faster to a congestion)
>
> Idea from Yuchung Cheng.

Mileage varies of course, but what are the odds of a datacenter's 
end-system's NIC(s) being the bottleneck point?  Is it worth pinging a 
couple additional cache lines around (looking at the v2 email, I'm 
ass-u-me-ing that sk_protocol and tp->srtt are on different cache lines)?

If fq_codel is going to be a little bit pregnant and "layer violate" :) 
why stop at TCP?

Is this change rectifying an "unfairness" with the existing fq_codel and 
the 100ms for all when two TCP flows have very different srtts?

Some perhaps overly paranoid questions:

Does it matter that the value of tp->srtt at the time fq_codel dequeues 
will not necessarily be the same as when that segment was queued?

Is there any chance of the socket going away between the time the packet 
was queued and the time it was dequeued? (Or tp->srtt becoming "undefined?")

rick jones


* Re: [Codel] fq_codel : interval servo
  2012-08-31 15:53 ` [Codel] fq_codel : interval servo Rick Jones
@ 2012-08-31 16:23   ` Eric Dumazet
  2012-08-31 16:59     ` Dave Taht
  2012-08-31 16:40   ` Jim Gettys
  1 sibling, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-08-31 16:23 UTC (permalink / raw)
  To: Rick Jones; +Cc: codel

On Fri, 2012-08-31 at 08:53 -0700, Rick Jones wrote:
> On 08/30/2012 11:55 PM, Eric Dumazet wrote:
> > On locally generated TCP traffic (host), we can override the 100 ms
> > interval value using the more accurate RTT estimation maintained by TCP
> > stack (tp->srtt)
> >
> > Datacenter workload benefits using shorter feedback (say if RTT is below
> > 1 ms, we can react 100 times faster to a congestion)
> >
> > Idea from Yuchung Cheng.
> 
> Mileage varies of course, but what are the odds of a datacenter's 
> end-system's NIC(s) being the bottleneck point?  Is it worth pinging a 
> couple additional cache lines around (looking at the v2 email, I'm 
> ass-u-me-ing that sk_protocol and tp->srtt are on different cache lines)?
> 

It's certainly worth pinging additional cache lines.

A host consumes almost no CPU in the qdisc layer (at least not in fq_codel).

A router won't use this stuff (as skb->sk will be NULL).

> If fq_codel is going to be a little bit pregnant and "layer violate" :) 
> why stop at TCP?

Who said I would stop at TCP? ;)

> 
> Is this change rectifying an "unfairness" with the existing fq_codel and 
> the 100ms for all when two TCP flows have very different srtts?
> 

codel has to use a single interval value, and we use an average value.
It seems to work quite well.

fq_codel has the opportunity to get a per-TCP-flow interval value,
and this should give better behavior.

> Some perhaps overly paranoid questions:
> 
> Does it matter that the value of tp->srtt at the time fq_codel dequeues 
> will not necessarily be the same as when that segment was queued?
> 

What matters is that we use the latest srtt value/estimation, which is
what this patch does.

> Is there any chance of the socket going away between the time the packet 
> was queued and the time it was dequeued? (Or tp->srtt becoming "undefined?")

When skb->sk is non-NULL, we hold a reference to the socket; it cannot
disappear under us.





* Re: [Codel] fq_codel : interval servo
  2012-08-31 15:53 ` [Codel] fq_codel : interval servo Rick Jones
  2012-08-31 16:23   ` Eric Dumazet
@ 2012-08-31 16:40   ` Jim Gettys
  2012-08-31 16:49     ` Jonathan Morton
  1 sibling, 1 reply; 27+ messages in thread
From: Jim Gettys @ 2012-08-31 16:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: codel

On 08/31/2012 08:53 AM, Rick Jones wrote:
> On 08/30/2012 11:55 PM, Eric Dumazet wrote:
>> On locally generated TCP traffic (host), we can override the 100 ms
>> interval value using the more accurate RTT estimation maintained by TCP
>> stack (tp->srtt)
>>
>> Datacenter workload benefits using shorter feedback (say if RTT is below
>> 1 ms, we can react 100 times faster to a congestion)
>>
>> Idea from Yuchung Cheng.
>
> Mileage varies of course, but what are the odds of a datacenter's
> end-system's NIC(s) being the bottleneck point?
Ergo my comment about Ethernet flow control finally being possibly more
help than hurt; clearly if the bottleneck is kept in the sending host
more of the time, it would help.

I certainly don't know how often the end-systems' NICs are the
bottleneck today without flow control; maybe a datacenter type might
have insight into that.
                    - Jim



* Re: [Codel] fq_codel : interval servo
  2012-08-31 16:40   ` Jim Gettys
@ 2012-08-31 16:49     ` Jonathan Morton
  2012-08-31 17:15       ` Jim Gettys
  0 siblings, 1 reply; 27+ messages in thread
From: Jonathan Morton @ 2012-08-31 16:49 UTC (permalink / raw)
  To: Jim Gettys; +Cc: codel


On 31 Aug, 2012, at 7:40 pm, Jim Gettys wrote:

> On 08/31/2012 08:53 AM, Rick Jones wrote:
>> On 08/30/2012 11:55 PM, Eric Dumazet wrote:
>>> On locally generated TCP traffic (host), we can override the 100 ms
>>> interval value using the more accurate RTT estimation maintained by TCP
>>> stack (tp->srtt)
>>> 
>>> Datacenter workload benefits using shorter feedback (say if RTT is below
>>> 1 ms, we can react 100 times faster to a congestion)
>>> 
>>> Idea from Yuchung Cheng.
>> 
>> Mileage varies of course, but what are the odds of a datacenter's
>> end-system's NIC(s) being the bottleneck point?

> Ergo my comment about Ethernet flow control finally being possibly more
> help than hurt; clearly if the bottleneck is kept in the sending host
> more of the time, it would help.
> 
> I certainly don't know how often the end-system's NIC's are the
> bottleneck today without flow control; maybe a datacenter type might
> have insight into that.

Consider a fileserver with ganged 10GE NICs serving an office full of GigE workstations.

At 9am on Monday, everyone arrives and switches on their workstation, which (because the org has made them diskless) causes pretty much the same set of data to be sent to each in rapid succession.  The fileserver satisfies all but the first of these from cache, so it can saturate all of its NICs in theory.  In that case a queue should exist even if there are no downstream bottlenecks.

Alternatively, one floor at a time boots up at once - say the call-centre starts up at 7am, the developers stumble in at 10am, and the management types wander in at 11:30.  :-)  Then the bottleneck is the single 10GE link to each floor, rather than the fileserver's own NICs.

That's all theoretical, of course - I've never built a datacentre network so I don't know how it's done in practice.

 - Jonathan Morton



* Re: [Codel] fq_codel : interval servo
  2012-08-31 16:23   ` Eric Dumazet
@ 2012-08-31 16:59     ` Dave Taht
  2012-09-01 12:53       ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Taht @ 2012-08-31 16:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: codel

On Fri, Aug 31, 2012 at 9:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-08-31 at 08:53 -0700, Rick Jones wrote:
>> On 08/30/2012 11:55 PM, Eric Dumazet wrote:
>> > On locally generated TCP traffic (host), we can override the 100 ms
>> > interval value using the more accurate RTT estimation maintained by TCP
>> > stack (tp->srtt)
>> >
>> > Datacenter workload benefits using shorter feedback (say if RTT is below
>> > 1 ms, we can react 100 times faster to a congestion)
>> >
>> > Idea from Yuchung Cheng.
>>
>> Mileage varies of course, but what are the odds of a datacenter's
>> end-system's NIC(s) being the bottleneck point?  Is it worth pinging a
>> couple additional cache lines around (looking at the v2 email, I'm
>> ass-u-me-ing that sk_protocol and tp->srtt are on different cache lines)?
>>
>
> Its certainly worth pinging additional cache lines.
>
> A host consume almost no cpu in qdisc layer (at least not in fq_codel)
>
> A router wont use this stuff (as skb->sk will be NULL)
>
>> If fq_codel is going to be a little bit pregnant and "layer violate" :)
>> why stop at TCP?
>
> Who said I would stop at TCP ? ;)

Heh. "Vith Codel I vill rule ze vorld!"

I have not, of late, been focused on TCP (torn up about uTP,
actually), but rather on the kinds of problems that occur in the
home gateway.

I realize that 10GigE and datacenter host based work is sexy and fun,
but getting stuff that runs well in today's 1-20Mbit environments is
my own priority, going up to 100Mbit, with something that can be
embedded in a SoC. The latest generation of SoCs all do QoS in
hardware... badly.

Secondly, finding things that would work on the head ends (CMTSes and
DSLAMs) also ranks way up there.

Fixing wifi follows that in priority...

While many of the things under tweak have commonality, I'm thinking
more and more that it would be saner for me to fork off for a while and
do an "mfq_codel", which would let me A) test codel, fq_codel and
mfq_codel enhancements on the same build and testbed, B) play with
stuff that works better at lower bandwidths, and C) play with wifi.

Two notes from my slow and often buggy work that I haven't mentioned
on this list yet:

1) The current fq_codel implementation wipes out the codel state
every time a queue is emptied. This happens rarely at 10GigE speeds,
but quite often below 100Mbit. Keeping the state around helps a LOT on
queue depth - for example (with the rest of my buggy patchset) I end
up with a very nice avg 50 packet backlog at 100Mbit, low median,
stddev, etc, with 8 streams, and with 150 streams, 70 or so...

"fixing" that just involved removing codel_init_vars from the "is this
a new queue" routine.

Some of the other patches I've thrown around are about seeing odd
behaviors on temporarily empty queues, like:

2) maxpacket can get set to the largest packet ever seen by a queue
(like a TSO packet)

....
                } else {
+                       stats->maxpacket = qdisc_pkt_len(skb);
                        vars->count = 1;
                        vars->rec_inv_sqrt = ~0U >> REC_INV_SQRT_SHIFT;
                        }

So resetting it will A) have an fq_codel queue track the actual packet
size (acks, for example), which might make codel more accurate on those
sorts of streams... and B) address the fact that fiddling with ethtool
to turn off gso/tso/etc was not reflected in codel's estimates.

B) has occasionally caused a great deal of headscratching.




>
>>
>> Is this change rectifying an "unfairness" with the existing fq_codel and
>> the 100ms for all when two TCP flows have very different srtts?
>>
>
> codel has to use a single interval value, and we use an average value.
> It seems to work quite well.
>
> fq_codel has the opportunity to get a per tcp flow interval value.
> And this should give better behavior.
>
>> Some perhaps overly paranoid questions:
>>
>> Does it matter that the value of tp->srtt at the time fq_codel dequeues
>> will not necessarily be the same as when that segment was queued?
>>
>
> It matters we use the last srtt value/estimation, which is done in this
> patch.
>
>> Is there any chance of the socket going away between the time the packet
>> was queued and the time it was dequeued? (Or tp->srtt becoming "undefined?")
>
> When skb->sk is non NULL, we hold a reference to the socket, it cannot
> disappear under us.
>
>
>
> _______________________________________________
> Codel mailing list
> Codel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/codel



-- 
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"


* Re: [Codel] fq_codel : interval servo
  2012-08-31 16:49     ` Jonathan Morton
@ 2012-08-31 17:15       ` Jim Gettys
  2012-08-31 17:31         ` Rick Jones
  0 siblings, 1 reply; 27+ messages in thread
From: Jim Gettys @ 2012-08-31 17:15 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: codel

On 08/31/2012 09:49 AM, Jonathan Morton wrote:
> On 31 Aug, 2012, at 7:40 pm, Jim Gettys wrote:
>
>> On 08/31/2012 08:53 AM, Rick Jones wrote:
>>> On 08/30/2012 11:55 PM, Eric Dumazet wrote:
>>>> On locally generated TCP traffic (host), we can override the 100 ms
>>>> interval value using the more accurate RTT estimation maintained by TCP
>>>> stack (tp->srtt)
>>>>
>>>> Datacenter workload benefits using shorter feedback (say if RTT is below
>>>> 1 ms, we can react 100 times faster to a congestion)
>>>>
>>>> Idea from Yuchung Cheng.
>>> Mileage varies of course, but what are the odds of a datacenter's
>>> end-system's NIC(s) being the bottleneck point?
>> Ergo my comment about Ethernet flow control finally being possibly more
>> help than hurt; clearly if the bottleneck is kept in the sending host
>> more of the time, it would help.
>>
>> I certainly don't know how often the end-system's NIC's are the
>> bottleneck today without flow control; maybe a datacenter type might
>> have insight into that.
> Consider a fileserver with ganged 10GE NICs serving an office full of GigE workstations.
>
> At 9am on Monday, everyone arrives and switches on their workstation, which (because the org has made them diskless) causes pretty much the same set of data to be sent to each in rapid succession.  The fileserver satisfies all but the first of these from cache, so it can saturate all of it's NICs in theory.  In that case a queue should exist even if there are no downstream bottlenecks.
>
> Alternatively, one floor at a time boots up at once - say the call-centre starts up at 7am, the developers stumble in at 10am, and the management types wander in at 11:30.  :-)  Then the bottleneck is the single 10GE link to each floor, rather than the fileserver's own NICs.
>
> That's all theoretical, of course - I've never built a datacentre network so I don't know how it's done in practice.
>
>  - Jonathan Morton
>
BTW, there is one very common case we all share that will benefit:

Think of your home network: you have 1GbE (or maybe only 100 Mbit/s)
Ethernet to your other machines...

What is more, consumer Ethernet switches do do flow control, whether you
want them to or not.  So you routinely have queues build up, even on
Ethernet, in those environments today.

            Jim

(I wasn't even aware of Ethernet flow control's existence until 2 years
ago, when I wanted to understand the funny frames wireshark was
reporting on my home network).  Then I read up on it...


* Re: [Codel] fq_codel : interval servo
  2012-08-31 17:15       ` Jim Gettys
@ 2012-08-31 17:31         ` Rick Jones
  2012-08-31 17:44           ` Jim Gettys
  0 siblings, 1 reply; 27+ messages in thread
From: Rick Jones @ 2012-08-31 17:31 UTC (permalink / raw)
  To: Jim Gettys; +Cc: codel

On 08/31/2012 10:15 AM, Jim Gettys wrote:
> What is more, consumer ethernet switches do do flow control, whether you
> want them to or not.

My understanding is that Ethernet flow control (whatever its
802.3-mumble designation might be?) is negotiated when the link is
brought up, and that both sides must agree before it will be active.
So, if you (the end station) do not want flow control, you can simply
not accept it during link-up.

Under Linux, the ethtool utility can be used to affect the configuration 
of "pause" (the name coming from the name of the frames used by flow 
control - "pause frames" if I recall correctly)

rick


* Re: [Codel] fq_codel : interval servo
  2012-08-31 17:31         ` Rick Jones
@ 2012-08-31 17:44           ` Jim Gettys
  0 siblings, 0 replies; 27+ messages in thread
From: Jim Gettys @ 2012-08-31 17:44 UTC (permalink / raw)
  To: Rick Jones; +Cc: codel

On 08/31/2012 10:31 AM, Rick Jones wrote:
> On 08/31/2012 10:15 AM, Jim Gettys wrote:
>> What is more, consumer ethernet switches do do flow control, whether you
>> want them to or not.
>
> My understanding is that Ethernet flow control (what ever its
> 802.1mumble might be?) is negotiated when the link is brought-up, and
> that both sides must agree before it will be active.  So, if you (the
> end station) do not want flow control, you can simply not accept it
> during link-up.
>
> Under Linux, the ethtool utility can be used to affect the
> configuration of "pause" (the name coming from the name of the frames
> used by flow control - "pause frames" if I recall correctly)

The cheap consumer switches typically have it on, and most Ethernet
drivers have it on by default.

That's how/why I found out about pause frames in the first place; I
wasn't looking for it, and the existence of these frames in wireshark
was a surprise to me.

This can cause "interesting" performance problems most consumers won't
understand when switches are layered.

Enterprise switches will often have it off by default, or plain
never generate flow control frames at all.


                        - Jim




* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-08-31 13:57   ` [Codel] [RFC v2] " Eric Dumazet
@ 2012-09-01  1:37     ` Yuchung Cheng
  2012-09-01 12:51       ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Yuchung Cheng @ 2012-09-01  1:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tomas Hruby, Nandita Dukkipati, netdev, codel

On Fri, Aug 31, 2012 at 6:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-08-31 at 06:50 -0700, Eric Dumazet wrote:
>> On Thu, 2012-08-30 at 23:55 -0700, Eric Dumazet wrote:
>> > On locally generated TCP traffic (host), we can override the 100 ms
>> > interval value using the more accurate RTT estimation maintained by TCP
>> > stack (tp->srtt)
>> >
>> > Datacenter workload benefits using shorter feedback (say if RTT is below
>> > 1 ms, we can react 100 times faster to a congestion)
>> >
>> > Idea from Yuchung Cheng.
>> >
>>
>> Linux patch would be the following :
>>
>> I'll do tests next week, but I am sending a raw patch right now if
>> anybody wants to try it.
>>
>> Presumably we also want to adjust target as well.
>>
>> To get more precise srtt values in the datacenter, we might avoid the
>> 'one jiffie slack' on small values in tcp_rtt_estimator(), as we force
>> m to be 1 before the scaling by 8 :
>>
>> if (m == 0)
>>       m = 1;
>>
>> We only need to force the least significant bit of srtt to be set.
>>
Just curious: tp->srtt is a very rough estimator, e.g., delayed ACKs
can easily add 40-200 ms of fuzziness. Will this affect short flows?


>
> Hmm, I also need to properly init default_interval after
> codel_params_init(&q->cparams) :
>
>  net/sched/sch_fq_codel.c |   24 ++++++++++++++++++++++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
> index 9fc1c62..f04ff6a 100644
> --- a/net/sched/sch_fq_codel.c
> +++ b/net/sched/sch_fq_codel.c
> @@ -25,6 +25,7 @@
>  #include <net/pkt_sched.h>
>  #include <net/flow_keys.h>
>  #include <net/codel.h>
> +#include <linux/tcp.h>
>
>  /*     Fair Queue CoDel.
>   *
> @@ -59,6 +60,7 @@ struct fq_codel_sched_data {
>         u32             perturbation;   /* hash perturbation */
>         u32             quantum;        /* psched_mtu(qdisc_dev(sch)); */
>         struct codel_params cparams;
> +       codel_time_t    default_interval;
>         struct codel_stats cstats;
>         u32             drop_overlimit;
>         u32             new_flow_count;
> @@ -211,6 +213,14 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>         return NET_XMIT_SUCCESS;
>  }
>
> +/* Given TCP srtt evaluation, return codel interval.
> + * srtt is given in jiffies, scaled by 8.
> + */
> +static codel_time_t tcp_srtt_to_codel(unsigned int srtt)
> +{
> +       return srtt * ((NSEC_PER_SEC >> (CODEL_SHIFT + 3)) / HZ);
> +}
> +
>  /* This is the specific function called from codel_dequeue()
>   * to dequeue a packet from queue. Note: backlog is handled in
>   * codel, we dont need to reduce it here.
> @@ -220,12 +230,21 @@ static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
>         struct fq_codel_sched_data *q = qdisc_priv(sch);
>         struct fq_codel_flow *flow;
>         struct sk_buff *skb = NULL;
> +       struct sock *sk;
>
>         flow = container_of(vars, struct fq_codel_flow, cvars);
>         if (flow->head) {
>                 skb = dequeue_head(flow);
>                 q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
>                 sch->q.qlen--;
> +               sk = skb->sk;
> +               q->cparams.interval = q->default_interval;
> +               if (sk && sk->sk_protocol == IPPROTO_TCP) {
> +                       u32 srtt = tcp_sk(sk)->srtt;
> +
> +                       if (srtt)
> +                               q->cparams.interval = tcp_srtt_to_codel(srtt);
> +               }
>         }
>         return skb;
>  }
> @@ -330,7 +349,7 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
>         if (tb[TCA_FQ_CODEL_INTERVAL]) {
>                 u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
>
> -               q->cparams.interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
> +               q->default_interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
>         }
>
>         if (tb[TCA_FQ_CODEL_LIMIT])
> @@ -395,6 +414,7 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
>         INIT_LIST_HEAD(&q->new_flows);
>         INIT_LIST_HEAD(&q->old_flows);
>         codel_params_init(&q->cparams);
> +       q->default_interval = q->cparams.interval;
>         codel_stats_init(&q->cstats);
>         q->cparams.ecn = true;
>
> @@ -441,7 +461,7 @@ static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
>             nla_put_u32(skb, TCA_FQ_CODEL_LIMIT,
>                         sch->limit) ||
>             nla_put_u32(skb, TCA_FQ_CODEL_INTERVAL,
> -                       codel_time_to_us(q->cparams.interval)) ||
> +                       codel_time_to_us(q->default_interval)) ||
>             nla_put_u32(skb, TCA_FQ_CODEL_ECN,
>                         q->cparams.ecn) ||
>             nla_put_u32(skb, TCA_FQ_CODEL_QUANTUM,
>
>


* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-01  1:37     ` Yuchung Cheng
@ 2012-09-01 12:51       ` Eric Dumazet
  2012-09-04 15:10         ` Nandita Dukkipati
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-09-01 12:51 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Tomas Hruby, Nandita Dukkipati, netdev, codel

On Fri, 2012-08-31 at 18:37 -0700, Yuchung Cheng wrote:

> Just curious: tp->srtt is a very rough estimator, e.g., Delayed-ACks
> can easily add 40 - 200ms fuzziness. Will this affect short flows?

Good point

Delayed ACKs shouldn't matter, because they happen when the flow has been
idle for a while.

I guess we should clamp the srtt to the default interval:

if (srtt)
	q->cparams.interval = min(tcp_srtt_to_codel(srtt),
				  q->default_interval);





* Re: [Codel] fq_codel : interval servo
  2012-08-31 16:59     ` Dave Taht
@ 2012-09-01 12:53       ` Eric Dumazet
  2012-09-02 18:08         ` Dave Taht
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-09-01 12:53 UTC (permalink / raw)
  To: Dave Taht; +Cc: codel

On Fri, 2012-08-31 at 09:59 -0700, Dave Taht wrote:

> I realize that 10GigE and datacenter host based work is sexy and fun,
> but getting stuff that runs well in today's 1-20Mbit environments is
> my own priority, going up to 100Mbit, with something that can be
> embedded in a SoC. The latest generation of SoCs all do QoS in
> hardware... badly.

Maybe 'datacenter' was a badly chosen word, and you obviously jumped on
it because it means different things to you.

The point was that when your machine has flows with quite different RTTs,
1 ms on your local LAN and 100 ms to a different continent, the current
control law might clamp long-distance communications, or have a slow
response time for the LAN traffic.

The shorter the path, the sooner you should drop packets, because
losses have much less impact on latencies.

Yuchung's idea sounds very good, and my intuition is that it will give
tremendous results for standard Linux qdisc setups (a single qdisc per
device).

To get similar effects, you could use two (or more) fq_codels per
Ethernet device:

One fq_codel with interval = 1 or 5 ms for LAN communications
One fq_codel with interval = 100 ms for other communications  
tc filters to select the right qdisc by destination addresses

Then we are a bit far from the codel spirit (a no-knob qdisc).

I am pretty sure you noticed that if your Ethernet adapter is only used
for LAN communications, you have to set the codel interval to a much
smaller value than the 100 ms default to get a reasonably fast answer to
congestion.

Just make this automatic, because people don't want to think about it.





* Re: [Codel] fq_codel : interval servo
  2012-09-01 12:53       ` Eric Dumazet
@ 2012-09-02 18:08         ` Dave Taht
  2012-09-02 18:17           ` Dave Taht
  2012-09-02 23:23           ` Eric Dumazet
  0 siblings, 2 replies; 27+ messages in thread
From: Dave Taht @ 2012-09-02 18:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: codel

On Sat, Sep 1, 2012 at 5:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-08-31 at 09:59 -0700, Dave Taht wrote:
>
>> I realize that 10GigE and datacenter host based work is sexy and fun,
>> but getting stuff that runs well in today's 1-20Mbit environments is
>> my own priority, going up to 100Mbit, with something that can be
>> embedded in a SoC. The latest generation of SoCs all do QoS in
>> hardware... badly.
>
> Maybe 'datacenter' word was badly chosen and you obviously jumped on it,
> because it meant different things for you.

I am hypersensitive about optimizing for sub-ms problems when there are
huge multi-second problems like in cable, wifi, and cellular. Recent paper:

http://conferences.sigcomm.org/sigcomm/2012/paper/cellnet/p1.pdf

Sorry.

If the srtt idea can scale UP as well as down sanely, cool. I'm
concerned about how different TCPs might react to this and have a
long comment about the placement of this at this layer at the bottom
of this email.

> Point was that when your machine has flows with quite different RTT, 1
> ms on your local LAN, and 100 ms on different continent, current control
> law might clamp long distance communications, or have slow response time
> for the LAN traffic.

With fq_codel, that's far less likely; and if you do have a collision
between long-distance and local streams in a single queue there, what
will happen if you fiddle with srtt?

> The shortest path you have, the sooner you should drop packets because
> losses have much less impact on latencies.

Sure.

> Yuchung idea sounds very good and my intuition is it will give
> tremendous results for standard linux qdisc setups ( a single qdisc per
> device)

I tend to agree.

> To get similar effects, you could use two (or more) fq codels per
> ethernet device.

Ugh.

> One fq_codel with interval = 1 or 5 ms for LAN communications
> One fq_codel with interval = 100 ms for other communications
and one mfq_codel with a calculated maxpacket, weird interval, etc
for wifi.

> tc filters to select the right qdisc by destination addresses

Meh. A simple default might be "Am I going out the default route for this?"
> Then we are a bit far from codel spirit (no knob qdisc)
>
> I am pretty sure you noticed that if your ethernet adapter is only used
> for LAN communications, you have to setup codel interval to a much
> smaller value than the 100 ms default to get reasonably fast answer to
> congestion.

At 100Mbit (as I've noted elsewhere), BQL chooses defaults about double the
optimum (6-7k), and gso is currently left on. With those disabled, I tend to
run a pretty congested network, and rarely notice.  That does not mean that
reaction time isn't an issue; it is merely masked so well that I don't care.

> Just make this automatic, because people dont want to think about it.

Like you, I want one qdisc to rule them all, with sane defaults.

I do feel it is very necessary to add one pfifo_fast-like behavior to
fq_codel: deprioritizing background traffic, in its own
set of fq'd flows. A simple way to do that is to have a bkweight of,
say, 20, and only check "q->slow_flows" on that interval of packet
deliveries.

This is the only way I can think of to survive bittorrent-like flows, and to
capture the intent of traffic marked background.
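
A minimal sketch of what that could look like in the dequeue path (purely
illustrative; slow_flows, bkweight and bkcount are hypothetical fields,
not in any tree):

static struct list_head *fq_codel_pick_list(struct fq_codel_sched_data *q)
{
	/* Serve the background list only once every bkweight dequeues,
	 * so bulk traffic cannot starve the normal flows but still gets
	 * a bounded share of the link.
	 */
	if (!list_empty(&q->slow_flows) && ++q->bkcount >= q->bkweight) {
		q->bkcount = 0;
		return &q->slow_flows;
	}
	if (!list_empty(&q->new_flows))
		return &q->new_flows;
	return &q->old_flows;
}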

However, I did want to talk to the using-codel-to-solve-everything issue
for fixing host bufferbloat...

Fixing host bufferbloat by adding local tcp awareness is a neat idea,
don't let me stop you! But...

Codel will push stuff down to, but not below, 5ms of latency (or
target). In fq_codel you will typically end up with 1 packet outstanding in
each active queue under heavy load. At 10Mbit it's pretty easy to
have it strain mightily and fail to get to 5ms, particularly on torrent-like
workloads.
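
A quick sanity check of why that is, with round numbers (1500-byte packets
at 10Mbit):

/* One 1500-byte packet takes 12000 bits / 10 Mbit/s = 1.2 ms to
 * serialize, so even the best case of one packet per active flow,
 * 150 flows deep, is about 150 * 1.2 ms = 180 ms of standing queue -
 * two orders of magnitude above the 5 ms target.
 */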

The "right" amount of host latency to aim for is ... 0, or as close to it as
you can get.  Fiddling with codel target and interval on the host to
get less host latency is well and good, but you can't get to 0 that way...

The best queue on a host is no extra queue.

I spent some time evaluating Linux fq_codel vs the ns2 nfq_codel version I
just got working. With 150 bidirectional competing streams, at 100Mbit,
it retained about 30% fewer packets in queue (110 vs 140). Next up
on my list is longer RTTs and wifi, but all else was pretty equivalent.

The effect of fiddling with /proc/sys/net/ipv4/tcp_limit_output_bytes
was even more remarkable. At 6000, I would get down to
a nice steady 71-81 packets in queue on that 150-stream workload.

So, I started thinking through and playing with how TSQ works:

At one hop, 100Mbit, with a BQL of 3000 and a tcp_limit_output_bytes of 6000,
all offloads off, nfq_codel on both ends, I get single-stream throughput
of 92.85Mbit.  Backlog in qdisc is 0.

2 netperf streams, bidirectional: 91.47 each, darn close to theoretical, less
than one packet in the backlog.

4 streams backlogs a little over 3. (and sums to 91.94 in each direction)

8, backlog of 8. (optimal throughput)

Repeating the 8 stream test with tcp_output_limit of 1500, I get
packets outstanding of around 3, and optimal throughput. (1 stream test:
42Mbit throughput (obviously starved), 150 streams: 82...)

8 streams, limit set to 127k, I get 50 packets outstanding in the queue,
and the same throughput. (150 streams, ~100)

So I might argue that a more "right" number for tcp_output_bytes is
not 128k per TCP socket, but (BQL_limit*2/active_sockets), in conjunction
with fq_codel. I realize that that raises interesting questions as to when
to use TSO/GSO, and how to schedule tcp packet releases, and pushes
the window reduction issue all the way up into the tcp stack rather
than responding to indications from the qdisc... but it does
get you closer to a 0 backlog in qdisc.
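
Plugging the numbers from the tests above into that rule of thumb (just to
illustrate the order of magnitude, not a tuning recommendation):

/* BQL limit 3000 bytes, 8 active sockets:
 *   2 * 3000 / 8 = 750 bytes per socket,
 * i.e. each socket keeps at most about half an MTU-sized packet queued
 * below the stack - consistent with the observation that a 1500-byte
 * limit already sustains full throughput with ~3 packets of qdisc
 * backlog, while the 128k default leaves ~50 packets sitting there.
 */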

And *usually* the bottleneck link is not on the host but on something
in between, and that's where your signalling comes from, anyway.


-- 
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"


* Re: [Codel] fq_codel : interval servo
  2012-09-02 18:08         ` Dave Taht
@ 2012-09-02 18:17           ` Dave Taht
  2012-09-02 23:28             ` Eric Dumazet
  2012-09-02 23:23           ` Eric Dumazet
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Taht @ 2012-09-02 18:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: codel

On Sun, Sep 2, 2012 at 11:08 AM, Dave Taht <dave.taht@gmail.com> wrote:

In reviewing this mail I realized I used three different names
for tcp_limit_output_bytes, corrected below...

> Codel will push stuff down to, but not below, 5ms of latency (or
> target). In fq_codel you will typically end up with 1 packet outstanding in
> each active queue under heavy load. At 10Mbit it's pretty easy to
> have it strain mightily and fail to get to 5ms, particularly on torrent-like
> workloads.
>
> The "right" amount of host latency to aim for is ... 0, or as close to it as
> you can get.  Fiddling with codel target and interval on the host to
> get less host latency is well and good, but you can't get to 0 that way...
>
> The best queue on a host is no extra queue.
>
> I spent some time evaluating linux fq_codel vs the ns2 nfq_codel version I
> just got working. In 150 bidirectional competing streams, at 100Mbit,
> it retained about 30% less packets in queue (110 vs 140). Next up
> on my list is longer RTTs and wifi, but all else was pretty equivalent.
>
> The effects of fiddling with /proc/sys/net/ipv4/tcp_limit_output_bytes
> was even more remarkable. At 6000, I would get down to
> a nice steady 71-81 packets in queue on that 150 stream workload.
>
> So, I started thinking through and playing with how TSQ works:
>
> At one hop 100Mbit, with a BQL of 3000 and a tcp_limit_output_bytes of 6000,
> all offloads off, nfq_codel on both ends, I get single stream throughoutput
> of 92.85Mbit.  Backlog in qdisc is, 0.
>
> 2 netperf streams, bidirectional: 91.47 each, darn close to theoretical, less
> than one packet in the backlog.
>
> 4 streams backlogs a little over 3. (and sums to 91.94 in each direction)
>
> 8, backlog of 8. (optimal throughput)
>
> Repeating the 8 stream test with tcp_limit_output_bytes of 1500, I get
> packets outstanding of around 3, and optimal throughput. (1 stream test:
> 42Mbit throughput (obviously starved), 150 streams: 82...)
>
> 8 streams, limit set to 127k, I get 50 packets outstanding in the queue,
> and the same throughput. (150 streams, ~100)
>
> So I might argue that a more "right" number for tcp_limit_output_bytes is
> not 128k per TCP socket, but (BQL_limit*2/active_sockets), in conjunction
> with fq_codel. I realize that that raises interesting questions as to when
> to use TSO/GSO, and how to schedule tcp packet releases, and pushes
> the window reduction issue all the way up into the tcp stack rather
> than responding to indications from the qdisc... but it does
> get you closer to a 0 backlog in qdisc.
>
> And *usually* the bottleneck link is not on the host but on something
> inbetween, and that's where your signalling comes from, anyway.
>
>
> --
> Dave Täht
> http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
> with fq_codel!"



-- 
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"


* Re: [Codel] fq_codel : interval servo
  2012-09-02 18:08         ` Dave Taht
  2012-09-02 18:17           ` Dave Taht
@ 2012-09-02 23:23           ` Eric Dumazet
  2012-09-03  0:18             ` Dave Taht
  1 sibling, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-09-02 23:23 UTC (permalink / raw)
  To: Dave Taht; +Cc: codel

On Sun, 2012-09-02 at 11:08 -0700, Dave Taht wrote:
> On Sat, Sep 1, 2012 at 5:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Fri, 2012-08-31 at 09:59 -0700, Dave Taht wrote:
> >
> >> I realize that 10GigE and datacenter host based work is sexy and fun,
> >> but getting stuff that runs well in today's 1-20Mbit environments is
> >> my own priority, going up to 100Mbit, with something that can be
> >> embedded in a SoC. The latest generation of SoCs all do QoS in
> >> hardware... badly.
> >
> > Maybe 'datacenter' word was badly chosen and you obviously jumped on it,
> > because it meant different things for you.
> 
> I am hypersensitive about optimizing for sub-ms problems when there are
> huge multi-second problems like in cable, wifi, and cellular. Recent paper:
> 
> http://conferences.sigcomm.org/sigcomm/2012/paper/cellnet/p1.pdf

Yes. Take a deep breath, please.

In France, we usually say: "Rome ne s'est pas faite en un jour" - Rome
wasn't built in a day.

It means you won't solve all your problems at once.

Step by step we improve things, and we discover new issues or
possibilities.

So dismissing one particular improvement because it doesn't solve
'bufferbloat in the known Universe' is not very helpful.





* Re: [Codel] fq_codel : interval servo
  2012-09-02 18:17           ` Dave Taht
@ 2012-09-02 23:28             ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2012-09-02 23:28 UTC (permalink / raw)
  To: Dave Taht; +Cc: codel

On Sun, 2012-09-02 at 11:17 -0700, Dave Taht wrote:
> On Sun, Sep 2, 2012 at 11:08 AM, Dave Taht <dave.taht@gmail.com> wrote:
> 
> In reviewing this mail I realized I used three different names
> for tcp_limit_output_bytes, corrected below...
> 

This TSQ limit is a per-flow limit, and a workaround for people still
using pfifo_fast, so that a single TCP elephant flow doesn't hurt too
much.

Our expectation is to replace pfifo_fast with an fq_codel-like qdisc.

At Linux Plumbers Conference, the last part of my presentation was to
say that I was working on this.

Don't waste your time on TSQ; use the real thing.





* Re: [Codel] fq_codel : interval servo
  2012-09-02 23:23           ` Eric Dumazet
@ 2012-09-03  0:18             ` Dave Taht
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Taht @ 2012-09-03  0:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: codel

On Sun, Sep 2, 2012 at 4:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2012-09-02 at 11:08 -0700, Dave Taht wrote:
>> On Sat, Sep 1, 2012 at 5:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Fri, 2012-08-31 at 09:59 -0700, Dave Taht wrote:
>> >
>> >> I realize that 10GigE and datacenter host based work is sexy and fun,
>> >> but getting stuff that runs well in today's 1-20Mbit environments is
>> >> my own priority, going up to 100Mbit, with something that can be
>> >> embedded in a SoC. The latest generation of SoCs all do QoS in
>> >> hardware... badly.
>> >
>> > Maybe 'datacenter' word was badly chosen and you obviously jumped on it,
>> > because it meant different things for you.
>>
>> I am hypersensitive about optimizing for sub-ms problems when there are
>> huge multi-second problems like in cable, wifi, and cellular. Recent paper:
>>
>> http://conferences.sigcomm.org/sigcomm/2012/paper/cellnet/p1.pdf
>
> Yes. Take a deep breath, please.
>
> In France, we use to say : "Rome ne s'est pas faite en un jour"
>
> It means you wont solve all your problems at once.

Comforting.

> Step by step we improve things, and we discover new issues or
> possibilities.
>
> So denying one particular improvement is not good because it doesnt
> solve the 'Bufferbloat in the known Universe' is not very helpful.

I didn't say that. I LOVE that you and lots of people are fiddling
with various ideas around the core
codel concepts, and clearly said I think it's neat...

My larger question, poorly put, was:

what would applying ideas to codel based on measured TCP srtt
do in the larger known universe as (for example) outlined by the data in that
paper, along with different TCPs of any flavor at any RTT?

Your original thought was to limit it to the local LAN somehow.


-- 
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"


* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-01 12:51       ` Eric Dumazet
@ 2012-09-04 15:10         ` Nandita Dukkipati
  2012-09-04 15:25           ` Jonathan Morton
  2012-09-04 15:34           ` Eric Dumazet
  0 siblings, 2 replies; 27+ messages in thread
From: Nandita Dukkipati @ 2012-09-04 15:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tomas Hruby, netdev, codel

The idea of using srtt as the interval makes sense to me if alongside we
also hash flows with similar RTTs into the same bucket. But with just the
change in interval, I am not sure how codel is expected to behave.

My understanding is: the interval (usually set to the worst-case expected
RTT) is used to measure the standing queue or the "bad" queue. Suppose
1ms and 100ms RTT flows get hashed to the same bucket; then the interval
with this patch will flip-flop between 1ms and 100ms. How is this
expected to measure a standing queue? In fact I think the 1ms flow may
end up measuring the burstiness or the "good" queue created by the
long-RTT flows, and this isn't desirable.


On Sat, Sep 1, 2012 at 5:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-08-31 at 18:37 -0700, Yuchung Cheng wrote:
>
>> Just curious: tp->srtt is a very rough estimator, e.g., Delayed-ACks
>> can easily add 40 - 200ms fuzziness. Will this affect short flows?
>
> Good point
>
> Delayed acks shouldnt matter, because they happen when flow had been
> idle for a while.
>
> I guess we should clamp the srtt to the default interval
>
> if (srtt)
>         q->cparams.interval = min(tcp_srtt_to_codel(srtt),
>                                   q->default_interval);
>
>
>


* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-04 15:10         ` Nandita Dukkipati
@ 2012-09-04 15:25           ` Jonathan Morton
  2012-09-04 15:39             ` Eric Dumazet
  2012-09-04 15:34           ` Eric Dumazet
  1 sibling, 1 reply; 27+ messages in thread
From: Jonathan Morton @ 2012-09-04 15:25 UTC (permalink / raw)
  To: Nandita Dukkipati; +Cc: netdev, codel, Tomas Hruby

I think that in most cases, a long RTT flow and a short RTT flow on the same interface means that the long RTT flow isn't bottlenecked here, and therefore won't ever build up a significant queue - and that means you would want to track over the shorter interval. Is that a reasonable assumption?

The key to knowledge is not to rely on others to teach you it. 

On 4 Sep 2012, at 18:10, Nandita Dukkipati <nanditad@google.com> wrote:

> The idea of using srtt as interval makes sense to me if alongside we
> also hash flows with similar RTTs into same bucket. But with just the
> change in interval, I am not sure how codel is expected to behave.
> 
> My understanding is: the interval (usually set to worst case expected
> RTT) is used to measure the standing queue or the "bad" queue. Suppose
> 1ms and 100ms RTT flows get hashed to same bucket, then the interval
> with this patch will flip flop between 1ms and 100ms. How is this
> expected to measure a standing queue? In fact I think the 1ms flow may
> land up measuring the burstiness or the "good" queue created by the
> long RTT flows, and this isn't desirable.
> 
> 
> On Sat, Sep 1, 2012 at 5:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Fri, 2012-08-31 at 18:37 -0700, Yuchung Cheng wrote:
>> 
>>> Just curious: tp->srtt is a very rough estimator, e.g., Delayed-ACks
>>> can easily add 40 - 200ms fuzziness. Will this affect short flows?
>> 
>> Good point
>> 
>> Delayed acks shouldnt matter, because they happen when flow had been
>> idle for a while.
>> 
>> I guess we should clamp the srtt to the default interval
>> 
>> if (srtt)
>>        q->cparams.interval = min(tcp_srtt_to_codel(srtt),
>>                                  q->default_interval);
>> 
>> 
>> 
> _______________________________________________
> Codel mailing list
> Codel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/codel


* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-04 15:10         ` Nandita Dukkipati
  2012-09-04 15:25           ` Jonathan Morton
@ 2012-09-04 15:34           ` Eric Dumazet
  2012-09-04 16:40             ` Dave Taht
  1 sibling, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-09-04 15:34 UTC (permalink / raw)
  To: Nandita Dukkipati; +Cc: Tomas Hruby, netdev, codel

On Tue, 2012-09-04 at 08:10 -0700, Nandita Dukkipati wrote:
> The idea of using srtt as interval makes sense to me if alongside we
> also hash flows with similar RTTs into same bucket. But with just the
> change in interval, I am not sure how codel is expected to behave.
> 
> My understanding is: the interval (usually set to worst case expected
> RTT) is used to measure the standing queue or the "bad" queue. Suppose
> 1ms and 100ms RTT flows get hashed to same bucket, then the interval
> with this patch will flip flop between 1ms and 100ms. How is this
> expected to measure a standing queue? In fact I think the 1ms flow may
> land up measuring the burstiness or the "good" queue created by the
> long RTT flows, and this isn't desirable.
> 

Well, how do things settle with a pure codel, mixing flows of very
different RTTs, then?

It seems there is a lot of resistance to the SFQ/fq_codel model because
of the probability of flows sharing a bucket.

So what about removing the stochastic thing and switching to a hash with
collision resolution?
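
To put rough numbers on those probabilities (assuming the default 1024
buckets and an ideally uniform hash, so only an approximation):

/* P(a given flow shares its bucket) = 1 - (1 - 1/1024)^(n-1)
 *
 *   n = 10 flows    ->  ~0.9 %
 *   n = 100 flows   ->  ~9.2 %
 *   n = 1024 flows  ->  ~63 %
 *
 * With a few dozen active flows the vast majority still get a private
 * bucket, but collisions are never entirely avoidable, which is what a
 * hash with collision resolution would change.
 */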




* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-04 15:25           ` Jonathan Morton
@ 2012-09-04 15:39             ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2012-09-04 15:39 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: Nandita Dukkipati, netdev, codel, Tomas Hruby

On Tue, 2012-09-04 at 18:25 +0300, Jonathan Morton wrote:
> I think that in most cases, a long RTT flow and a short RTT flow on
> the same interface means that the long RTT flow isn't bottlenecked
> here, and therefore won't ever build up a significant queue - and that
> means you would want to track over the shorter interval. Is that a
> reasonable assumption?
> 

This would be reasonable, but with a shorter interval we could drop
packets of the long-RTT flow sooner than expected.

That's because the drop_next value is set up based on the previous
packet, and not on the 'next packet'.

Re-evaluating drop_next at the right time would need more CPU cycles.
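
A small illustration of that concern (assuming the usual control law,
drop_next = now + interval / sqrt(count)):

/* Suppose a bucket is shared by a 100 ms flow and a 1 ms flow, and the
 * bucket is in the dropping state with count = 4:
 *
 *   interval = 100 ms  ->  next drop scheduled 100/sqrt(4) = 50  ms out
 *   interval =   1 ms  ->  next drop scheduled   1/sqrt(4) = 0.5 ms out
 *
 * If the servo switches the interval to ~1 ms because a short-RTT packet
 * happened to be dequeued, the long-RTT flow sharing those codel_vars
 * can be hit ~100x sooner than its own RTT would justify.
 */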




* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-04 15:34           ` Eric Dumazet
@ 2012-09-04 16:40             ` Dave Taht
  2012-09-04 16:54               ` Eric Dumazet
  2012-09-04 16:57               ` Eric Dumazet
  0 siblings, 2 replies; 27+ messages in thread
From: Dave Taht @ 2012-09-04 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tomas Hruby, Nandita Dukkipati, netdev, codel

On Tue, Sep 4, 2012 at 8:34 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-09-04 at 08:10 -0700, Nandita Dukkipati wrote:
>> The idea of using srtt as interval makes sense to me if alongside we
>> also hash flows with similar RTTs into same bucket. But with just the
>> change in interval, I am not sure how codel is expected to behave.
>>
>> My understanding is: the interval (usually set to worst case expected
>> RTT) is used to measure the standing queue or the "bad" queue. Suppose
>> 1ms and 100ms RTT flows get hashed to same bucket, then the interval
>> with this patch will flip flop between 1ms and 100ms. How is this
>> expected to measure a standing queue? In fact I think the 1ms flow may
>> land up measuring the burstiness or the "good" queue created by the
>> long RTT flows, and this isn't desirable.

Experiments would be good.

>
> Well, how things settle with a pure codel, mixing flows of very
> different RTT then ?

Elephants are shot statistically more often than mice.

> It seems there is a high resistance on SFQ/fq_codel model because of the
> probabilities of flows sharing a bucket.

I was going to do this in a separate email, because it is a little off-topic.

fq_codel has a standing queue problem, based on the fact that when a
queue empties, codel.h resets. This made sense for the single FIFO
codel but not multi-queued fq_codel. So after we hit X high rate
flows, target can never be achieved, even straining mightily, and we
end up with a standing queue again.

Easily seen with like 150 bidirectional flows at 10 or 100Mbit.

(as queues go, it's still a pretty good queue. And: I've fiddled with
various means of draining multi-queue behavior thus far, and they
ended up unstable/unfair)

> So what about removing the stochastic thing and switch to a hash with
> collision resolution ?

That was considered and discarded in the original SFQ paper as being too
computationally intensive (in 1993). It's worth revisiting.

http://www2.rdrop.com/~paulmck/scalability/paper/sfq.2002.06.04.pdf

>
>



-- 
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
with fq_codel!"


* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-04 16:40             ` Dave Taht
@ 2012-09-04 16:54               ` Eric Dumazet
  2012-09-04 16:57               ` Eric Dumazet
  1 sibling, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2012-09-04 16:54 UTC (permalink / raw)
  To: Dave Taht; +Cc: Tomas Hruby, Nandita Dukkipati, netdev, codel

On Tue, 2012-09-04 at 09:40 -0700, Dave Taht wrote:

> >
> > Well, how things settle with a pure codel, mixing flows of very
> > different RTT then ?
> 
> Elephants are shot statistically more often than mice.

This doesn't answer the question.

Long/short RTTs have nothing to do with elephants and mice.





* Re: [Codel] [RFC v2] fq_codel : interval servo on hosts
  2012-09-04 16:40             ` Dave Taht
  2012-09-04 16:54               ` Eric Dumazet
@ 2012-09-04 16:57               ` Eric Dumazet
  1 sibling, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2012-09-04 16:57 UTC (permalink / raw)
  To: Dave Taht; +Cc: Tomas Hruby, Nandita Dukkipati, netdev, codel

On Tue, 2012-09-04 at 09:40 -0700, Dave Taht wrote:

> fq_codel has a standing queue problem, based on the fact that when a
> queue empties, codel.h resets. This made sense for the single FIFO
> codel but not multi-queued fq_codel. So after we hit X high rate
> flows, target can never be achieved, even straining mightily, and we
> end up with a standing queue again.
> 
> Easily seen with like 150 bidirectional flows at 10 or 100Mbit.
> 
> (as queues go, it's still pretty good queue. And: I've fiddled with
> various means of draining multi-queue behavior thus far, and they
> ended up unstable/unfair)

I have no idea what you mean by "codel.h resets".

Please use small mails, one idea per mail.




