From: Dave Taht
To: Eric Dumazet
Cc: codel@lists.bufferbloat.net, Dave Täht
Date: Thu, 3 May 2012 15:38:22 -0700
Subject: Re: [Codel] [PATCH] Preliminary codel implementation
In-Reply-To: <1336069030.3752.44.camel@edumazet-glaptop>
References: <1336067533-16923-1-git-send-email-dave.taht@bufferbloat.net> <1336067533-16923-2-git-send-email-dave.taht@bufferbloat.net> <1336069030.3752.44.camel@edumazet-glaptop>
List-Id: CoDel AQM discussions
On Thu, May 3, 2012 at 11:17 AM, Eric Dumazet wrote:
> On Thu, 2012-05-03 at 10:52 -0700, Dave Täht wrote:
>> ---
>>  include/linux/pkt_sched.h |   29 +++++
>>  net/sched/Kconfig         |   11 ++
>>  net/sched/Makefile        |    1 +
>>  net/sched/sch_codel.c     |  296 ++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 337 insertions(+)
>>  create mode 100644 net/sched/sch_codel.c
>>
>
> Hi Dave.
>
> Was this tested or is just a draft ?

It compiled, seemed logically correct (since disproven), passed
packets and didn't crash. :) I was happy with the night's work for all
of 10 minutes.

I do massively appreciate your feedback and advice, as these are areas
of the kernel that are opaque to most.

>
>> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
>> index 0d5b793..c21b720 100644
>> --- a/include/linux/pkt_sched.h
>> +++ b/include/linux/pkt_sched.h
>> @@ -633,4 +633,33 @@ struct tc_qfq_stats {
>>         __u32 lmax;
>>  };
>>
>> +/* CODEL */
>> +
>> +enum {
>> +        TCA_CODEL_UNSPEC,
>> +        TCA_CODEL_PARMS,
>> +        TCA_CODEL_TARGET,
>> +        TCA_CODEL_DEPTH,
>> +        TCA_CODEL_MINBYTES,
>> +        TCA_CODEL_INTERVAL,
>> +        __TCA_CODEL_MAX
>> +};
>> +
>> +#define TCA_CODEL_MAX        (__TCA_CODEL_MAX - 1)
>> +#define TC_CODEL_ECN 1
>> +
>> +struct tc_codel_qopt {
>> +        __u32 flags;    /* flags (e.g. ecn) */
>> +        __u32 target;   /* max delay, in us */
>> +        __u32 depth;    /* queue depth in packets */
>> +        __u32 minbytes; /* MTU (usually) */
>> +        __u32 interval; /* Sliding min time window width (us) */
>> +};
>> +
>> +struct tc_codel_stats {
>> +        __u64 drops;
>> +        __u64 marks;
>> +};
>> +
>> +
>>  #endif
>> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
>> index 2590e91..8106c42 100644
>> --- a/net/sched/Kconfig
>> +++ b/net/sched/Kconfig
>> @@ -250,6 +250,17 @@ config NET_SCH_QFQ
>>
>>         If unsure, say N.
>>
>> +config NET_SCH_CODEL
>> +        tristate "Controlled Delay AQM (CODEL)"
>> +        help
>> +          Say Y here if you want to use the Controlled Delay (CODEL)
>> +          packet scheduling algorithm.
>> +
>> +          To compile this driver as a module, choose M here: the module
>> +          will be called sch_codel.
>> +
>> +          If unsure, say N.
>> +
>>  config NET_SCH_INGRESS
>>         tristate "Ingress Qdisc"
>>         depends on NET_CLS_ACT
>> diff --git a/net/sched/Makefile b/net/sched/Makefile
>> index dc5889c..41130b5 100644
>> --- a/net/sched/Makefile
>> +++ b/net/sched/Makefile
>> @@ -36,6 +36,7 @@ obj-$(CONFIG_NET_SCH_DRR)   += sch_drr.o
>>  obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o
>>  obj-$(CONFIG_NET_SCH_CHOKE)  += sch_choke.o
>>  obj-$(CONFIG_NET_SCH_QFQ)    += sch_qfq.o
>> +obj-$(CONFIG_NET_SCH_CODEL)  += sch_codel.o
>>
>>  obj-$(CONFIG_NET_CLS_U32)    += cls_u32.o
>>  obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
>> diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
>> new file mode 100644
>> index 0000000..159df47
>> --- /dev/null
>> +++ b/net/sched/sch_codel.c
>> @@ -0,0 +1,296 @@
>> +/*
>> + * net/sched/sch_codel.c        A Codel implementation
>> + *
>> + *           This program is free software; you can redistribute it and/or
>> + *           modify it under the terms of the GNU General Public License
>> + *           as published by the Free Software Foundation; either version
>> + *           2 of the License, or (at your option) any later version.
>> + *
>> + * Based on ns2 simulation code presented by Kathie Nichols
>> + * Authors:  Dave Täht
>> + */
>> +
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +
>> +#define MS2TIME(a) (ns_to_ktime( (u64) a*1000000))
>> +#define DEFAULT_CODEL_DEPTH 1000
>> +
>> +/* Per-queue state (codel_queue_t instance variables) */
>> +
>> +struct codel_sched_data {
>> +        u32 flags;
>> +        u32 minbytes;
>> +        u32 count; /* packets dropped since we went into drop state */
>> +        bool dropping;  /* 1 if in drop state, might just add to flags */
>> +        ktime_t target;
>> +        ktime_t interval;
>> +        /* time to declare above q->target (0 if below) */
>> +        ktime_t first_above_time;
>> +        ktime_t drop_next; /* time to drop next packet */
>> +};
>> +
>> +struct codel_skb_cb {
>> +        ktime_t enqueue_time;
>> +        char data[16];
>> +};
>> +
>> +static inline struct codel_skb_cb *get_codel_cb(const struct sk_buff *skb)
>> +{
>> +        return (struct codel_skb_cb *)skb->cb;
>> +}
>
> You cant do that. You MUST keep the generic qdisc cb unchanged
>

I would prefer to have the timestamp on entry to the overall system,
if there was some way to leverage the skb_shared_timestamps->syststamp
facility. But, otherwise noted and corrected.
> static struct codel_skb_cb *get_codel_cb(const struct sk_buff *skb)
> {
>         qdisc_cb_private_validate(skb, sizeof(struct codel_skb_cb));
>         return (struct codel_skb_cb *)qdisc_skb_cb(skb)->data;
> }
>
>
>> +
>> +static inline ktime_t get_enqueue_time(const struct sk_buff *skb) {
>> +        struct codel_skb_cb *cb = get_codel_cb(skb);
>
>         const struct codel_skb_cb *cb = get_codel_cb(skb);
>
>> +        return cb->enqueue_time;
>
>         or : return get_codel_cb(skb)->enqueue_time
>
>> +}
>> +

OK

>> +static inline ktime_t set_enqueue_time(struct sk_buff *skb, ktime_t t) {
>> +        struct codel_skb_cb *cb = get_codel_cb(skb);
>> +        cb->enqueue_time = t;
>
>         get_codel_cb(skb)->enqueue_time = t;
>
>> +        return t;
>> +}
>> +

I read that and my eyes cross.

>> +static inline ktime_t control_law(const struct codel_sched_data *q, ktime_t t)
>> +{
>> +        t.tv64 = t.tv64 + q->interval.tv64 / int_sqrt(q->count);
>
> thats a killer for lot of machines... (divide plus the int_sqrt())

If you think this is high overhead compared to everything else going
on in a box, I'm genuinely puzzled... 3 pages of function calls have
to be traversed to get to here...

I am tending to agree now that the interval needn't be 64 bit.

>> +        return t;
>> +}
>> +
>> +/*
>> +static int codel_prob_mark(const struct codel_sched_data *q)
>> +{
>> +        return q->flags & TC_CODEL_ECN;
>
> Not clear if ECN plays well with CoDel at all.

Not clear to me either; I wanted to leave room in the API to play
with it later.
>> +}
>> +*/
>> +
>> + /* wrappers for ultimate statistics collection */
>> +
>> +static int codel_drop(struct sk_buff *skb, struct Qdisc *sch) {
>> +        return qdisc_drop(skb, sch);
>> +}
>> +
>> +/*
>> +static int codel_queue_drop(struct Qdisc *sch) {
>> +        return qdisc_drop(skb, sch);
>> +}
>> +*/
>

Aside from drops/marks as statistics, I was wanting something like
'avg closure rate' and 'dropping intervals'. To try and explain the
first I'd have to refer to a 1/10th brilliant paper about the
microscopic behavior of tcp that Andrew and I were discussing that I
can't find right now. Andrew?

>> +struct sk_buff *codel_dequeue_head(struct Qdisc *sch) {
>> +        return qdisc_dequeue_head(sch);
>> +}
>> +
>> +bool should_drop(struct sk_buff *skb, struct Qdisc *sch, ktime_t now)
>> +{
>> +        struct codel_sched_data *q = qdisc_priv(sch);
>> +        bool drop = 0;
>
>         bool drop = false;
>
>> +        if (skb == NULL) {
>> +                q->first_above_time.tv64 = 0;
>> +        } else {
>> +                ktime_t sojourn_time = ktime_sub(now, get_enqueue_time(skb));
>> +                if (sojourn_time.tv64 <
>> +                    q->target.tv64 || sch->qstats.backlog < q->minbytes) {
>> +/* went below so we'll stay below for at least q->interval */
>> +                        q->first_above_time.tv64 = 0;
>> +                } else {
>> +                        if (q->first_above_time.tv64 == 0) {
>> +
>> +/* just went above from below. If we stay above
>> + * for at least q->interval we'll say it's ok to drop
>> + */
>> +                                q->first_above_time =
>> +                                        ktime_add(now, q->interval);
>> +                        } else if (now.tv64 >= q->first_above_time.tv64) {
>> +                                drop = 1;
>> +                        }
>> +                }
>> +        }
>> +        return drop;
>> +}
>> +
>> +static struct sk_buff *codel_dequeue(struct Qdisc *sch)
>> +{
>> +        struct codel_sched_data *q = qdisc_priv(sch);
>> +        struct sk_buff *skb = codel_dequeue_head(sch);
>> +        ktime_t now;
>> +        bool drop;
>
> missing newline

KernelStyle is a newline after variable declarations? ok... thx

>
>> +        if (skb == NULL) {
>> +                q->dropping = 0;
>> +                q->first_above_time.tv64 = 0;
>> +                return skb;
>> +        }
>> +        now = ktime_get();
>> +        drop = should_drop(skb, sch, now);
>> +        if (q->dropping) {
>> +                if (drop) {
>> +/* sojourn time below target - leave dropping state */
>> +                        q->dropping = 0;
>> +                } else if (now.tv64 >= q->drop_next.tv64) {
>> +/*
>> + * It's time for the next drop. Drop the current packet and dequeue the next.
>> + * The dequeue might take us out of dropping state. If not, schedule the
>> + * next drop. A large backlog might result in drop rates so high that the next
>> + * drop should happen now, hence the 'while' loop.
>> + */
>> +                        while (now.tv64 >= q->drop_next.tv64 && q->dropping) {
>> +                                codel_drop(skb, sch);
>> +                                q->count++;
>> +                                skb = codel_dequeue_head(sch);
>> +                                if (should_drop(skb, sch, now)) {
>> +/* leave dropping state */
>> +                                        q->dropping = 0;
>> +                                } else {
>> +/* and schedule the next drop */
>> +                                        q->drop_next =
>> +                                                control_law(q, q->drop_next);
>> +                                }
>> +                        }
>> +                } else if (drop &&
>> +                           ((now.tv64 - q->drop_next.tv64 <
>> +                             16 * q->interval.tv64) ||
>> +                            (now.tv64 - q->first_above_time.tv64 >=
>> +                             2 * q->interval.tv64))) {
>> +                        codel_drop(skb, sch);
>> +                        skb = codel_dequeue_head(sch);
>> +                        /* engage state machine */
>> +                        drop = should_drop(skb, sch, now);
>> +                        q->dropping = 1;
>> +
>> +/* if min went above target close to when we last went below it
>> + * assume that the drop rate that controlled the queue on the
>> + * last cycle is a good starting point to control it now.
>> + */
>> +                        if (now.tv64 - q->drop_next.tv64 <
>> +                            16 * q->interval.tv64) {
>
> doing ktime_t compares is tricky. Here its not done properly.

yea, um, I screwed this up globally. Rethinking now...

> Are you sure we need 64bit and ns resolution ?

No. However, there are several things in my head that have
overcomplicated things. A typical 5 ms target for drop is a LOT of
packets on 10GigE. I am trying to think ahead a decade or two or
three, and to a tiny equivalent of DCB in the 1000s of processors
connected directly to your brain implant.

> IMHO 'unsigned long' values are the best, for 32bit arches.
>
> CoDel paper mentions ms resolution...
> Even if you use us resolution, we can have about 2000 secs of window in
> an unsigned long.

I wasn't as concerned about getting into the oort cloud as about
handling really fast gear.

>
>> +                                int c = q->count - 1;
>> +                                q->count = c < 1 ? 1 : c;
>> +                        } else {
>> +                                q->count = 1;
>> +                        }
>> +                        q->drop_next = control_law(q, now);
>> +                }
>> +        }
>> +        return skb;
>> +}
>> +
>> +
>> +static int codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>> +{
>
> I am worried about the latency spike in dequeue(), if we need to drop
> hundred of packets (while device is idle... If BQL is in use... its
> really unfortunate)

I'm concerned about it too; I don't think anything else currently
loops in that call. As best as I recall, that's under rcu protection?

>
> Could we try to check if we can drop right now one or two packets that
> were queued in previous ->queue() calls, to smooth the drop processing ?
I have been speculating wildly about ways to shorten the tail of
codel's compensation, and (especially) only shoot the elephants, not
the mice, using fq techniques, but I have to wait to see this stuff
run as-is first. Do you have a suggestion? I'm never big on dropping
more than one packet in a flow at a time.

>> +        if (likely(skb_queue_len(&sch->q) < sch->limit)) {
>> +                set_enqueue_time(skb, ktime_get());
>> +                return qdisc_enqueue_tail(skb, sch);
>> +        }
>> +        return qdisc_reshape_fail(skb, sch);
>> +}
>> +
>> +static int codel_change(struct Qdisc *sch, struct nlattr *opt)
>> +{
>> +        struct codel_sched_data *q = qdisc_priv(sch);
>> +        struct tc_codel_qopt *ctl = nla_data(opt);
>> +        unsigned int qlen;
>> +
>> +        if (opt->nla_len < nla_attr_size(sizeof(*ctl)))
>> +                return -EINVAL;
>> +        if (ctl->depth && (ctl->depth < 2 || ctl->depth > 65536))
>> +                return -EINVAL;
>> +        if (ctl->minbytes && (ctl->minbytes < 64 || ctl->minbytes > 65536))
>> +                return -EINVAL;
>> +        sch_tree_lock(sch);
>> +        if (ctl->minbytes) q->minbytes = ctl->minbytes;
>
>         if (cond)   // NEWLINE
>                 statement;
>
>> +        if (ctl->flags) q->flags = ctl->flags;
>> +        if (ctl->target) q->target = ns_to_ktime((u64) ctl->target * 1000);
>> +        if (ctl->interval) q->interval =
>> +                        ns_to_ktime((u64) ctl->interval * 1000);
>> +
>> +        /* something saner than this for depth is probably needed */
>> +
>> +        if (ctl->depth) sch->limit = ctl->depth;
>> +        qlen = sch->q.qlen;
>> +//      while (sch->q.qlen > ctl->depth)
>> +//              codel_drop(skb, sch);
>> +//      qdisc_tree_decrease_qlen(sch, qlen - sch->q.qlen); //?
I don't get the how/when/why of this commented-out bit. Can you clue
me up?

>> +        q->drop_next.tv64 = q->first_above_time.tv64 = 0;
>> +        q->dropping = 0; /* exit dropping state */
>> +        sch_tree_unlock(sch);
>> +        return 0;
>> +}
>> +
>> +static int codel_init(struct Qdisc *sch, struct nlattr *opt)
>> +{
>> +        struct codel_sched_data *q = qdisc_priv(sch);
>> +        q->target = MS2TIME(5);
>> +        sch->limit = DEFAULT_CODEL_DEPTH;
>> +        q->minbytes = 1500;
>
>         Please dont hard code this value, use ETH_DATA_LEN, or better :
>
>         psched_mtu(qdisc_dev(sch));

Done.

>
>> +        q->interval = MS2TIME(100);
>> +        if (opt) {
>> +                int err = codel_change(sch, opt);
>> +                if (err)
>> +                        return err;
>> +        }
>> +
>> +/*      if (sch->limit >= 1)
>> +                sch->flags |= TCQ_F_CAN_BYPASS;
>> +        else
>> +                sch->flags &= ~TCQ_F_CAN_BYPASS;
>> +*/
>> +        return 0;
>> +}

The use of the above functionality, which I found in other qdiscs, was
unclear to me. In particular, while it seems codel could run with an
unlimited queue just fine, I was uncomfortable doing that as a first
pass.

>> +
>> +static int codel_dump(struct Qdisc *sch, struct sk_buff *skb)
>> +{
>> +        struct codel_sched_data *q = qdisc_priv(sch);
>> +        struct tc_codel_qopt opt;
>> +        opt.target = (u32) ktime_to_us(q->target);
>> +        opt.interval = (u32) ktime_to_us(q->interval);
>> +        opt.depth = sch->limit;
>> +        opt.flags = q->flags;
>> +        NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
>
> NLA_PUT() doesnt exist anymore in net-next tree, please rebase on
> net-next

I saw it was transitioning, but I would prefer to make this stable on
something I have deployed, get more people running it day to day, and
then submit to net-next.

>> +        return skb->len;
>> +
>> +nla_put_failure:
>> +//      nlmsg_trim(skb, b);
>> +        return -1;
>> +}
>> +
>> +static void
>> +codel_reset(struct Qdisc *sch)
>> +{
>> +        struct sk_buff *skb;
>> +
>> +        while ((skb = codel_dequeue(sch)) != NULL)
>> +                kfree_skb(skb);
>
>
>> +}
>> +
>> +struct Qdisc_ops codel_qdisc_ops __read_mostly = {
>> +        .id             =       "codel",
>> +        .priv_size      =       sizeof(struct codel_sched_data),
>> +        .enqueue        =       codel_enqueue,
>> +        .dequeue        =       codel_dequeue,
>> +        .peek           =       qdisc_peek_head,
>> +/*      .drop           =       codel_queue_drop, */

I'm puzzled as to what this is supposed to really do and who its
callers are.

>> +        .init           =       codel_init,
>> +        .reset          =       codel_reset,
>> +        .change         =       codel_change,
>> +        .dump           =       codel_dump,
>> +        .owner          =       THIS_MODULE,
>> +};
>> +EXPORT_SYMBOL(codel_qdisc_ops);
>> +
>> +static int __init codel_module_init(void)
>> +{
>> +        return register_qdisc(&codel_qdisc_ops);
>> +}
>> +static void __exit codel_module_exit(void)
>> +{
>> +        unregister_qdisc(&codel_qdisc_ops);
>> +}
>> +module_init(codel_module_init)
>> +module_exit(codel_module_exit)
>> +MODULE_LICENSE("GPL");
>> +
>> _______________________________________________
>> Codel mailing list
>> Codel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/codel

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://www.bufferbloat.net