* [Cake] act_connmark + dscp
@ 2019-03-05 14:35 Kevin Darbyshire-Bryant
2019-03-06 15:21 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-05 14:35 UTC (permalink / raw)
To: cake
[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]
Before I go too far down this road (and to avoid the horror of actually trying to code it) here’s what I’m trying to achieve.
act_connmark + dscp is designed to copy a DSCP code to/from conntrack mark. It uses 8 bits of the mark field, currently the most significant byte.
Bits 31-26: DSCP
Bit 25: Spare/Future
Bit 24: Valid DSCP set
The valid bit is set when the ‘getdscp’ function has written a DSCP value into the conntrack (& hence skb) mark. This allows us & other skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that a DSCP value has been placed in the field. We cannot simply use a non-zero DSCP because zero is a valid DSCP.
’setdscp’ restores the DSCP field from the stored DSCP if valid bit is set.
Modes:
getdscp - If ct->mark doesn’t contain a valid DSCP, copy the DSCP field from the current skb (if valid for that skb ie. ip4/ip6) and set the valid bit.
bleach/aka wash - As part of the getdscp routine, optionally null out the DSCP bits. We need a bleach routine because we need to see the DSCP bits intact so we can copy them to the mark. If cake washes them then we don’t get to see them, defeating the entire point.
setdscp - if skb->mark valid bit set, copy the skb->mark stored DSCP to the diffserv bits of the current skb (if appropriate)
I’ve attached a *VERY* rough starting point for how far I’ve got - not even compiled yet… I doubt it does. Not a natural (kernel) programmer so everything is a struggle.
Two things I’m falling over:
1) Really not sure how to get parameters in to set the ‘getdscp/setdscp/wash’ options.
2) Is it possible to have different tc actions on an interface and a mirred fed ifb interface? That could be a real killer actually. We need to do different things on egress vs ingress.
3) (Spanish Inquisition moment) The original connmark code goes and does a nf_ct_get_tuplepr if the nf_ct_get fails. Why would the nf_ct_get fail?
Cheers,
Kevin D-B
012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
[-- Attachment #2: act_connmark_hack.patch --]
[-- Type: application/octet-stream, Size: 1880 bytes --]
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 8475913f2070..d2acb63d1e5d 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -62,6 +62,52 @@ static int tcf_connmark_act(struct sk_buff *skb, const struct tc_action *a,
c = nf_ct_get(skb, &ctinfo);
if (c) {
+ if (getdscp & proto) {
+ if (!(c->mark & 0x01000000)) {
+ /* store the DSCP bits into c->mark */
+ switch (proto) {
+ case NFPROTO_IPV4:
+ dscp = ipv4_get_dsfield(ip_hdr(skb));
+ break;
+ case NFPROTO_IPV6:
+ dscp = ipv6_get_dsfield(ip_hdr(skb));
+ break;
+ default:
+ dscp = 0;
+ break;
+ }
+ c->mark &= 0x00ffffff;
+ c->mark |= (0x01 | dscp) << 24;
+ }
+ if (bleach) {
+ switch (proto) {
+ case NFPROTO_IPV4:
+ if (ipv4_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK)
+ ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+ break;
+ case NFPROTO_IPV6:
+ if (ipv6_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK)
+ ipv6_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+ break;
+ default:
+ break;
+ }
+ }
+ } else if (setdscp & proto && (c->mark & 0x01000000)) {
+ /* restore the DSCP bits from c->mark into diffserv */
+ switch (proto) {
+ case NFPROTO_IPV4:
+ if (ipv4_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK) != c->mark >> 24 & ~INET_ECN_MASK)
+ ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, c->mark >> 24 & ~INET_ECN_MASK);
+ break;
+ case NFPROTO_IPV6:
+ if (ipv6_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK) != c->mark >> 24 & ~INET_ECN_MASK)
+ ipv6_change_dsfield(ip_hdr(skb), INET_ECN_MASK, c->mark >> 24 & ~INET_ECN_MASK);
+ break;
+ default:
+ break;
+ }
+ }
skb->mark = c->mark;
/* using overlimits stats to count how many packets marked */
ca->tcf_qstats.overlimits++;
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-05 14:35 [Cake] act_connmark + dscp Kevin Darbyshire-Bryant
@ 2019-03-06 15:21 ` Toke Høiland-Jørgensen
2019-03-06 16:47 ` John Sager
2019-03-06 18:40 ` Kevin Darbyshire-Bryant
0 siblings, 2 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-06 15:21 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant, cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
> Before I go too far down this road (and to avoid the horror of
> actually trying to code it) here’s what I’m trying to achieve.
>
>
> act_connmark + dscp is designed to copy a DSCP code to/from conntrack mark. It uses 8 bits of the mark field, currently the most significant byte.
>
> Bits 31-26: DSCP
> Bit 25: Spare/Future
> Bit 24: Valid DSCP set
>
> The valid bit is set when the ‘getdscp’ function has written a DSCP
> value into the conntrack (& hence skb) mark. This allows us & other
> skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that
> a DSCP value has been placed in the field. We cannot simply use a
> non-zero DSCP because zero is a valid DSCP.
If someone installs the action, the field is supposedly always copied;
so why do we need another flag?
Instead, I'd suggest adding user options with mask and shift to make it
possible to configure where in the mark field the values are stored...
> ’setdscp’ restores the DSCP field from the stored DSCP if valid bit is set.
>
>
> Modes:
>
> getdscp - If ct->mark doesn’t contain a valid DSCP, copy the DSCP
> field from the current skb (if valid for that skb ie. ip4/ip6) and set
> the valid bit.
>
> bleach/aka wash - As part of the getdscp routine, optionally null out
> the DSCP bits. We need a bleach routine because we need to see the
> DSCP bits intact so we can copy them to the mark. If cake washes them
> then we don’t get to see them, defeating the entire point.
>
> setdscp - if skb->mark valid bit set, copy the skb->mark stored DSCP
> to the diffserv bits of the current skb (if appropriate)
TC actions are executed before the qdisc, so I don't think bleach/wash
is really needed? Also, it can already be achieved with the pedit
action. So I think 'get' and 'set' are enough.
> I’ve attached a *VERY* rough starting point for how far I’ve got - not
> even compiled yet… I doubt it does. Not a natural (kernel) programmer
> so everything is a struggle.
>
> Two things I’m falling over:
>
> 1) Really not sure how to get parameters in to set the
> ‘getdscp/setdscp/wash’ options.
You'll need to add new netlink parameter values and parse them like we
do in cake (after the existing struct-copy-based option passing). I
think a single option that can be set to *either* "get" *or* "set" will
be simpler, then you don't have to worry about what happens if someone
enables both.
> 2) Is it possible to have different tc actions on an interface and a
> mirred fed ifb interface? That could be a real killer actually. We
> need to do different things on egress vs ingress.
Yes, you can chain actions, and they are per-interface.
> 3) (Spanish Inquisition moment) The original connmark code goes and
> does a nf_ct_get_tuplepr if the nf_ct_get fails. Why would the
> nf_ct_get fail?
Presumably because nf_ct_get just reads a field in the skb; which may
not have been set yet in some cases, necessitating a full lookup.
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-06 15:21 ` Toke Høiland-Jørgensen
@ 2019-03-06 16:47 ` John Sager
2019-03-07 9:50 ` Toke Høiland-Jørgensen
2019-03-06 18:40 ` Kevin Darbyshire-Bryant
1 sibling, 1 reply; 19+ messages in thread
From: John Sager @ 2019-03-06 16:47 UTC (permalink / raw)
To: cake
You should be able to infer get (egress action) and set (ingress action)
from which qdisc it is attached to. Is it possible to do that when the
action function is invoked (from parameters of tcf_connmark_act() or does it
need a static flag set when tc attaches the action?
In act_conntrack.c there is another case further down where ct->mark is
copied to skb->mark so you need to cover both cases (inline function?).
John
On 06/03/2019 15:21, Toke Høiland-Jørgensen wrote:
> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>
>> Before I go too far down this road (and to avoid the horror of
>> actually trying to code it) here’s what I’m trying to achieve.
>>
>>
>> act_connmark + dscp is designed to copy a DSCP code to/from conntrack mark. It uses 8 bits of the mark field, currently the most significant byte.
>>
>> Bits 31-26: DSCP
>> Bit 25: Spare/Future
>> Bit 24: Valid DSCP set
>>
>> The valid bit is set when the ‘getdscp’ function has written a DSCP
>> value into the conntrack (& hence skb) mark. This allows us & other
>> skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that
>> a DSCP value has been placed in the field. We cannot simply use a
>> non-zero DSCP because zero is a valid DSCP.
>
> If someone installs the action, the field is supposedly always copied;
> so why do we need another flag?
>
> Instead, I'd suggest adding user options with mask and shift to make it
> possible to configure where in the mark field the values are stored...
>
>> ’setdscp’ restores the DSCP field from the stored DSCP if valid bit is set.
>>
>>
>> Modes:
>>
>> getdscp - If ct->mark doesn’t contain a valid DSCP, copy the DSCP
>> field from the current skb (if valid for that skb ie. ip4/ip6) and set
>> the valid bit.
>>
>> bleach/aka wash - As part of the getdscp routine, optionally null out
>> the DSCP bits. We need a bleach routine because we need to see the
>> DSCP bits intact so we can copy them to the mark. If cake washes them
>> then we don’t get to see them, defeating the entire point.
>>
>> setdscp - if skb->mark valid bit set, copy the skb->mark stored DSCP
>> to the diffserv bits of the current skb (if appropriate)
>
> TC actions are executed before the qdisc, so I don't think bleach/wash
> is really needed? Also, it can already be achieved with the pedit
> action. So I think 'get' and 'set' are enough.
>
>> I’ve attached a *VERY* rough starting point for how far I’ve got - not
>> even compiled yet… I doubt it does. Not a natural (kernel) programmer
>> so everything is a struggle.
>>
>> Two things I’m falling over:
>>
>> 1) Really not sure how to get parameters in to set the
>> ‘getdscp/setdscp/wash’ options.
>
> You'll need to add new netlink parameter values and parse them like we
> do in cake (after the existing struct-copy-based option passing). I
> think a single option that can be set to *either* "get" *or* "set" will
> be simpler, then you don't have to worry about what happens if someone
> enables both.
>
>> 2) Is it possible to have different tc actions on an interface and a
>> mirred fed ifb interface? That could be a real killer actually. We
>> need to do different things on egress vs ingress.
>
> Yes, you can chain actions, and they are per-interface.
>
>> 3) (Spanish Inquisition moment) The original connmark code goes and
>> does a nf_ct_get_tuplepr if the nf_ct_get fails. Why would the
>> nf_ct_get fail?
>
> Presumably because nf_ct_get just reads a field in the skb; which may
> not have been set yet in some cases, necessitating a full lookup.
>
> -Toke
> _______________________________________________
> Cake mailing list
> Cake@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cake
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-06 15:21 ` Toke Høiland-Jørgensen
2019-03-06 16:47 ` John Sager
@ 2019-03-06 18:40 ` Kevin Darbyshire-Bryant
2019-03-07 10:10 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-06 18:40 UTC (permalink / raw)
To: cake
[-- Attachment #1: Type: text/plain, Size: 2970 bytes --]
> On 6 Mar 2019, at 15:21, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>
>> Before I go too far down this road (and to avoid the horror of
>> actually trying to code it) here’s what I’m trying to achieve.
>>
>>
>> act_connmark + dscp is designed to copy a DSCP code to/from conntrack mark. It uses 8 bits of the mark field, currently the most significant byte.
>>
>> Bits 31-26: DSCP
>> Bit 25: Spare/Future
>> Bit 24: Valid DSCP set
>>
>> The valid bit is set when the ‘getdscp’ function has written a DSCP
>> value into the conntrack (& hence skb) mark. This allows us & other
>> skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that
>> a DSCP value has been placed in the field. We cannot simply use a
>> non-zero DSCP because zero is a valid DSCP.
>
> If someone installs the action, the field is supposedly always copied;
> so why do we need another flag?
I’m trying to limit the number of times expensive iptables mangle rules have to run.
Egress path:
Packet comes in (internal to device or forward)
iptables mangle - check fwmark ’set’ bit
if not set
jump to a possibly extensive set of rules that mangle the DSCP
else
do nothing
Packet arrives at act_connmark dscpset
looks at fwmark ’set’ bit
if not set
copy DSCP to fwmark & set the ’set’ bit.
else
do nothing
cake gets hold of it - selects a tin based on fwmark contained DSCP
Do the routine again for the next packet in the same connection and you’ll skip the iptables mangle rules but still have cake classify based on the fwmark stored DSCP. Without that flag you’ll have to run the iptables mangle rules for every packet and update the fwmark too.
I personally think that cake should also have the fwmark/DSCP decode routine on ingress. e.g.
Ingress
Packet arrives
act_connmark restores the fwmark
if fwmark/dscp set then optionally restores diffserv from fwmark
Cake looks for fwmark/dscp set bit
if true
use fwmark DSCP for tin select
else
use diffserv field as before
Cake possibly washes
Without the ’set’ bit, act_connmark has to restore the diffserv field on every (ip) packet and cake possibly has to wash it out again.
The reality is that I enjoyed doing this in the cake codebase. I cannot say the same for act_connmark in fact I hate it so much I’m stopping. The mental effort for a non-programmer and more importantly a non-kernel programmer is exhausting & I’m completely disillusioned. I really need to concentrate on the job that means I can pay the mortgage, which isn’t bashing my head against the kernel.
Anyway 4 files - 2 are patches against current cake & tc and a ‘my_layer_cake’ qos script that’s ‘fwmark/cake’ aware. 4th file is the start of a hack on act_connmark. Do with them as you will, I never want to see the last one again.
[-- Attachment #2: 0001-Automagically-use-update-DSCP-contained-in-fwmark.patch --]
[-- Type: application/octet-stream, Size: 6679 bytes --]
From 2bf78196b242226f8778695fea922eb5341c0f21 Mon Sep 17 00:00:00 2001
From: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Date: Fri, 1 Mar 2019 11:35:49 +0000
Subject: [PATCH] Automagically use & update DSCP contained in fwmark
and then I had a really bad idea!
Which is - if we're in fwmark & egress mode & we don't have a mark so
we've fallen through to DSCP, set a mark based on the DSCP.
That way, reply packets will be automagically fw marked for us on their
return as tc act connmark will have restored the mark which we can then
use.
With the 'icing' keyword, copy the dscp coded into the fwmark into the
real packets (the opposite of wash) in ingress mode.
The fwmark encodes the dscp thus (we steal the top byte)
ct->mark &= 0x00ffffff; - mask out top byte
ct->mark |= (0x01 | dscp) << 24; or in the DSCP (held in top 6 bits) and
set lowest bit as a 'cake set this fwmark' flag
options from tc:
fwmark - enable fwmark usage
getdscp - update conntrack mark with the DSCP if no existing CAKE mark
typically used with egress qdisc
setdscp - update DSCP on packets from CAKE mark typically used with
ingress qdisc
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
---
pkt_sched.h | 2 +
sch_cake.c | 112 ++++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 94 insertions(+), 20 deletions(-)
diff --git a/pkt_sched.h b/pkt_sched.h
index a2f570c..e3c2ac4 100644
--- a/pkt_sched.h
+++ b/pkt_sched.h
@@ -879,6 +879,8 @@ enum {
TCA_CAKE_ACK_FILTER,
TCA_CAKE_SPLIT_GSO,
TCA_CAKE_FWMARK,
+ TCA_CAKE_GETDSCP,
+ TCA_CAKE_SETDSCP,
__TCA_CAKE_MAX
};
#define TCA_CAKE_MAX (__TCA_CAKE_MAX - 1)
diff --git a/sch_cake.c b/sch_cake.c
index 052ac05..e27a8b1 100644
--- a/sch_cake.c
+++ b/sch_cake.c
@@ -269,7 +269,9 @@ enum {
CAKE_FLAG_INGRESS = BIT(2),
CAKE_FLAG_WASH = BIT(3),
CAKE_FLAG_SPLIT_GSO = BIT(4),
- CAKE_FLAG_FWMARK = BIT(5)
+ CAKE_FLAG_FWMARK = BIT(5), /* USEFWMARK */
+ CAKE_FLAG_GETDSCP = BIT(6), /* STOREDSCP */
+ CAKE_FLAG_SETDSCP = BIT(7) /* RESTOREDSCP */
};
/* COBALT operates the Codel and BLUE algorithms in parallel, in order to
@@ -1616,19 +1618,36 @@ static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
return idx + (tin << 16);
}
-static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
+static void cake_update_diffserv(struct sk_buff *skb, u8 dscp)
+{
+ switch (skb->protocol) {
+ case htons(ETH_P_IP):
+ if ((ipv4_get_dsfield(ip_hdr(skb)) & ~INET_ECN_MASK) != dscp)
+ ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, dscp);
+ break;
+ case htons(ETH_P_IPV6):
+ if ((ipv6_get_dsfield(ipv6_hdr(skb)) & ~INET_ECN_MASK) != dscp)
+ ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, dscp);
+ break;
+ default:
+ break;
+ }
+
+}
+
+static u8 cake_handle_diffserv(struct sk_buff *skb, bool wash)
{
u8 dscp;
switch (skb->protocol) {
case htons(ETH_P_IP):
- dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+ dscp = ipv4_get_dsfield(ip_hdr(skb)) & ~INET_ECN_MASK;
if (wash && dscp)
ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
return dscp;
case htons(ETH_P_IPV6):
- dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+ dscp = ipv6_get_dsfield(ipv6_hdr(skb)) & ~INET_ECN_MASK;
if (wash && dscp)
ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
return dscp;
@@ -1642,37 +1661,68 @@ static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
}
}
+static void cake_update_ct_mark(struct sk_buff *skb, u8 dscp)
+{
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
+
+ ct = nf_ct_get(skb, &ctinfo);
+ if (!ct)
+ return;
+
+ ct->mark &= ~(0xff << 24);
+ ct->mark |= (0x01 | dscp) << 24;
+ nf_conntrack_event_cache(IPCT_MARK, ct);
+#endif
+}
+
static struct cake_tin_data *cake_select_tin(struct Qdisc *sch,
struct sk_buff *skb)
{
struct cake_sched_data *q = qdisc_priv(sch);
- u32 tin;
+ bool wash;
u8 dscp;
+ u8 tin;
- /* Tin selection: Default to diffserv-based selection, allow overriding
- * using firewall marks or skb->priority.
- */
- dscp = cake_handle_diffserv(skb,
- q->rate_flags & CAKE_FLAG_WASH);
+ wash = !!(q->rate_flags & CAKE_FLAG_WASH);
+
+ if (q->tin_mode == CAKE_DIFFSERV_BESTEFFORT) {
- if (q->tin_mode == CAKE_DIFFSERV_BESTEFFORT)
tin = 0;
+ if (wash)
+ cake_update_diffserv(skb, 0);
- else if (q->rate_flags & CAKE_FLAG_FWMARK && /* use fw mark */
- skb->mark &&
- skb->mark <= q->tin_cnt)
- tin = q->tin_order[skb->mark - 1];
+ } else if (TC_H_MAJ(skb->priority) == sch->handle && /* use priority */
+ TC_H_MIN(skb->priority) > 0 &&
+ TC_H_MIN(skb->priority) <= q->tin_cnt) {
- else if (TC_H_MAJ(skb->priority) == sch->handle &&
- TC_H_MIN(skb->priority) > 0 &&
- TC_H_MIN(skb->priority) <= q->tin_cnt)
tin = q->tin_order[TC_H_MIN(skb->priority) - 1];
+ if (wash)
+ cake_update_diffserv(skb, 0);
- else {
- tin = q->tin_index[dscp];
+ } else if (q->rate_flags & CAKE_FLAG_FWMARK && /* use fw mark */
+ skb->mark & (0x01 << 24)) {
+
+ dscp = skb->mark >> 24 & ~INET_ECN_MASK;
+ tin = q->tin_index[dscp >> 2];
+
+ if (wash)
+ cake_update_diffserv(skb, 0);
+ else if (q->rate_flags & CAKE_FLAG_SETDSCP)
+ cake_update_diffserv(skb, dscp);
+
+ } else { /* fallback to DSCP */
+ /* extract the Diffserv Precedence field, if it exists */
+ /* and clear DSCP bits if washing */
+ dscp = cake_handle_diffserv(skb, wash);
+ tin = q->tin_index[dscp >> 2];
if (unlikely(tin >= q->tin_cnt))
tin = 0;
+
+ if (q->rate_flags & CAKE_FLAG_FWMARK && q->rate_flags & CAKE_FLAG_GETDSCP)
+ cake_update_ct_mark(skb, dscp);
}
return &q->tins[tin];
@@ -2760,6 +2810,20 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
q->rate_flags &= ~CAKE_FLAG_FWMARK;
}
+ if (tb[TCA_CAKE_GETDSCP]) {
+ if (!!nla_get_u32(tb[TCA_CAKE_GETDSCP]))
+ q->rate_flags |= CAKE_FLAG_GETDSCP;
+ else
+ q->rate_flags &= ~CAKE_FLAG_GETDSCP;
+ }
+
+ if (tb[TCA_CAKE_SETDSCP]) {
+ if (!!nla_get_u32(tb[TCA_CAKE_SETDSCP]))
+ q->rate_flags |= CAKE_FLAG_SETDSCP;
+ else
+ q->rate_flags &= ~CAKE_FLAG_SETDSCP;
+ }
+
if (q->tins) {
sch_tree_lock(sch);
cake_reconfigure(sch);
@@ -2944,6 +3008,14 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
!!(q->rate_flags & CAKE_FLAG_FWMARK)))
goto nla_put_failure;
+ if (nla_put_u32(skb, TCA_CAKE_GETDSCP,
+ !!(q->rate_flags & CAKE_FLAG_GETDSCP)))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_CAKE_SETDSCP,
+ !!(q->rate_flags & CAKE_FLAG_SETDSCP)))
+ goto nla_put_failure;
+
return nla_nest_end(skb, opts);
nla_put_failure:
--
2.17.2 (Apple Git-113)
[-- Attachment #3: 0001-tc-cake-add-fwmark-getdscp-setdscp-options.patch --]
[-- Type: application/octet-stream, Size: 5830 bytes --]
From 5f5f9d02280616e1e85bc87456ca46e27cd28ca9 Mon Sep 17 00:00:00 2001
From: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Date: Wed, 27 Feb 2019 14:46:05 +0000
Subject: [PATCH] cake: add fwmark & getdscp & setdscp options
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
---
include/uapi/linux/pkt_sched.h | 3 ++
man/man8/tc-cake.8 | 19 +++++++++++
tc/q_cake.c | 60 ++++++++++++++++++++++++++++++++++
3 files changed, 82 insertions(+)
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 89ee47c2..766cd9a1 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -991,6 +991,9 @@ enum {
TCA_CAKE_INGRESS,
TCA_CAKE_ACK_FILTER,
TCA_CAKE_SPLIT_GSO,
+ TCA_CAKE_FWMARK,
+ TCA_CAKE_GETDSCP,
+ TCA_CAKE_SETDSCP,
__TCA_CAKE_MAX
};
#define TCA_CAKE_MAX (__TCA_CAKE_MAX - 1)
diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
index c62e5547..7db0f142 100644
--- a/man/man8/tc-cake.8
+++ b/man/man8/tc-cake.8
@@ -73,6 +73,12 @@ TIME |
]
.br
[
+.BR fwmark
+|
+.BR nofwmark*
+]
+.br
+[
.BR split-gso*
|
.BR no-split-gso
@@ -623,6 +629,19 @@ override mechanism; if a host ID is assigned, it will be used as both source and
destination host.
+.SH OVERRIDING CLASSIFICATION WITH NETFILTER CONNMARKS
+
+In addition to TC FILTER tin classification, firewall marks may also optionally
+be used. The priority order (highest to lowest) for tin selection is TC filter,
+firewall mark and then DSCP.
+.PP
+.B fwmark
+
+.br
+ Enables CONNMARK based tin selection. Valid CONNMARKS range from 1 to the
+maximum number of tins i.e. 3 tins for diffserv3, 4 tins for diffserv4.
+Values outside the valid range are ignored and CAKE will fall back to using
+DSCP for tin selection.
.SH EXAMPLES
# tc qdisc delete root dev eth0
diff --git a/tc/q_cake.c b/tc/q_cake.c
index e827e3f1..9c892a3b 100644
--- a/tc/q_cake.c
+++ b/tc/q_cake.c
@@ -79,6 +79,9 @@ static void explain(void)
" dual-srchost | dual-dsthost | triple-isolate* ]\n"
" [ nat | nonat* ]\n"
" [ wash | nowash* ]\n"
+" [ fwmark | nofwmark* ]\n"
+" [ getdscp | nogetdscp* ]\n"
+" [ setdscp | nosetdscp* ]\n"
" [ split-gso* | no-split-gso ]\n"
" [ ack-filter | ack-filter-aggressive | no-ack-filter* ]\n"
" [ memlimit LIMIT ]\n"
@@ -106,6 +109,9 @@ static int cake_parse_opt(struct qdisc_util *qu, int argc, char **argv,
int autorate = -1;
int ingress = -1;
int overhead = 0;
+ int getdscp = -1;
+ int setdscp = -1;
+ int fwmark = -1;
int wash = -1;
int nat = -1;
int atm = -1;
@@ -161,6 +167,18 @@ static int cake_parse_opt(struct qdisc_util *qu, int argc, char **argv,
split_gso = 1;
} else if (strcmp(*argv, "no-split-gso") == 0) {
split_gso = 0;
+ } else if (strcmp(*argv, "fwmark") == 0) {
+ fwmark = 1;
+ } else if (strcmp(*argv, "nofwmark") == 0) {
+ fwmark = 0;
+ } else if (strcmp(*argv, "getdscp") == 0) {
+ getdscp = 1;
+ } else if (strcmp(*argv, "nogetdscp") == 0) {
+ getdscp = 0;
+ } else if (strcmp(*argv, "setdscp") == 0) {
+ setdscp = 1;
+ } else if (strcmp(*argv, "nosetdscp") == 0) {
+ setdscp = 0;
} else if (strcmp(*argv, "flowblind") == 0) {
flowmode = CAKE_FLOW_NONE;
} else if (strcmp(*argv, "srchost") == 0) {
@@ -383,6 +401,15 @@ static int cake_parse_opt(struct qdisc_util *qu, int argc, char **argv,
if (split_gso != -1)
addattr_l(n, 1024, TCA_CAKE_SPLIT_GSO, &split_gso,
sizeof(split_gso));
+ if (fwmark != -1)
+ addattr_l(n, 1024, TCA_CAKE_FWMARK, &fwmark,
+ sizeof(fwmark));
+ if (getdscp != -1)
+ addattr_l(n, 1024, TCA_CAKE_GETDSCP, &getdscp,
+ sizeof(getdscp));
+ if (setdscp != -1)
+ addattr_l(n, 1024, TCA_CAKE_SETDSCP, &setdscp,
+ sizeof(setdscp));
if (ingress != -1)
addattr_l(n, 1024, TCA_CAKE_INGRESS, &ingress, sizeof(ingress));
if (ack_filter != -1)
@@ -415,6 +442,9 @@ static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
int overhead = 0;
int autorate = 0;
int ingress = 0;
+ int getdscp = 0;
+ int setdscp = 0;
+ int fwmark = 0;
int wash = 0;
int raw = 0;
int mpu = 0;
@@ -500,6 +530,18 @@ static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
RTA_PAYLOAD(tb[TCA_CAKE_SPLIT_GSO]) >= sizeof(__u32)) {
split_gso = rta_getattr_u32(tb[TCA_CAKE_SPLIT_GSO]);
}
+ if (tb[TCA_CAKE_FWMARK] &&
+ RTA_PAYLOAD(tb[TCA_CAKE_FWMARK]) >= sizeof(__u32)) {
+ fwmark = rta_getattr_u32(tb[TCA_CAKE_FWMARK]);
+ }
+ if (tb[TCA_CAKE_GETDSCP] &&
+ RTA_PAYLOAD(tb[TCA_CAKE_GETDSCP]) >= sizeof(__u32)) {
+ getdscp = rta_getattr_u32(tb[TCA_CAKE_GETDSCP]);
+ }
+ if (tb[TCA_CAKE_SETDSCP] &&
+ RTA_PAYLOAD(tb[TCA_CAKE_SETDSCP]) >= sizeof(__u32)) {
+ setdscp = rta_getattr_u32(tb[TCA_CAKE_SETDSCP]);
+ }
if (tb[TCA_CAKE_RAW]) {
raw = 1;
}
@@ -532,6 +574,24 @@ static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
print_string(PRINT_FP, NULL, "no-split-gso ", NULL);
print_bool(PRINT_JSON, "split_gso", NULL, split_gso);
+ if (fwmark)
+ print_string(PRINT_FP, NULL, "fwmark ", NULL);
+ else
+ print_string(PRINT_FP, NULL, "nofwmark ", NULL);
+ print_bool(PRINT_JSON, "fwmark", NULL, fwmark);
+
+ if (getdscp)
+ print_string(PRINT_FP, NULL, "getdscp ", NULL);
+ else
+ print_string(PRINT_FP, NULL, "nogetdscp ", NULL);
+ print_bool(PRINT_JSON, "getdscp", NULL, getdscp);
+
+ if (setdscp)
+ print_string(PRINT_FP, NULL, "setdscp ", NULL);
+ else
+ print_string(PRINT_FP, NULL, "nosetdscp ", NULL);
+ print_bool(PRINT_JSON, "setdscp", NULL, setdscp);
+
if (interval)
print_string(PRINT_FP, NULL, "rtt %s ",
sprint_time(interval, b2));
--
2.17.2 (Apple Git-113)
[-- Attachment #4: my_layer_cake.qos --]
[-- Type: application/octet-stream, Size: 5131 bytes --]
#!/bin/sh
# Cero3 Shaper
# A cake shaper and AQM solution that allows several diffserv marking schemes
# for ethernet gateways
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
#
# Copyright (C) 2012-5 Michael D. Taht, Toke Høiland-Jørgensen, Sebastian Moeller
#sm: TODO pass in the cake diffserv keyword
. ${SQM_LIB_DIR}/defaults.sh
QDISC=cake
# Default traffic classication is passed in INGRESS_CAKE_OPTS and EGRESS_CAKE_OPTS, defined in defaults.sh now
egress() {
SILENT=1 $TC qdisc del dev $IFACE root
$TC qdisc add dev $IFACE root handle cacf: $( get_stab_string ) cake \
bandwidth ${UPLINK}kbit $( get_cake_lla_string ) ${EGRESS_CAKE_OPTS} ${EQDISC_OPTS}
}
ingress() {
SILENT=1 $TC qdisc del dev $IFACE handle ffff: ingress
$TC qdisc add dev $IFACE handle ffff: ingress
SILENT=1 $TC qdisc del dev $DEV root
[ "$IGNORE_DSCP_INGRESS" -eq "1" ] && INGRESS_CAKE_OPTS="$INGRESS_CAKE_OPTS besteffort"
[ "$ZERO_DSCP_INGRESS" -eq "1" ] && INGRESS_CAKE_OPTS="$INGRESS_CAKE_OPTS wash"
$TC qdisc add dev $DEV root handle cace: $( get_stab_string ) cake \
bandwidth ${DOWNLINK}kbit $( get_cake_lla_string ) ${INGRESS_CAKE_OPTS} ${IQDISC_OPTS}
$IP link set dev $DEV up
# redirect all IP packets arriving in $IFACE to ifb0
# and restore connmark this may be used by cake+
$TC filter add dev $IFACE parent ffff: protocol all prio 10 u32 \
match u32 0 0 flowid 1:1 action connmark action mirred egress redirect dev $DEV
# Configure iptables chain to mark packets
ipt -t mangle -N QOS_MARK_${IFACE}
# Change DSCP of relevant hosts/packets - this will be picked up by cake+ and placed in the firewall connmark
# also the DSCP is used as the tin selector.
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.5 -m comment --comment "Skybox DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.5 -m comment --comment "Skybox DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.10 -m comment --comment "Bluray DSCP CS2 Video" -j DSCP --set-dscp-class CS2
iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.10 -m comment --comment "Bluray DSCP CS2 Video" -j DSCP --set-dscp-class CS2
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.12 -m udp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --dport 4443 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --dport 443 -m comment --comment "HTTPS uploads DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Vid4 dst -j DSCP --set-dscp-class CS2 -m comment --comment "Vid CS2 ipset"
ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -p udp -s ::c/::ffff:ffff:ffff:ffff -m udp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --dport 4443 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --dport 443 -m comment --comment "HTTPS uploads DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Vid6 dst -j DSCP --set-dscp-class CS2 -m comment --comment "Vid CS2 ipset"
# Send cake+ unmarked connections to the marking chain - Cake+ uses top byte as the
# i've been marked & here's the dscp placeholder.
# top 6 bits are DSCP, LSB is DSCP is valid flag
ipt -t mangle -A PREROUTING -i $IFACE -m mark --mark 0x00/0x01000000 -g QOS_MARK_${IFACE}
ipt -t mangle -A POSTROUTING -o $IFACE -m mark --mark 0x00/0x01000000 -g QOS_MARK_${IFACE}
}
sqm_start() {
[ -n "$IFACE" ] || return 1
do_modules
verify_qdisc $QDISC "cake" || return 1
sqm_debug "Starting ${SCRIPT}"
[ -z "$DEV" ] && DEV=$( get_ifb_for_if ${IFACE} )
if [ "${UPLINK}" -ne 0 ];
then
egress
sqm_debug "egress shaping activated"
else
sqm_debug "egress shaping deactivated"
SILENT=1 $TC qdisc del dev ${IFACE} root
fi
if [ "${DOWNLINK}" -ne 0 ];
then
verify_qdisc ingress "ingress" || return 1
ingress
sqm_debug "ingress shaping activated"
else
sqm_debug "ingress shaping deactivated"
SILENT=1 $TC qdisc del dev ${DEV} root
SILENT=1 $TC qdisc del dev ${IFACE} ingress
fi
return 0
}
[-- Attachment #5: 0001-start-of-act_connmark-hack.patch --]
[-- Type: application/octet-stream, Size: 3387 bytes --]
From 91f2c94d333fbd391dbde30eff48ee1e859bbdf8 Mon Sep 17 00:00:00 2001
From: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Date: Tue, 5 Mar 2019 14:48:30 +0000
Subject: [PATCH] start of act_connmark hack
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
---
net/sched/act_connmark.c | 61 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 8475913f2070..033e54dbee28 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -41,6 +41,7 @@ static int tcf_connmark_act(struct sk_buff *skb, const struct tc_action *a,
struct nf_conntrack_zone zone;
struct nf_conn *c;
int proto;
+ u8 dscp;
spin_lock(&ca->tcf_lock);
tcf_lastuse_update(&ca->tcf_tm);
@@ -62,6 +63,36 @@ static int tcf_connmark_act(struct sk_buff *skb, const struct tc_action *a,
c = nf_ct_get(skb, &ctinfo);
if (c) {
+ if (dscpops & GET_DSCP && proto && !(c->mark & 0x01000000)) {
+ /* mark does not cotnain DSCP so store DSCP bits into c->mark */
+ switch (proto) {
+ case NFPROTO_IPV4:
+ dscp = ipv4_get_dsfield(ip_hdr(skb));
+ break;
+ case NFPROTO_IPV6:
+ dscp = ipv6_get_dsfield(ip_hdr(skb));
+ break;
+ default:
+ dscp = 0;
+ break;
+ }
+ c->mark &= 0x00ffffff;
+ c->mark |= (0x01 | dscp) << 24;
+ } else if (dscpops & SET_DSCP && proto && (c->mark & 0x01000000)) {
+ /* mark contains DSCP so restore DSCP bits from c->mark into diffserv */
+ switch (proto) {
+ case NFPROTO_IPV4:
+ if (ipv4_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK) != c->mark >> 24 & ~INET_ECN_MASK)
+ ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, c->mark >> 24 & ~INET_ECN_MASK);
+ break;
+ case NFPROTO_IPV6:
+ if (ipv6_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK) != c->mark >> 24 & ~INET_ECN_MASK)
+ ipv6_change_dsfield(ip_hdr(skb), INET_ECN_MASK, c->mark >> 24 & ~INET_ECN_MASK);
+ break;
+ default:
+ break;
+ }
+ }
skb->mark = c->mark;
/* using overlimits stats to count how many packets marked */
ca->tcf_qstats.overlimits++;
@@ -82,6 +113,36 @@ static int tcf_connmark_act(struct sk_buff *skb, const struct tc_action *a,
c = nf_ct_tuplehash_to_ctrack(thash);
/* using overlimits stats to count how many packets marked */
ca->tcf_qstats.overlimits++;
+ if (dscpops & GET_DSCP && proto && !(c->mark & 0x01000000)) {
+ /* store the DSCP bits into c->mark */
+ switch (proto) {
+ case NFPROTO_IPV4:
+ dscp = ipv4_get_dsfield(ip_hdr(skb));
+ break;
+ case NFPROTO_IPV6:
+ dscp = ipv6_get_dsfield(ip_hdr(skb));
+ break;
+ default:
+ dscp = 0;
+ break;
+ }
+ c->mark &= 0x00ffffff;
+ c->mark |= (0x01 | dscp) << 24;
+ } else if (dscpops & SET_DSCP && proto && (c->mark & 0x01000000)) {
+ /* restore the DSCP bits from c->mark into diffserv */
+ switch (proto) {
+ case NFPROTO_IPV4:
+ if (ipv4_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK) != c->mark >> 24 & ~INET_ECN_MASK)
+ ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, c->mark >> 24 & ~INET_ECN_MASK);
+ break;
+ case NFPROTO_IPV6:
+ if (ipv6_get_dsfield(ip_hdr(skb) & ~INET_ECN_MASK) != c->mark >> 24 & ~INET_ECN_MASK)
+ ipv6_change_dsfield(ip_hdr(skb), INET_ECN_MASK, c->mark >> 24 & ~INET_ECN_MASK);
+ break;
+ default:
+ break;
+ }
+ }
skb->mark = c->mark;
nf_ct_put(c);
--
2.17.2 (Apple Git-113)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-06 16:47 ` John Sager
@ 2019-03-07 9:50 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-07 9:50 UTC (permalink / raw)
To: John Sager, cake
John Sager <john@sager.me.uk> writes:
> You should be able to infer get (egress action) and set (ingress action)
> from which qdisc it is attached to. Is it possible to do that when the
> action function is invoked (from parameters of tcf_connmark_act() or does it
> need a static flag set when tc attaches the action?
From the TC action point of view there is no difference between what we
call egress and ingress actions. They are just things that happen when
an action is invoked on a qdisc. So we definitely need configuration
flags, and I think it is simpler to just name them 'get' and 'set'
rather than attach any directional meaning to them...
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-06 18:40 ` Kevin Darbyshire-Bryant
@ 2019-03-07 10:10 ` Toke Høiland-Jørgensen
2019-03-07 15:56 ` Kevin Darbyshire-Bryant
0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-07 10:10 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant, cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>> On 6 Mar 2019, at 15:21, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>>
>>> Before I go too far down this road (and to avoid the horror of
>>> actually trying to code it) here’s what I’m trying to achieve.
>>>
>>>
>>> act_connmark + dscp is designed to copy a DSCP code to/from conntrack mark. It uses 8 bits of the mark field, currently the most significant byte.
>>>
>>> Bits 31-26: DSCP
>>> Bit 25: Spare/Future
>>> Bit 24: Valid DSCP set
>>>
>>> The valid bit is set when the ‘getdscp’ function has written a DSCP
>>> value into the conntrack (& hence skb) mark. This allows us & other
>>> skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that
>>> a DSCP value has been placed in the field. We cannot simply use a
>>> non-zero DSCP because zero is a valid DSCP.
>>
>> If someone installs the action, the field is supposedly always copied;
>> so why do we need another flag?
>
> I’m trying to limit the number of times expensive iptables mangle
> rules have to run.
Right, I see your point, but I'm worried that this can risk becoming a
source of hard-to-debug bugs if this bit happens to get set by accident
in other places. So, I would suggest to at least make it optional (and
configurable). So how about the following configuration options:
- fwmark mask (32-bit value)
specifies a mask to apply to the fwmark field before all operations
- fwmark shift (valid values 0-32)
specifies how many bits to left-shift the DSCP values before putting
them into the fwmark (and how many bits to right-shift the value read
from fwmark before writing it to the DSCP field); this could also be
inferred from the mask rather than be a separate option (shift =
lowest set bit of mask)
- get_dscp (boolean; cannot be set along with set_dscp)
if set, the DSCP field will be copied to the fwmark field, subject to
mask and shift
- set_dscp (boolean; cannot be set along with get_dscp)
if set, the value in fwmark will be copied to the DSCP field, subject
to mask and shift
- state mask (32-bit value; probably needs better name)
if set: the get_dscp action will OR the resulting fwmark before
storing it. the set_dscp value will AND the fwmark with this value
before doing anything, and abort if the result is false.
I think this would allow you to implement what you described, without
hard-coding any behaviour. Right?
Does anyone else have any opinions / objections to the above API?
> The reality is that I enjoyed doing this in the cake codebase. I
> cannot say the same for act_connmark in fact I hate it so much I’m
> stopping. The mental effort for a non-programmer and more importantly
> a non-kernel programmer is exhausting & I’m completely disillusioned.
> I really need to concentrate on the job that means I can pay the
> mortgage, which isn’t bashing my head against the kernel.
Fair enough; no reason to do this if you're not enjoying it! We can
iterate on the API, and I guess I can write the code at some point in
the future if no one else beats me to it. No promises on when, though,
so if someone else feels like tackling it, please go ahead :)
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-07 10:10 ` Toke Høiland-Jørgensen
@ 2019-03-07 15:56 ` Kevin Darbyshire-Bryant
2019-03-07 17:40 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-07 15:56 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: cake
> On 7 Mar 2019, at 10:10, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
<snipping some context>
>>>>
>>>> The valid bit is set when the ‘getdscp’ function has written a DSCP
>>>> value into the conntrack (& hence skb) mark. This allows us & other
>>>> skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that
>>>> a DSCP value has been placed in the field. We cannot simply use a
>>>> non-zero DSCP because zero is a valid DSCP.
>>>
>>> If someone installs the action, the field is supposedly always copied;
>>> so why do we need another flag?
>>
>> I’m trying to limit the number of times expensive iptables mangle
>> rules have to run.
>
> Right, I see your point, but I'm worried that this can risk becoming a
> source of hard-to-debug bugs if this bit happens to get set by accident
> in other places. So, I would suggest to at least make it optional (and
> configurable). So how about the following configuration options:
Phew, I explained myself clearly enough (just) that time :-). There’s a compromise here to be had between setting/using a DSCP for connection(fwmark) vs setting/using a DSCP per packet. Mostly it’s a (improved) performance vs DSCP accuracy compromise and assumes DSCP isn’t going to change after the connection has been established - some people have rules that mangle DSCP based on amount of data transferred. Personal view: anything that permits some sort of classification restoration on ingress has to be an improvement on what we have now.
>
> - fwmark mask (32-bit value)
> specifies a mask to apply to the fwmark field before all operations
Yes - makes sense.
>
> - fwmark shift (valid values 0-32)
> specifies how many bits to left-shift the DSCP values before putting
> them into the fwmark (and how many bits to right-shift the value read
> from fwmark before writing it to the DSCP field); this could also be
> inferred from the mask rather than be a separate option (shift =
> lowest set bit of mask)
I think the inferred choice is best. I can see all manner of confusing behaviour with mismatches between mask & shift. Also it’s one less parameter to pass … and in my dream world have to add some of these parameters into cake as well so it can interpret the DSCP containing fwmark as well… the fewer the better :-)
>
> - get_dscp (boolean; cannot be set along with set_dscp)
> if set, the DSCP field will be copied to the fwmark field, subject to
> mask and shift
Yes
>
> - set_dscp (boolean; cannot be set along with get_dscp)
> if set, the value in fwmark will be copied to the DSCP field, subject
> to mask and shift
Yes
>
> - state mask (32-bit value; probably needs better name)
> if set: the get_dscp action will OR the resulting fwmark before
> storing it. the set_dscp value will AND the fwmark with this value
> before doing anything, and abort if the result is false.
Nearly: If set; the get_dscp action will AND the fwmark with this value and abort if true. If false it will OR the resulting fwmark before storing it. the set_dscp action will AND the fwmark with this value before doing anything and will abort if result is false.
The ‘get_dscp action AND with fwmark and abort if true’ is the function that allows DSCP values to ‘stick' into a connection, thus obviating the need for iptables rules to mangle the DSCP on every egress packet.
I can see a conflict between people who want the DSCP copied into fwmark no matter what and people such as myself who wish it to only happen if fwmark doesn’t have it set already.
Is the solution to have a ‘get_check_state’ option that enables the ‘get_dscp action and with fwmark ’state mask’ abort if true check? Maybe even simpler to have a ‘get_state mask’ and a ’set_state mask’ - AND the fwmark with the get_state mask, if false copy the dscp, else abort.
>
> I think this would allow you to implement what you described, without
> hard-coding any behaviour. Right?
Not quite - see above.
>
> Does anyone else have any opinions / objections to the above API?
>
>
>> The reality is that I enjoyed doing this in the cake codebase. I
>> cannot say the same for act_connmark in fact I hate it so much I’m
>> stopping. The mental effort for a non-programmer and more importantly
>> a non-kernel programmer is exhausting & I’m completely disillusioned.
>> I really need to concentrate on the job that means I can pay the
>> mortgage, which isn’t bashing my head against the kernel.
>
> Fair enough; no reason to do this if you're not enjoying it! We can
> iterate on the API, and I guess I can write the code at some point in
> the future if no one else beats me to it. No promises on when, though,
> so if someone else feels like tackling it, please go ahead :)
>
> -Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-07 15:56 ` Kevin Darbyshire-Bryant
@ 2019-03-07 17:40 ` Toke Høiland-Jørgensen
2019-03-08 11:13 ` Kevin Darbyshire-Bryant
0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-07 17:40 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant; +Cc: cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>> On 7 Mar 2019, at 10:10, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> <snipping some context>
>>>>>
>>>>> The valid bit is set when the ‘getdscp’ function has written a DSCP
>>>>> value into the conntrack (& hence skb) mark. This allows us & other
>>>>> skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that
>>>>> a DSCP value has been placed in the field. We cannot simply use a
>>>>> non-zero DSCP because zero is a valid DSCP.
>>>>
>>>> If someone installs the action, the field is supposedly always copied;
>>>> so why do we need another flag?
>>>
>>> I’m trying to limit the number of times expensive iptables mangle
>>> rules have to run.
>>
>> Right, I see your point, but I'm worried that this can risk becoming a
>> source of hard-to-debug bugs if this bit happens to get set by accident
>> in other places. So, I would suggest to at least make it optional (and
>> configurable). So how about the following configuration options:
>
> Phew, I explained myself clearly enough (just) that time :-). There’s
> a compromise here to be had between setting/using a DSCP for
> connection(fwmark) vs setting/using a DSCP per packet. Mostly it’s a
> (improved) performance vs DSCP accuracy compromise and assumes DSCP
> isn’t going to change after the connection has been established - some
> people have rules that mangle DSCP based on amount of data
> transferred. Personal view: anything that permits some sort of
> classification restoration on ingress has to be an improvement on what
> we have now.
So I don't think we in general can assume that all packets of the same
flow has the same DSCP mark. If we want to enforce this, that is another
matter entirely.
>> - fwmark shift (valid values 0-32)
>> specifies how many bits to left-shift the DSCP values before putting
>> them into the fwmark (and how many bits to right-shift the value read
>> from fwmark before writing it to the DSCP field); this could also be
>> inferred from the mask rather than be a separate option (shift =
>> lowest set bit of mask)
>
> I think the inferred choice is best. I can see all manner of confusing
> behaviour with mismatches between mask & shift. Also it’s one less
> parameter to pass … and in my dream world have to add some of these
> parameters into cake as well so it can interpret the DSCP containing
> fwmark as well… the fewer the better :-)
Yeah, I tend to lean towards that as well. Which could also work for
cake: simply interpret the FWMARK parameter as a mask, and shift by the
lowest set bit.
>> - get_dscp (boolean; cannot be set along with set_dscp)
>> if set, the DSCP field will be copied to the fwmark field, subject to
>> mask and shift
>
> Yes
>
>>
>> - set_dscp (boolean; cannot be set along with get_dscp)
>> if set, the value in fwmark will be copied to the DSCP field, subject
>> to mask and shift
>
> Yes
>
>>
>> - state mask (32-bit value; probably needs better name)
>> if set: the get_dscp action will OR the resulting fwmark before
>> storing it. the set_dscp value will AND the fwmark with this value
>> before doing anything, and abort if the result is false.
>
> Nearly: If set; the get_dscp action will AND the fwmark with this
> value and abort if true. If false it will OR the resulting fwmark
> before storing it. the set_dscp action will AND the fwmark with this
> value before doing anything and will abort if result is false.
>
> The ‘get_dscp action AND with fwmark and abort if true’ is the
> function that allows DSCP values to ‘stick' into a connection, thus
> obviating the need for iptables rules to mangle the DSCP on every
> egress packet.
>
> I can see a conflict between people who want the DSCP copied into
> fwmark no matter what and people such as myself who wish it to only
> happen if fwmark doesn’t have it set already.
>
> Is the solution to have a ‘get_check_state’ option that enables the
> ‘get_dscp action and with fwmark ’state mask’ abort if true check?
> Maybe even simpler to have a ‘get_state mask’ and a ’set_state mask’ -
> AND the fwmark with the get_state mask, if false copy the dscp, else
> abort.
Right, if you want to be able to do all those combinations of things we
quickly run into combinatorial explosion, or we risk hard-coding a
policy that some users won't like. However, if we do want to express
these kinds of things, maybe this is better implemented as additional
options to the existing iptables matches?
I.e., we could add
--set-mark-from-dscp MASK
to the MARK target, and
--set-dscp-from-fwmark MASK
to the DSCP target. Then you can implement whatever policy you want
using iptables rules... Any reason that wouldn't work?
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-07 17:40 ` Toke Høiland-Jørgensen
@ 2019-03-08 11:13 ` Kevin Darbyshire-Bryant
2019-03-08 11:28 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-08 11:13 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: cake
> On 7 Mar 2019, at 17:40, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>
>>> On 7 Mar 2019, at 10:10, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> <snipping some context>
>>>>>>
>>>>>> The valid bit is set when the ‘getdscp’ function has written a DSCP
>>>>>> value into the conntrack (& hence skb) mark. This allows us & other
>>>>>> skb->mark/ct->mark applications (eg iptables, cake qdisc) to know that
>>>>>> a DSCP value has been placed in the field. We cannot simply use a
>>>>>> non-zero DSCP because zero is a valid DSCP.
>>>>>
>>>>> If someone installs the action, the field is supposedly always copied;
>>>>> so why do we need another flag?
>>>>
>>>> I’m trying to limit the number of times expensive iptables mangle
>>>> rules have to run.
>>>
>>> Right, I see your point, but I'm worried that this can risk becoming a
>>> source of hard-to-debug bugs if this bit happens to get set by accident
>>> in other places. So, I would suggest to at least make it optional (and
>>> configurable). So how about the following configuration options:
>>
>> Phew, I explained myself clearly enough (just) that time :-). There’s
>> a compromise here to be had between setting/using a DSCP for
>> connection(fwmark) vs setting/using a DSCP per packet. Mostly it’s a
>> (improved) performance vs DSCP accuracy compromise and assumes DSCP
>> isn’t going to change after the connection has been established - some
>> people have rules that mangle DSCP based on amount of data
>> transferred. Personal view: anything that permits some sort of
>> classification restoration on ingress has to be an improvement on what
>> we have now.
>
> So I don't think we in general can assume that all packets of the same
> flow has the same DSCP mark. If we want to enforce this, that is another
> matter entirely.
>
>>> - fwmark shift (valid values 0-32)
>>> specifies how many bits to left-shift the DSCP values before putting
>>> them into the fwmark (and how many bits to right-shift the value read
>>> from fwmark before writing it to the DSCP field); this could also be
>>> inferred from the mask rather than be a separate option (shift =
>>> lowest set bit of mask)
>>
>> I think the inferred choice is best. I can see all manner of confusing
>> behaviour with mismatches between mask & shift. Also it’s one less
>> parameter to pass … and in my dream world have to add some of these
>> parameters into cake as well so it can interpret the DSCP containing
>> fwmark as well… the fewer the better :-)
>
> Yeah, I tend to lean towards that as well. Which could also work for
> cake: simply interpret the FWMARK parameter as a mask, and shift by the
> lowest set bit.
>
>>> - get_dscp (boolean; cannot be set along with set_dscp)
>>> if set, the DSCP field will be copied to the fwmark field, subject to
>>> mask and shift
>>
>> Yes
>>
>>>
>>> - set_dscp (boolean; cannot be set along with get_dscp)
>>> if set, the value in fwmark will be copied to the DSCP field, subject
>>> to mask and shift
>>
>> Yes
>>
>>>
>>> - state mask (32-bit value; probably needs better name)
>>> if set: the get_dscp action will OR the resulting fwmark before
>>> storing it. the set_dscp value will AND the fwmark with this value
>>> before doing anything, and abort if the result is false.
>>
>> Nearly: If set; the get_dscp action will AND the fwmark with this
>> value and abort if true. If false it will OR the resulting fwmark
>> before storing it. the set_dscp action will AND the fwmark with this
>> value before doing anything and will abort if result is false.
>>
>> The ‘get_dscp action AND with fwmark and abort if true’ is the
>> function that allows DSCP values to ‘stick' into a connection, thus
>> obviating the need for iptables rules to mangle the DSCP on every
>> egress packet.
>>
>> I can see a conflict between people who want the DSCP copied into
>> fwmark no matter what and people such as myself who wish it to only
>> happen if fwmark doesn’t have it set already.
>>
>> Is the solution to have a ‘get_check_state’ option that enables the
>> ‘get_dscp action and with fwmark ’state mask’ abort if true check?
>> Maybe even simpler to have a ‘get_state mask’ and a ’set_state mask’ -
>> AND the fwmark with the get_state mask, if false copy the dscp, else
>> abort.
>
> Right, if you want to be able to do all those combinations of things we
> quickly run into combinatorial explosion, or we risk hard-coding a
> policy that some users won't like. However, if we do want to express
> these kinds of things, maybe this is better implemented as additional
> options to the existing iptables matches?
>
> I.e., we could add
>
> --set-mark-from-dscp MASK
Yes, that (I think) could work. I could wrap my own iptables based ‘have I copied the required DSCP to the fwmark’ rules around that by setting my own flag bit in the fwmark.
but..
>
> to the MARK target, and
>
> --set-dscp-from-fwmark MASK
>
> to the DSCP target. Then you can implement whatever policy you want
> using iptables rules... Any reason that wouldn't work?
On its own I don’t think that would work for ingress traffic - iptables happens too late. So on planet Kevin I still need some sort of flag held in the fwmark that says ‘I hold a DSCP value’ so cake can use it and act_connmarkdscp can (optionally) restore it to the diffserv field.
I suspect we’re going around in circles around what I would like which is “a bit DSCP fuzzy but lighter on CPU ‘cos I don’t have to hit iptables mangle rules as much” v what I think you would like is ’update the fwmark DSCP every time but that also requires iptables to mangle the DSCP for every packet’
I think the options of get_mask & set_mask can accommodate both behaviour choices.
Phew this is hard - come on all you out there… i know you exist! What am I missing/misunderstanding? Ryan on this list previously said "Perfect! And to me, this functionality truly is the icing on (the) cake that makes it the complete bufferbloat/QoS system I've been
dreaming of for ingress.” Thoughts, comments or do Toke & myself make an amusing enough side show? :-)
Kevin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-08 11:13 ` Kevin Darbyshire-Bryant
@ 2019-03-08 11:28 ` Toke Høiland-Jørgensen
2019-03-08 14:03 ` Kevin Darbyshire-Bryant
0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-08 11:28 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant; +Cc: cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
> On its own I don’t think that would work for ingress traffic -
> iptables happens too late. So on planet Kevin I still need some sort
> of flag held in the fwmark that says ‘I hold a DSCP value’ so cake can
> use it and act_connmarkdscp can (optionally) restore it to the
> diffserv field.
>
> I suspect we’re going around in circles around what I would like which
> is “a bit DSCP fuzzy but lighter on CPU ‘cos I don’t have to hit
> iptables mangle rules as much” v what I think you would like is
> ’update the fwmark DSCP every time but that also requires iptables to
> mangle the DSCP for every packet’
Well I think my problem is that I don't really have a use case for this
myself. So I need to understand your use case better in order to have an
opinion on how best to implement it so that:
1. We can accommodate what you are trying to do
and
2. We can also accommodate other related use cases, and we don't set
policy in the kernel.
In particular, requirement 2 is why I'm pushing back against hard-coding
a mask anywhere...
So could you maybe post your current ruleset and explain what it is you
are trying to achieve at a high level, and why? :)
Also, you keep mentioning "must be lighter on CPU". Do you have any
performance numbers to show the impact of your current ruleset? Would be
easier to assess any performance impact if we have some baseline numbers
to compare against...
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-08 11:28 ` Toke Høiland-Jørgensen
@ 2019-03-08 14:03 ` Kevin Darbyshire-Bryant
2019-03-09 14:08 ` Toke Høiland-Jørgensen
2019-03-09 20:21 ` John Sager
0 siblings, 2 replies; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-08 14:03 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: cake
> On 8 Mar 2019, at 11:28, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>
>> On its own I don’t think that would work for ingress traffic -
>> iptables happens too late. So on planet Kevin I still need some sort
>> of flag held in the fwmark that says ‘I hold a DSCP value’ so cake can
>> use it and act_connmarkdscp can (optionally) restore it to the
>> diffserv field.
>>
>> I suspect we’re going around in circles around what I would like which
>> is “a bit DSCP fuzzy but lighter on CPU ‘cos I don’t have to hit
>> iptables mangle rules as much” v what I think you would like is
>> ’update the fwmark DSCP every time but that also requires iptables to
>> mangle the DSCP for every packet’
>
> Well I think my problem is that I don't really have a use case for this
> myself. So I need to understand your use case better in order to have an
> opinion on how best to implement it so that:
>
> 1. We can accommodate what you are trying to do
>
> and
>
> 2. We can also accommodate other related use cases, and we don't set
> policy in the kernel.
OK, what I am trying to do is classify incoming connections into relevant cake tins to impose some bandwidth fairness. e.g. classify bittorrent & things that are downloads into the Bulk tin, and prioritise stuff that is streaming video into the Video tin. Incoming DSCP has a) been washed and b) is unreliable anyway so is unhelpful in this case. iptables runs too late, so having rules to classify incoming stuff is pointless.
tc filters run early enough to use the tc skbedit major/minor number to influence cake’s tin decisions. But tc filters, a) don’t get to see de-natted ipv4 addresses, b) daisy chain, so all filters must be traversed. I can’t find my original tc filter ‘de-prio bittorrent’ but it was a very simple ‘does this destination port match?, yes skbedit to select bulk tin’ - I wanted to do more but the daisy chaining & lack of de-natting made this technique useless.
Then I recently discovered act_connmark (http://linux-ip.net/gl/tc-filters/tc-filters-node2.html) - the thinking being I could use iptables on egress to set fwmarks to classify a connection and have the ingress packets magically follow. This worked but still required 3 tc filter actions to cope with 4 tins:
$TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x01 fw action skbedit priority ${MAJOR}1
$TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x03 fw action skbedit priority ${MAJOR}3
$TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x04 fw action skbedit priority ${MAJOR}4
It also requires similar tc filters on the egress path in addition to the iptables rules.
Could that be improved? Yes, sort of. eBPF to the rescue-ish. I could write an eBPF classifier action program to directly copy the fwmark to the priority field which cake would pick up. I would have stopped there but as I’ve said in a previous email, the eBPF needed to know (hard code) the cake instance major numbers and there was the whole mystery tour of writing/building it.
The other problem with the above magic tin encode into the fwmark routine is that it ignored any good citizens that were using the correct DSCP (e.g. dropbear). I would need to write iptables rules to classify existing DSCP codepoints into the matching tin for fwmark. So ideally I needed the DSCP to drive things and still act as a key into the fwmark mechanism.
The overriding (if required) of DSCP could be done in iptables and to avoid going through the iptables DSCP decision/mangling for every packet I could use a flag within the fwmark to indicate the decision had previously been made and stored for this connection.
The current rules are:
# Configure iptables chain to mark packets
ipt -t mangle -N QOS_MARK_${IFACE}
# Change DSCP of initial relevant hosts/packets - this will be picked up by cake+ and placed in the firewall connmark
# also the DSCP is used as the tin selector.
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.5 -m comment --comment "Skybox DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.5 -m comment --comment "Skybox DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.10 -m comment --comment "Bluray DSCP CS2 Video" -j DSCP --set-dscp-class CS2
iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.10 -m comment --comment "Bluray DSCP CS2 Video" -j DSCP --set-dscp-class CS2
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.12 -m udp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --dport 4443 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --dport 443 -m comment --comment "HTTPS uploads DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
iptables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Bulk4 dst -j DSCP --set-dscp-class CS1 -m comment --comment "Bulk CS1 ipset"
iptables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Vid4 dst -j DSCP --set-dscp-class CS2 -m comment --comment "Vid CS2 ipset"
iptables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Voice4 dst -j DSCP --set-dscp-class CS6 -m comment --comment "Voice CS6 ipset"
ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -p udp -s ::c/::ffff:ffff:ffff:ffff -m udp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --dport 4443 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --dport 443 -m comment --comment "HTTPS uploads DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
ip6tables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Bulk6 dst -j DSCP --set-dscp-class CS1 -m comment --comment "Bulk CS1 ipset"
ip6tables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Vid6 dst -j DSCP --set-dscp-class CS2 -m comment --comment "Vid CS2 ipset"
ip6tables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Voice6 dst -j DSCP --set-dscp-class CS6 -m comment --comment "Voice CS6 ipset"
# Send cake+ unmarked connections to the marking chain - Cake+ uses top byte as the
# i've been marked & here's the dscp placeholder.
# top 6 bits are DSCP, LSB is DSCP is valid flag
ipt -t mangle -A PREROUTING -i $IFACE -m mark --mark 0x00/0x01000000 -g QOS_MARK_${IFACE}
ipt -t mangle -A POSTROUTING -o $IFACE -m mark --mark 0x00/0x01000000 -g QOS_MARK_${IFACE}
The initial egress packet for a connection will go through the above chain (--mark 0x00/0x01000000 -g QOS_MARK_${IFACE}) where the DSCP value is change if required.
Cake will see this initial packet, inspect the fwmark, and because it hasn’t been set will both copy the dscp into the mark and set the ‘fwdscp marked’ bit.
Subsequent egress packets will neither go through the iptables DSCP mangle or the cake ‘update the fwmark’ routine. Instead, cake will use the fwmark as the tin selector.
The ingress path is different. First off act_connmark restores any connection mark to the packet. Cake will inspect the fwmark for the ‘fwdscp marked’ bit. If it is set, then the dscp coded in the firewall mark is used for tin selection. Optionally the encoded DSCP is restored to the packet’s diffserv, but I personally don’t use that functionality as I’m only interested in ’tin fair’ use of the link. And that’s it.
I’m doing 2 things.
1) Classifying traffic into tins on ingress based on the egress DSCP values contained in fwmarks.
2) Basing the fwmark contained DSCP on the initial packet of the connection, possibly after being modified once by iptables rules.
>
> In particular, requirement 2 is why I'm pushing back against hard-coding
> a mask anywhere…
I think with ‘fwmark mask’, ‘get_dscp’, ’set_dscp’, ‘get_state mask’, ’set_state mask’ nothing *is* hard coded.
>
> So could you maybe post your current ruleset and explain what it is you
> are trying to achieve at a high level, and why? :)
I hope I’ve done that.
>
> Also, you keep mentioning "must be lighter on CPU". Do you have any
> performance numbers to show the impact of your current ruleset? Would be
> easier to assess any performance impact if we have some baseline numbers
> to compare against…
Let me see if I can quantify that in some way.
>
> -Toke
Cheers,
Kevin D-B
012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-08 14:03 ` Kevin Darbyshire-Bryant
@ 2019-03-09 14:08 ` Toke Høiland-Jørgensen
2019-03-10 15:21 ` Kevin Darbyshire-Bryant
2019-03-09 20:21 ` John Sager
1 sibling, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-09 14:08 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant; +Cc: cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
> OK, what I am trying to do is classify incoming connections into
> relevant cake tins to impose some bandwidth fairness. e.g. classify
> bittorrent & things that are downloads into the Bulk tin, and
> prioritise stuff that is streaming video into the Video tin. Incoming
> DSCP has a) been washed and b) is unreliable anyway so is unhelpful in
> this case. iptables runs too late, so having rules to classify
> incoming stuff is pointless.
Right, I see.
[... snip .. ]
> Then I recently discovered act_connmark
> (http://linux-ip.net/gl/tc-filters/tc-filters-node2.html) - the
> thinking being I could use iptables on egress to set fwmarks to
> classify a connection and have the ingress packets magically follow.
> This worked but still required 3 tc filter actions to cope with 4
> tins:
>
> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x01 fw action skbedit priority ${MAJOR}1
> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x03 fw action skbedit priority ${MAJOR}3
> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x04 fw action skbedit priority ${MAJOR}4
Right, so this can be replaced with the fwmark action we already added
(and that I just pushed an update to so it supports masking the value
before selecting a tin).
> The overriding (if required) of DSCP could be done in iptables and to
> avoid going through the iptables DSCP decision/mangling for every
> packet I could use a flag within the fwmark to indicate the decision
> had previously been made and stored for this connection.
[ ... ]
> I’m doing 2 things.
>
> 1) Classifying traffic into tins on ingress based on the egress DSCP
> values contained in fwmarks.
>
> 2) Basing the fwmark contained DSCP on the initial packet of the
> connection, possibly after being modified once by iptables rules.
So I tried prototyping what it would actually look like to do all this
in iptables. The result is below (in iptables-restore format). I haven't
tested it, but I believe something along the lines of this will work,
when used along with the CAKE fwmark support (setting a mask of 0xFF
when configuring CAKE).
Now, the obvious eyesore on this is the need to replicate CAKEs diffserv
mappings in iptables rules (21 rules in this case, for the diffserv4
mapping). As long as this only runs once per connection I don't actually
think it's much of a performance issue for normal use, but obviously
there could be pathological cases, and it's also annoying to have to do
that.
So, first question becomes: Do you agree that the firewall rules below
would solve your use case (ignoring the ugliness of having to replicate
the diffserv parsing in iptables)? Or am I missing something?
-Toke
*mangle
:PREROUTING ACCEPT [0:0]
:APPLY-MARKS - [0:0]
:MARK-DSCP - [0:0]
:MARK-POLICY - [0:0]
# Run on inside iface - eth0 in this case
-A PREROUTING -i eth0 -J APPLY-MARKS
# Make sure we have the marks from conntrack
-A APPLY-MARKS -J CONNMARK --restore-mark --nfmask 0xFF --ctmask 0xFF
# Abort if our "already set" bit is set
-A APPLY-MARKS -m mark --mark 0x80/0x80 -j RETURN
# If a DSCP value is set, use DSCP-based marking
-A APPLY-MARKS -m dscp ! --dscp 0 -J MARK-DSCP
# Otherwise, set our own policy
-A APPLY-MARKS -m dscp --dscp 0 -J MARK-POLICY
# Set our "already set" bit, and store things back into conntrack
-A APPLY-MARKS -J MARK --set-mark 0x80/0x80
-A APPLY-MARKS -J CONNMARK --save-mark --nfmask 0xFF --ctmask 0xFF
# DSCP-based fwmark setting
# add whatever DSCP coverage you want here; the below is CAKE's diffserv4
# fwmark offsets into CAKE's tin_order, and are 1-indexed, so tin selection
# is offset by 1, and bulk and besteffort are swapped compared to the array
# in sch_cake.c
# We assume CAKE is configured with an fwmark mask of 0x7F
-A MARK-DSCP -m dscp --dscp 1 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 4 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 8 -J MARK --set-mark 0x1/0x7F
-A MARK-DSCP -m dscp --dscp 16 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 18 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 20 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 22 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 24 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 26 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 28 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 30 -J MARK --set-mark 0x3/0x7F
-A MARK-DSCP -m dscp --dscp 34 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 36 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 38 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 32 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 40 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 44 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 46 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 48 -J MARK --set-mark 0x4/0x7F
-A MARK-DSCP -m dscp --dscp 56 -J MARK --set-mark 0x4/0x7F
# default best effort
-A MARK-DSCP -m mark --mark 0x0/0x7F -J --set-mark 0x2/0x7F
# Policy-based fwmark setting
-A MARK-POLICY -p tcp -s 192.168.219.5 -m comment --comment "Skybox Bulk" -j MARK --set-mark 0x1/0x7F
# add more policy rules here; anything not marked will use the DSCP bits
# of each packet; optional catch-all to avoid that and make everything BE:
#-A MARK-DSCP -m mark --mark 0x0/0x7F -J --set-mark 0x2/0x7F
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-08 14:03 ` Kevin Darbyshire-Bryant
2019-03-09 14:08 ` Toke Høiland-Jørgensen
@ 2019-03-09 20:21 ` John Sager
1 sibling, 0 replies; 19+ messages in thread
From: John Sager @ 2019-03-09 20:21 UTC (permalink / raw)
To: cake
I wonder if you've dismissed eBPF too quickly. Reading around the subject
that's the way the kernel seems to be going for both network actions and
various other purposes. I wonder if passing the info about cake could be
done via eBPF maps. I can't see your original eBPF example at it's
disappeared off github.
John
On 08/03/2019 14:03, Kevin Darbyshire-Bryant wrote:
>
>
> OK, what I am trying to do is classify incoming connections into relevant cake tins to impose some bandwidth fairness. e.g. classify bittorrent & things that are downloads into the Bulk tin, and prioritise stuff that is streaming video into the Video tin. Incoming DSCP has a) been washed and b) is unreliable anyway so is unhelpful in this case. iptables runs too late, so having rules to classify incoming stuff is pointless.
>
> tc filters run early enough to use the tc skbedit major/minor number to influence cake’s tin decisions. But tc filters, a) don’t get to see de-natted ipv4 addresses, b) daisy chain, so all filters must be traversed. I can’t find my original tc filter ‘de-prio bittorrent’ but it was a very simple ‘does this destination port match?, yes skbedit to select bulk tin’ - I wanted to do more but the daisy chaining & lack of de-natting made this technique useless.
>
> Then I recently discovered act_connmark (http://linux-ip.net/gl/tc-filters/tc-filters-node2.html) - the thinking being I could use iptables on egress to set fwmarks to classify a connection and have the ingress packets magically follow. This worked but still required 3 tc filter actions to cope with 4 tins:
>
> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x01 fw action skbedit priority ${MAJOR}1
> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x03 fw action skbedit priority ${MAJOR}3
> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x04 fw action skbedit priority ${MAJOR}4
>
> It also requires similar tc filters on the egress path in addition to the iptables rules.
>
> Could that be improved? Yes, sort of. eBPF to the rescue-ish. I could write an eBPF classifier action program to directly copy the fwmark to the priority field which cake would pick up. I would have stopped there but as I’ve said in a previous email, the eBPF needed to know (hard code) the cake instance major numbers and there was the whole mystery tour of writing/building it.
>
> The other problem with the above magic tin encode into the fwmark routine is that it ignored any good citizens that were using the correct DSCP (e.g. dropbear). I would need to write iptables rules to classify existing DSCP codepoints into the matching tin for fwmark. So ideally I needed the DSCP to drive things and still act as a key into the fwmark mechanism.
>
> The overriding (if required) of DSCP could be done in iptables and to avoid going through the iptables DSCP decision/mangling for every packet I could use a flag within the fwmark to indicate the decision had previously been made and stored for this connection.
>
>
> The current rules are:
>
> # Configure iptables chain to mark packets
> ipt -t mangle -N QOS_MARK_${IFACE}
>
> # Change DSCP of initial relevant hosts/packets - this will be picked up by cake+ and placed in the firewall connmark
> # also the DSCP is used as the tin selector.
>
> iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.5 -m comment --comment "Skybox DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.5 -m comment --comment "Skybox DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.10 -m comment --comment "Bluray DSCP CS2 Video" -j DSCP --set-dscp-class CS2
> iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.10 -m comment --comment "Bluray DSCP CS2 Video" -j DSCP --set-dscp-class CS2
> iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> iptables -t mangle -A QOS_MARK_${IFACE} -p udp -s 192.168.219.12 -m udp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --dport 4443 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> iptables -t mangle -A QOS_MARK_${IFACE} -p tcp -s 192.168.219.12 -m tcp --dport 443 -m comment --comment "HTTPS uploads DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
>
> iptables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Bulk4 dst -j DSCP --set-dscp-class CS1 -m comment --comment "Bulk CS1 ipset"
> iptables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Vid4 dst -j DSCP --set-dscp-class CS2 -m comment --comment "Vid CS2 ipset"
> iptables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Voice4 dst -j DSCP --set-dscp-class CS6 -m comment --comment "Voice CS6 ipset"
>
> ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> ip6tables -t mangle -A QOS_MARK_${IFACE} -p udp -s ::c/::ffff:ffff:ffff:ffff -m udp --sport 6981 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --dport 4443 -m comment --comment "BT DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
> ip6tables -t mangle -A QOS_MARK_${IFACE} -p tcp -s ::c/::ffff:ffff:ffff:ffff -m tcp --dport 443 -m comment --comment "HTTPS uploads DSCP CS1 Bulk" -j DSCP --set-dscp-class CS1
>
> ip6tables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Bulk6 dst -j DSCP --set-dscp-class CS1 -m comment --comment "Bulk CS1 ipset"
> ip6tables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Vid6 dst -j DSCP --set-dscp-class CS2 -m comment --comment "Vid CS2 ipset"
> ip6tables -t mangle -A QOS_MARK_${IFACE} -m set --match-set Voice6 dst -j DSCP --set-dscp-class CS6 -m comment --comment "Voice CS6 ipset"
>
> # Send cake+ unmarked connections to the marking chain - Cake+ uses top byte as the
> # i've been marked & here's the dscp placeholder.
> # top 6 bits are DSCP, LSB is DSCP is valid flag
> ipt -t mangle -A PREROUTING -i $IFACE -m mark --mark 0x00/0x01000000 -g QOS_MARK_${IFACE}
> ipt -t mangle -A POSTROUTING -o $IFACE -m mark --mark 0x00/0x01000000 -g QOS_MARK_${IFACE}
>
>
> The initial egress packet for a connection will go through the above chain (--mark 0x00/0x01000000 -g QOS_MARK_${IFACE}) where the DSCP value is change if required.
>
> Cake will see this initial packet, inspect the fwmark, and because it hasn’t been set will both copy the dscp into the mark and set the ‘fwdscp marked’ bit.
>
> Subsequent egress packets will neither go through the iptables DSCP mangle or the cake ‘update the fwmark’ routine. Instead, cake will use the fwmark as the tin selector.
>
>
> The ingress path is different. First off act_connmark restores any connection mark to the packet. Cake will inspect the fwmark for the ‘fwdscp marked’ bit. If it is set, then the dscp coded in the firewall mark is used for tin selection. Optionally the encoded DSCP is restored to the packet’s diffserv, but I personally don’t use that functionality as I’m only interested in ’tin fair’ use of the link. And that’s it.
>
> I’m doing 2 things.
>
> 1) Classifying traffic into tins on ingress based on the egress DSCP values contained in fwmarks.
> 2) Basing the fwmark contained DSCP on the initial packet of the connection, possibly after being modified once by iptables rules.
>
>
>>
>> In particular, requirement 2 is why I'm pushing back against hard-coding
>> a mask anywhere…
>
> I think with ‘fwmark mask’, ‘get_dscp’, ’set_dscp’, ‘get_state mask’, ’set_state mask’ nothing *is* hard coded.
>
>>
>> So could you maybe post your current ruleset and explain what it is you
>> are trying to achieve at a high level, and why? :)
>
> I hope I’ve done that.
>
>>
>> Also, you keep mentioning "must be lighter on CPU". Do you have any
>> performance numbers to show the impact of your current ruleset? Would be
>> easier to assess any performance impact if we have some baseline numbers
>> to compare against…
>
> Let me see if I can quantify that in some way.
>
>>
>> -Toke
>
>
> Cheers,
>
> Kevin D-B
>
> 012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
>
> _______________________________________________
> Cake mailing list
> Cake@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cake
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-09 14:08 ` Toke Høiland-Jørgensen
@ 2019-03-10 15:21 ` Kevin Darbyshire-Bryant
2019-03-10 23:56 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-10 15:21 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: cake
> On 9 Mar 2019, at 14:08, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>
>> OK, what I am trying to do is classify incoming connections into
>> relevant cake tins to impose some bandwidth fairness. e.g. classify
>> bittorrent & things that are downloads into the Bulk tin, and
>> prioritise stuff that is streaming video into the Video tin. Incoming
>> DSCP has a) been washed and b) is unreliable anyway so is unhelpful in
>> this case. iptables runs too late, so having rules to classify
>> incoming stuff is pointless.
>
> Right, I see.
>
> [... snip .. ]
>
>> Then I recently discovered act_connmark
>> (http://linux-ip.net/gl/tc-filters/tc-filters-node2.html) - the
>> thinking being I could use iptables on egress to set fwmarks to
>> classify a connection and have the ingress packets magically follow.
>> This worked but still required 3 tc filter actions to cope with 4
>> tins:
>>
>> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x01 fw action skbedit priority ${MAJOR}1
>> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x03 fw action skbedit priority ${MAJOR}3
>> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x04 fw action skbedit priority ${MAJOR}4
>
> Right, so this can be replaced with the fwmark action we already added
> (and that I just pushed an update to so it supports masking the value
> before selecting a tin).
Yes. I’d point out the (hopefully obvious) that the flag mask needs to be one bit bigger than you might immediately think. e.g. diffserv4 needs to store 5 values (0-4), 3 bits. 0 is being used as an implied ’tin is not set, fall back to DSCP’. One could store DSCP+1 of course and use the same logic.
>
>> The overriding (if required) of DSCP could be done in iptables and to
>> avoid going through the iptables DSCP decision/mangling for every
>> packet I could use a flag within the fwmark to indicate the decision
>> had previously been made and stored for this connection.
>
> [ ... ]
>
>> I’m doing 2 things.
>>
>> 1) Classifying traffic into tins on ingress based on the egress DSCP
>> values contained in fwmarks.
>>
>> 2) Basing the fwmark contained DSCP on the initial packet of the
>> connection, possibly after being modified once by iptables rules.
>
> So I tried prototyping what it would actually look like to do all this
> in iptables. The result is below (in iptables-restore format). I haven't
> tested it, but I believe something along the lines of this will work,
> when used along with the CAKE fwmark support (setting a mask of 0xFF
> when configuring CAKE).
>
> Now, the obvious eyesore on this is the need to replicate CAKEs diffserv
> mappings in iptables rules (21 rules in this case, for the diffserv4
> mapping). As long as this only runs once per connection I don't actually
> think it's much of a performance issue for normal use, but obviously
> there could be pathological cases, and it's also annoying to have to do
> that.
>
> So, first question becomes: Do you agree that the firewall rules below
> would solve your use case (ignoring the ugliness of having to replicate
> the diffserv parsing in iptables)? Or am I missing something?
I’ve had a quick look over it and think it would work.
The ugliness of doing the diffserv parsing is what prompted the idea of storing the DSCP directly and I felt the stored tin selection was effectively abstracting the diffserv field anyway. Storing the DSCP is more compatible with differing egress v ingress mappings (eg. egress diffserv4, ingress diffserv3 though I can’t really think of a use case for that)
Of course using fwmark as tin number selector in cake doesn’t preclude some other mechanism of storing & recovering DSCP to/from firewall mark e.g. the previously discussed act-connmark+dscp which would help anyone who wanted to do such ‘link traversing’ DSCP shenanigans. That of course makes you happier since cake doesn’t embed itself further into conntrack.
>
> -Toke
>
>
>
> *mangle
> :PREROUTING ACCEPT [0:0]
> :APPLY-MARKS - [0:0]
> :MARK-DSCP - [0:0]
> :MARK-POLICY - [0:0]
>
> # Run on inside iface - eth0 in this case
> -A PREROUTING -i eth0 -J APPLY-MARKS
>
> # Make sure we have the marks from conntrack
> -A APPLY-MARKS -J CONNMARK --restore-mark --nfmask 0xFF --ctmask 0xFF
>
> # Abort if our "already set" bit is set
> -A APPLY-MARKS -m mark --mark 0x80/0x80 -j RETURN
>
> # If a DSCP value is set, use DSCP-based marking
> -A APPLY-MARKS -m dscp ! --dscp 0 -J MARK-DSCP
>
> # Otherwise, set our own policy
> -A APPLY-MARKS -m dscp --dscp 0 -J MARK-POLICY
>
> # Set our "already set" bit, and store things back into conntrack
> -A APPLY-MARKS -J MARK --set-mark 0x80/0x80
> -A APPLY-MARKS -J CONNMARK --save-mark --nfmask 0xFF --ctmask 0xFF
>
> # DSCP-based fwmark setting
> # add whatever DSCP coverage you want here; the below is CAKE's diffserv4
> # fwmark offsets into CAKE's tin_order, and are 1-indexed, so tin selection
> # is offset by 1, and bulk and besteffort are swapped compared to the array
> # in sch_cake.c
> # We assume CAKE is configured with an fwmark mask of 0x7F
>
> -A MARK-DSCP -m dscp --dscp 1 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 4 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 8 -J MARK --set-mark 0x1/0x7F
> -A MARK-DSCP -m dscp --dscp 16 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 18 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 20 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 22 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 24 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 26 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 28 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 30 -J MARK --set-mark 0x3/0x7F
> -A MARK-DSCP -m dscp --dscp 34 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 36 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 38 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 32 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 40 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 44 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 46 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 48 -J MARK --set-mark 0x4/0x7F
> -A MARK-DSCP -m dscp --dscp 56 -J MARK --set-mark 0x4/0x7F
> # default best effort
> -A MARK-DSCP -m mark --mark 0x0/0x7F -J --set-mark 0x2/0x7F
>
>
> # Policy-based fwmark setting
> -A MARK-POLICY -p tcp -s 192.168.219.5 -m comment --comment "Skybox Bulk" -j MARK --set-mark 0x1/0x7F
> # add more policy rules here; anything not marked will use the DSCP bits
> # of each packet; optional catch-all to avoid that and make everything BE:
> #-A MARK-DSCP -m mark --mark 0x0/0x7F -J --set-mark 0x2/0x7F
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-10 15:21 ` Kevin Darbyshire-Bryant
@ 2019-03-10 23:56 ` Toke Høiland-Jørgensen
2019-03-11 10:51 ` Kevin Darbyshire-Bryant
0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-10 23:56 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant; +Cc: cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>> On 9 Mar 2019, at 14:08, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>>
>>> OK, what I am trying to do is classify incoming connections into
>>> relevant cake tins to impose some bandwidth fairness. e.g. classify
>>> bittorrent & things that are downloads into the Bulk tin, and
>>> prioritise stuff that is streaming video into the Video tin. Incoming
>>> DSCP has a) been washed and b) is unreliable anyway so is unhelpful in
>>> this case. iptables runs too late, so having rules to classify
>>> incoming stuff is pointless.
>>
>> Right, I see.
>>
>> [... snip .. ]
>>
>>> Then I recently discovered act_connmark
>>> (http://linux-ip.net/gl/tc-filters/tc-filters-node2.html) - the
>>> thinking being I could use iptables on egress to set fwmarks to
>>> classify a connection and have the ingress packets magically follow.
>>> This worked but still required 3 tc filter actions to cope with 4
>>> tins:
>>>
>>> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x01 fw action skbedit priority ${MAJOR}1
>>> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x03 fw action skbedit priority ${MAJOR}3
>>> $TC filter add dev $IFACE parent $MAJOR protocol ip handle 0x04 fw action skbedit priority ${MAJOR}4
>>
>> Right, so this can be replaced with the fwmark action we already added
>> (and that I just pushed an update to so it supports masking the value
>> before selecting a tin).
>
> Yes. I’d point out the (hopefully obvious) that the flag mask needs
> to be one bit bigger than you might immediately think. e.g. diffserv4
> needs to store 5 values (0-4), 3 bits. 0 is being used as an implied
> ’tin is not set, fall back to DSCP’. One could store DSCP+1 of course
> and use the same logic.
Yeah, or one can just squat on a whole byte like I do in the example ;)
>>
>>> The overriding (if required) of DSCP could be done in iptables and to
>>> avoid going through the iptables DSCP decision/mangling for every
>>> packet I could use a flag within the fwmark to indicate the decision
>>> had previously been made and stored for this connection.
>>
>> [ ... ]
>>
>>> I’m doing 2 things.
>>>
>>> 1) Classifying traffic into tins on ingress based on the egress DSCP
>>> values contained in fwmarks.
>>>
>>> 2) Basing the fwmark contained DSCP on the initial packet of the
>>> connection, possibly after being modified once by iptables rules.
>>
>> So I tried prototyping what it would actually look like to do all this
>> in iptables. The result is below (in iptables-restore format). I haven't
>> tested it, but I believe something along the lines of this will work,
>> when used along with the CAKE fwmark support (setting a mask of 0xFF
>> when configuring CAKE).
>>
>> Now, the obvious eyesore on this is the need to replicate CAKEs diffserv
>> mappings in iptables rules (21 rules in this case, for the diffserv4
>> mapping). As long as this only runs once per connection I don't actually
>> think it's much of a performance issue for normal use, but obviously
>> there could be pathological cases, and it's also annoying to have to do
>> that.
>>
>> So, first question becomes: Do you agree that the firewall rules below
>> would solve your use case (ignoring the ugliness of having to replicate
>> the diffserv parsing in iptables)? Or am I missing something?
>
> I’ve had a quick look over it and think it would work.
>
> The ugliness of doing the diffserv parsing is what prompted the idea
> of storing the DSCP directly and I felt the stored tin selection was
> effectively abstracting the diffserv field anyway.
Right, but that means that the CAKE interpretation of the fwmark would
have to change from something that selects the tin, to something that is
treated as a DSCP mark. I think this was the part that I was missing
before. I don't think this is a good idea, as that means we tie the
marks to one particular use case.
> Storing the DSCP is more compatible with differing egress v ingress
> mappings (eg. egress diffserv4, ingress diffserv3 though I can’t
> really think of a use case for that)
I think that if someone wants to do something like that, we are way out
of "simple use case that we want to actively support" territory, and can
legitimately ask people to go write a BPF filter or something :)
> Of course using fwmark as tin number selector in cake doesn’t preclude
> some other mechanism of storing & recovering DSCP to/from firewall
> mark e.g. the previously discussed act-connmark+dscp which would help
> anyone who wanted to do such ‘link traversing’ DSCP shenanigans. That
> of course makes you happier since cake doesn’t embed itself further
> into conntrack.
Yeah, I definitely don't think CAKE has any business writing DSCP values
into the mark. However, as I said before, there may be a case for adding
an option to write the tin selection back to conntrack. Something like
the patch below would do it (with an option to control it, of course),
but it does incur a dependency on another conntrack header, so I'm not
sure if it will be acceptable to upstream. Also, we would need to figure
out how the option should work.
The alternative would be to use another mechanism; the iptables rules
replication is one example. Another could be adding a conntrack helper
to eBPF...
-Toke
diff --git a/sch_cake.c b/sch_cake.c
index a8fa224..c6b7dd9 100644
--- a/sch_cake.c
+++ b/sch_cake.c
@@ -78,6 +78,7 @@
#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_ecache.h>
#include <net/netfilter/nf_conntrack_zones.h>
#include <net/netfilter/nf_conntrack.h>
#endif
@@ -1646,6 +1647,27 @@ static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
}
}
+static void cake_set_tin_connmark(struct cake_sched_data *q,
+ struct sk_buff *skb, u32 tin)
+{
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
+ u32 newmark;
+
+ ct = nf_ct_get(skb, &ctinfo);
+ if (ct) {
+ newmark = (ct->mark & ~q->fwmark_mask);
+ newmark ^= (tin << q->fwmark_shft) & q->fwmark_mask;
+
+ if (ct->mark != newmark) {
+ ct->mark = newmark;
+ nf_conntrack_event_cache(IPCT_MARK, ct);
+ }
+ }
+#endif
+}
+
static struct cake_tin_data *cake_select_tin(struct Qdisc *sch,
struct sk_buff *skb)
{
@@ -1678,6 +1700,8 @@ static struct cake_tin_data *cake_select_tin(struct Qdisc *sch,
tin = 0;
}
+ cake_set_tin_connmark(q, skb, tin);
+
return &q->tins[tin];
}
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-10 23:56 ` Toke Høiland-Jørgensen
@ 2019-03-11 10:51 ` Kevin Darbyshire-Bryant
2019-03-11 13:00 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-11 10:51 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: cake
I’m tired after the last 3 days working on $paidwork (4 jobs in 3 days) so can’t think straight, anyway...
> On 10 Mar 2019, at 23:56, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>
<some context snipped>
>>
>> Yes. I’d point out the (hopefully obvious) that the flag mask needs
>> to be one bit bigger than you might immediately think. e.g. diffserv4
>> needs to store 5 values (0-4), 3 bits. 0 is being used as an implied
>> ’tin is not set, fall back to DSCP’. One could store DSCP+1 of course
>> and use the same logic.
>
> Yeah, or one can just squat on a whole byte like I do in the example ;)
As do I for my ‘DSCP + set/not set flag’ :-)
>
>>
>> The ugliness of doing the diffserv parsing is what prompted the idea
>> of storing the DSCP directly and I felt the stored tin selection was
>> effectively abstracting the diffserv field anyway.
>
> Right, but that means that the CAKE interpretation of the fwmark would
> have to change from something that selects the tin, to something that is
> treated as a DSCP mark. I think this was the part that I was missing
> before. I don't think this is a good idea, as that means we tie the
> marks to one particular use case.
I did say it was incompatible with the existing tin storing idea right from day 1 and why I was so keen/worried by the fact it had already gone upstream.
>
>> Storing the DSCP is more compatible with differing egress v ingress
>> mappings (eg. egress diffserv4, ingress diffserv3 though I can’t
>> really think of a use case for that)
>
> I think that if someone wants to do something like that, we are way out
> of "simple use case that we want to actively support" territory, and can
> legitimately ask people to go write a BPF filter or something :)
>
>> Of course using fwmark as tin number selector in cake doesn’t preclude
>> some other mechanism of storing & recovering DSCP to/from firewall
>> mark e.g. the previously discussed act-connmark+dscp which would help
>> anyone who wanted to do such ‘link traversing’ DSCP shenanigans. That
>> of course makes you happier since cake doesn’t embed itself further
>> into conntrack.
>
> Yeah, I definitely don't think CAKE has any business writing DSCP values
> into the mark. However, as I said before, there may be a case for adding
I’m going forward on the basis that *cake* ’storing DSCP within fwmark’ idea has died and will be trying to cobble something together within act_connmark as that has a wider potential client base than just cake.
> an option to write the tin selection back to conntrack. Something like
> the patch below would do it (with an option to control it, of course),
> but it does incur a dependency on another conntrack header, so I'm not
> sure if it will be acceptable to upstream. Also, we would need to figure
> out how the option should work.
I think before expending any further mental effort on that, the question should be asked of the kernel people. I don’t see the point in working out the semantics of something that ultimately won’t be accepted.
> The alternative would be to use another mechanism; the iptables rules
> replication is one example. Another could be adding a conntrack helper
> to eBPF…
>
> -Toke
>
>
> diff --git a/sch_cake.c b/sch_cake.c
> index a8fa224..c6b7dd9 100644
> --- a/sch_cake.c
> +++ b/sch_cake.c
> @@ -78,6 +78,7 @@
>
> #if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> #include <net/netfilter/nf_conntrack_core.h>
> +#include <net/netfilter/nf_conntrack_ecache.h>
> #include <net/netfilter/nf_conntrack_zones.h>
> #include <net/netfilter/nf_conntrack.h>
> #endif
> @@ -1646,6 +1647,27 @@ static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
> }
> }
>
> +static void cake_set_tin_connmark(struct cake_sched_data *q,
> + struct sk_buff *skb, u32 tin)
> +{
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> + enum ip_conntrack_info ctinfo;
> + struct nf_conn *ct;
> + u32 newmark;
> +
> + ct = nf_ct_get(skb, &ctinfo);
> + if (ct) {
> + newmark = (ct->mark & ~q->fwmark_mask);
> + newmark ^= (tin << q->fwmark_shft) & q->fwmark_mask;
> +
> + if (ct->mark != newmark) {
> + ct->mark = newmark;
> + nf_conntrack_event_cache(IPCT_MARK, ct);
> + }
> + }
> +#endif
> +}
> +
> static struct cake_tin_data *cake_select_tin(struct Qdisc *sch,
> struct sk_buff *skb)
> {
> @@ -1678,6 +1700,8 @@ static struct cake_tin_data *cake_select_tin(struct Qdisc *sch,
> tin = 0;
> }
>
> + cake_set_tin_connmark(q, skb, tin);
> +
> return &q->tins[tin];
> }
Cheers,
Kevin D-B
012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-11 10:51 ` Kevin Darbyshire-Bryant
@ 2019-03-11 13:00 ` Toke Høiland-Jørgensen
2019-03-11 14:11 ` Kevin Darbyshire-Bryant
0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-11 13:00 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant; +Cc: cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>>> The ugliness of doing the diffserv parsing is what prompted the idea
>>> of storing the DSCP directly and I felt the stored tin selection was
>>> effectively abstracting the diffserv field anyway.
>>
>> Right, but that means that the CAKE interpretation of the fwmark would
>> have to change from something that selects the tin, to something that is
>> treated as a DSCP mark. I think this was the part that I was missing
>> before. I don't think this is a good idea, as that means we tie the
>> marks to one particular use case.
>
> I did say it was incompatible with the existing tin storing idea right
> from day 1 and why I was so keen/worried by the fact it had already
> gone upstream.
Yeah, but I didn't connect the dots for how this related to the CAKE use
case until I saw your full example...
>>> Storing the DSCP is more compatible with differing egress v ingress
>>> mappings (eg. egress diffserv4, ingress diffserv3 though I can’t
>>> really think of a use case for that)
>>
>> I think that if someone wants to do something like that, we are way out
>> of "simple use case that we want to actively support" territory, and can
>> legitimately ask people to go write a BPF filter or something :)
>>
>>> Of course using fwmark as tin number selector in cake doesn’t preclude
>>> some other mechanism of storing & recovering DSCP to/from firewall
>>> mark e.g. the previously discussed act-connmark+dscp which would help
>>> anyone who wanted to do such ‘link traversing’ DSCP shenanigans. That
>>> of course makes you happier since cake doesn’t embed itself further
>>> into conntrack.
>>
>> Yeah, I definitely don't think CAKE has any business writing DSCP values
>> into the mark. However, as I said before, there may be a case for adding
>
> I’m going forward on the basis that *cake* ’storing DSCP within
> fwmark’ idea has died and will be trying to cobble something together
> within act_connmark as that has a wider potential client base than
> just cake.
Yup, sounds good.
BTW, for future reference, while digging through the TC eBPF code, I
discovered that it is (I think) possible to indirectly pass information
to the eBPF program: If you add a classid when adding the bpf filter,
the kernel will put that into skb->tc_classid before executing the eBPF
program. This serves as a default if you don't do any classification
from the eBPF program, but presumably you can also just read and modify
it from the eBPF program...
>> an option to write the tin selection back to conntrack. Something like
>> the patch below would do it (with an option to control it, of course),
>> but it does incur a dependency on another conntrack header, so I'm not
>> sure if it will be acceptable to upstream. Also, we would need to figure
>> out how the option should work.
>
> I think before expending any further mental effort on that, the
> question should be asked of the kernel people. I don’t see the point
> in working out the semantics of something that ultimately won’t be
> accepted.
Unfortunately that is not how upstream development works. If we want
this feature, we're going to have to do the work of defining and
implementing it and make our case that it should be included.
I think a possible way forward would be:
1. I'll submit the masking patch for the existing fwmark feature to
upstream. That should get in this cycle, and will form the basis for
anything else we want to do in the future.
2. For the next cycle (i.e., upstream 5.2) we can hash out what we
really need for this use case, and how it should work from a user
perspective; we're going to need that anyway even if we end up
implementing it differently. This also includes deciding on whether
this is doable outside of CAKE, or if we need another feature flag.
3. We submit whatever we come up with. If it gets accepted, great,
otherwise we iterate based on feedback.
Any objections? :)
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-11 13:00 ` Toke Høiland-Jørgensen
@ 2019-03-11 14:11 ` Kevin Darbyshire-Bryant
2019-03-11 14:32 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Darbyshire-Bryant @ 2019-03-11 14:11 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: cake
> On 11 Mar 2019, at 13:00, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>
>>>> The ugliness of doing the diffserv parsing is what prompted the idea
>>>> of storing the DSCP directly and I felt the stored tin selection was
>>>> effectively abstracting the diffserv field anyway.
>>>
>>> Right, but that means that the CAKE interpretation of the fwmark would
>>> have to change from something that selects the tin, to something that is
>>> treated as a DSCP mark. I think this was the part that I was missing
>>> before. I don't think this is a good idea, as that means we tie the
>>> marks to one particular use case.
>>
>> I did say it was incompatible with the existing tin storing idea right
>> from day 1 and why I was so keen/worried by the fact it had already
>> gone upstream.
>
> Yeah, but I didn't connect the dots for how this related to the CAKE use
> case until I saw your full example...
>
>>>> Storing the DSCP is more compatible with differing egress v ingress
>>>> mappings (eg. egress diffserv4, ingress diffserv3 though I can’t
>>>> really think of a use case for that)
>>>
>>> I think that if someone wants to do something like that, we are way out
>>> of "simple use case that we want to actively support" territory, and can
>>> legitimately ask people to go write a BPF filter or something :)
>>>
>>>> Of course using fwmark as tin number selector in cake doesn’t preclude
>>>> some other mechanism of storing & recovering DSCP to/from firewall
>>>> mark e.g. the previously discussed act-connmark+dscp which would help
>>>> anyone who wanted to do such ‘link traversing’ DSCP shenanigans. That
>>>> of course makes you happier since cake doesn’t embed itself further
>>>> into conntrack.
>>>
>>> Yeah, I definitely don't think CAKE has any business writing DSCP values
>>> into the mark. However, as I said before, there may be a case for adding
>>
>> I’m going forward on the basis that *cake* ’storing DSCP within
>> fwmark’ idea has died and will be trying to cobble something together
>> within act_connmark as that has a wider potential client base than
>> just cake.
>
> Yup, sounds good.
>
> BTW, for future reference, while digging through the TC eBPF code, I
> discovered that it is (I think) possible to indirectly pass information
> to the eBPF program: If you add a classid when adding the bpf filter,
> the kernel will put that into skb->tc_classid before executing the eBPF
> program. This serves as a default if you don't do any classification
> from the eBPF program, but presumably you can also just read and modify
> it from the eBPF program…
I’ll try to remember that if I’m ever in the eBPF vicinity.
>
>>> an option to write the tin selection back to conntrack. Something like
>>> the patch below would do it (with an option to control it, of course),
>>> but it does incur a dependency on another conntrack header, so I'm not
>>> sure if it will be acceptable to upstream. Also, we would need to figure
>>> out how the option should work.
>>
>> I think before expending any further mental effort on that, the
>> question should be asked of the kernel people. I don’t see the point
>> in working out the semantics of something that ultimately won’t be
>> accepted.
>
> Unfortunately that is not how upstream development works. If we want
> this feature, we're going to have to do the work of defining and
> implementing it and make our case that it should be included.
>
> I think a possible way forward would be:
>
> 1. I'll submit the masking patch for the existing fwmark feature to
> upstream. That should get in this cycle, and will form the basis for
> anything else we want to do in the future.
Agreed. Masking should have been in the original patch. Had I known about (and been amused by) ffs it would have been done :-)
>
> 2. For the next cycle (i.e., upstream 5.2) we can hash out what we
> really need for this use case, and how it should work from a user
> perspective; we're going to need that anyway even if we end up
> implementing it differently. This also includes deciding on whether
> this is doable outside of CAKE, or if we need another feature flag.
You’ve already clearly demonstrated that the marking/flagging can be done using iptables, so it is doable outside of CAKE. The iptables rules to do it outside of CAKE are…an unpleasant way of implementing a lookup table which also requires knowledge of the inner workings of CAKE (lookup table & tin order). Personally I feel that’s ugly, but you already know that :-)
>
> 3. We submit whatever we come up with. If it gets accepted, great,
> otherwise we iterate based on feedback.
I await the tarring & feathering followed by “You’re trying to set a conntrack mark……from a QDISC?!!!!!! get outta town” :-)
>
> Any objections? :)
Nope.
>
> -Toke
Kevin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Cake] act_connmark + dscp
2019-03-11 14:11 ` Kevin Darbyshire-Bryant
@ 2019-03-11 14:32 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-03-11 14:32 UTC (permalink / raw)
To: Kevin Darbyshire-Bryant; +Cc: cake
Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>> On 11 Mar 2019, at 13:00, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> writes:
>>
>>>>> The ugliness of doing the diffserv parsing is what prompted the idea
>>>>> of storing the DSCP directly and I felt the stored tin selection was
>>>>> effectively abstracting the diffserv field anyway.
>>>>
>>>> Right, but that means that the CAKE interpretation of the fwmark would
>>>> have to change from something that selects the tin, to something that is
>>>> treated as a DSCP mark. I think this was the part that I was missing
>>>> before. I don't think this is a good idea, as that means we tie the
>>>> marks to one particular use case.
>>>
>>> I did say it was incompatible with the existing tin storing idea right
>>> from day 1 and why I was so keen/worried by the fact it had already
>>> gone upstream.
>>
>> Yeah, but I didn't connect the dots for how this related to the CAKE use
>> case until I saw your full example...
>>
>>>>> Storing the DSCP is more compatible with differing egress v ingress
>>>>> mappings (eg. egress diffserv4, ingress diffserv3 though I can’t
>>>>> really think of a use case for that)
>>>>
>>>> I think that if someone wants to do something like that, we are way out
>>>> of "simple use case that we want to actively support" territory, and can
>>>> legitimately ask people to go write a BPF filter or something :)
>>>>
>>>>> Of course using fwmark as tin number selector in cake doesn’t preclude
>>>>> some other mechanism of storing & recovering DSCP to/from firewall
>>>>> mark e.g. the previously discussed act-connmark+dscp which would help
>>>>> anyone who wanted to do such ‘link traversing’ DSCP shenanigans. That
>>>>> of course makes you happier since cake doesn’t embed itself further
>>>>> into conntrack.
>>>>
>>>> Yeah, I definitely don't think CAKE has any business writing DSCP values
>>>> into the mark. However, as I said before, there may be a case for adding
>>>
>>> I’m going forward on the basis that *cake* ’storing DSCP within
>>> fwmark’ idea has died and will be trying to cobble something together
>>> within act_connmark as that has a wider potential client base than
>>> just cake.
>>
>> Yup, sounds good.
>>
>> BTW, for future reference, while digging through the TC eBPF code, I
>> discovered that it is (I think) possible to indirectly pass information
>> to the eBPF program: If you add a classid when adding the bpf filter,
>> the kernel will put that into skb->tc_classid before executing the eBPF
>> program. This serves as a default if you don't do any classification
>> from the eBPF program, but presumably you can also just read and modify
>> it from the eBPF program…
>
> I’ll try to remember that if I’m ever in the eBPF vicinity.
>
>
>>
>>>> an option to write the tin selection back to conntrack. Something like
>>>> the patch below would do it (with an option to control it, of course),
>>>> but it does incur a dependency on another conntrack header, so I'm not
>>>> sure if it will be acceptable to upstream. Also, we would need to figure
>>>> out how the option should work.
>>>
>>> I think before expending any further mental effort on that, the
>>> question should be asked of the kernel people. I don’t see the point
>>> in working out the semantics of something that ultimately won’t be
>>> accepted.
>>
>> Unfortunately that is not how upstream development works. If we want
>> this feature, we're going to have to do the work of defining and
>> implementing it and make our case that it should be included.
>>
>> I think a possible way forward would be:
>>
>> 1. I'll submit the masking patch for the existing fwmark feature to
>> upstream. That should get in this cycle, and will form the basis for
>> anything else we want to do in the future.
>
> Agreed. Masking should have been in the original patch. Had I known
> about (and been amused by) ffs it would have been done :-)
Indeed, I also had a chuckle over the function name :D
>> 2. For the next cycle (i.e., upstream 5.2) we can hash out what we
>> really need for this use case, and how it should work from a user
>> perspective; we're going to need that anyway even if we end up
>> implementing it differently. This also includes deciding on whether
>> this is doable outside of CAKE, or if we need another feature flag.
>
> You’ve already clearly demonstrated that the marking/flagging can be
> done using iptables, so it is doable outside of CAKE. The iptables
> rules to do it outside of CAKE are…an unpleasant way of implementing a
> lookup table which also requires knowledge of the inner workings of
> CAKE (lookup table & tin order). Personally I feel that’s ugly, but
> you already know that :-)
Yeah, you basically summed up the argument I am planning to use to get
it accepted :)
>> 3. We submit whatever we come up with. If it gets accepted, great,
>> otherwise we iterate based on feedback.
>
> I await the tarring & feathering followed by “You’re trying to set a
> conntrack mark……from a QDISC?!!!!!! get outta town” :-)
Haha, yup, should be fun!
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2019-03-11 14:32 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-05 14:35 [Cake] act_connmark + dscp Kevin Darbyshire-Bryant
2019-03-06 15:21 ` Toke Høiland-Jørgensen
2019-03-06 16:47 ` John Sager
2019-03-07 9:50 ` Toke Høiland-Jørgensen
2019-03-06 18:40 ` Kevin Darbyshire-Bryant
2019-03-07 10:10 ` Toke Høiland-Jørgensen
2019-03-07 15:56 ` Kevin Darbyshire-Bryant
2019-03-07 17:40 ` Toke Høiland-Jørgensen
2019-03-08 11:13 ` Kevin Darbyshire-Bryant
2019-03-08 11:28 ` Toke Høiland-Jørgensen
2019-03-08 14:03 ` Kevin Darbyshire-Bryant
2019-03-09 14:08 ` Toke Høiland-Jørgensen
2019-03-10 15:21 ` Kevin Darbyshire-Bryant
2019-03-10 23:56 ` Toke Høiland-Jørgensen
2019-03-11 10:51 ` Kevin Darbyshire-Bryant
2019-03-11 13:00 ` Toke Høiland-Jørgensen
2019-03-11 14:11 ` Kevin Darbyshire-Bryant
2019-03-11 14:32 ` Toke Høiland-Jørgensen
2019-03-09 20:21 ` John Sager
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox