* [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
[not found] <49ff39b1f55c914847cd58678bae6282112db701.1608836260.git.gnault@redhat.com>
@ 2020-12-29 3:18 ` Dave Taht
2020-12-29 13:33 ` Rodney W. Grimes
0 siblings, 1 reply; 5+ messages in thread
From: Dave Taht @ 2020-12-29 3:18 UTC (permalink / raw)
To: ECN-Sane
---------- Forwarded message ---------
From: Guillaume Nault <gnault@redhat.com>
Date: Thu, Dec 24, 2020 at 11:05 AM
Subject: [PATCH net] ipv4: Ignore ECN bits for fib lookups in
fib_compute_spec_dst()
To: David Miller <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>
Cc: <netdev@vger.kernel.org>
RT_TOS() only clears one of the ECN bits. Therefore, when
fib_compute_spec_dst() resorts to a fib lookup, it can return
different results depending on the value of the second ECN bit.
For example, ECT(0) and ECT(1) packets could be treated differently.
$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10/24 dev veth01
$ ip -netns ns1 address add 192.0.2.11/24 dev veth10
$ ip -netns ns1 address add 192.0.2.21/32 dev lo
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
$ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
(ping uses -Q to set all TOS and ECN bits):
$ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
because the "tos 4" route isn't matched:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
After this patch the ECN bits don't affect the result anymore:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
net/ipv4/fib_frontend.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index cdf6ec5aa45d..84bb707bd88d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -292,7 +292,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
.flowi4_iif = LOOPBACK_IFINDEX,
.flowi4_oif = l3mdev_master_ifindex_rcu(dev),
.daddr = ip_hdr(skb)->saddr,
- .flowi4_tos = RT_TOS(ip_hdr(skb)->tos),
+ .flowi4_tos = ip_hdr(skb)->tos & IPTOS_RT_MASK,
.flowi4_scope = scope,
.flowi4_mark = vmark ? skb->mark : 0,
};
--
2.21.3
--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman
dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
2020-12-29 3:18 ` [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() Dave Taht
@ 2020-12-29 13:33 ` Rodney W. Grimes
2020-12-29 15:21 ` Jonathan Morton
0 siblings, 1 reply; 5+ messages in thread
From: Rodney W. Grimes @ 2020-12-29 13:33 UTC (permalink / raw)
To: Dave Taht; +Cc: ECN-Sane
Dave,
Thanks for the heads up on this...
Comments in line about what, IMHO, is very bad practice.
Regards,
Rod
> ---------- Forwarded message ---------
> From: Guillaume Nault <gnault@redhat.com>
> Date: Thu, Dec 24, 2020 at 11:05 AM
> Subject: [PATCH net] ipv4: Ignore ECN bits for fib lookups in
> fib_compute_spec_dst()
> To: David Miller <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>
> Cc: <netdev@vger.kernel.org>
>
>
> RT_TOS() only clears one of the ECN bits. Therefore, when
> fib_compute_spec_dst() resorts to a fib lookup, it can return
> different results depending on the value of the second ECN bit.
>
> For example, ECT(0) and ECT(1) packets could be treated differently.
>
> $ ip netns add ns0
> $ ip netns add ns1
> $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
> $ ip -netns ns0 link set dev lo up
> $ ip -netns ns1 link set dev lo up
> $ ip -netns ns0 link set dev veth01 up
> $ ip -netns ns1 link set dev veth10 up
>
> $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
> $ ip -netns ns1 address add 192.0.2.11/24 dev veth10
>
> $ ip -netns ns1 address add 192.0.2.21/32 dev lo
> $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
> $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
>
> With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
> (ping uses -Q to set all TOS and ECN bits):
>
> $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
> [...]
> 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
>
> But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
> because the "tos 4" route isn't matched:
>
> $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
> [...]
> 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
>
> After this patch the ECN bits don't affect the result anymore:
>
> $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
> [...]
> 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
>
> Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.")
> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> ---
> net/ipv4/fib_frontend.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index cdf6ec5aa45d..84bb707bd88d 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -292,7 +292,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
> .flowi4_iif = LOOPBACK_IFINDEX,
> .flowi4_oif = l3mdev_master_ifindex_rcu(dev),
> .daddr = ip_hdr(skb)->saddr,
> - .flowi4_tos = RT_TOS(ip_hdr(skb)->tos),
> + .flowi4_tos = ip_hdr(skb)->tos & IPTOS_RT_MASK,
This is, IMHO, a bad way to fix this. It addresses the issue
by NOT using what should be the standard method of accessing
tos bits. And further if RT_TOS is every properly fixed leaves
a dangling place that would need to be cleaned up later.
I understand that RT_TOS is legacy and returns the whole byte,
that should probably be corrected as I suspect there are other
lurking cases of use of RT_TOS that is also still consuming
the ECN bits.
The above one line patch is the "fast and dirty" way to
do it, and leads to longer term maintainance issues.
I assert this Opinion from my experience in trying
to clean up IN_ADDR macros in the BSD's, 90% of what
needing patching was caused mostly by special cases
that didnt use the macros, and instead did there own
bit banging with | and &.
A better fix might be to create some new macros, make the
old macro emmit a compile time warning to weed out the
bad use cases:
a) Modify RT_TOS to assert a compile time warning to
flag old usage.
b) create RT_TOSBYTE, this returns the whole byte, used
to replace RT_TOS when the whole byte is needed and
dealt with properly.
c) create RT_TOSTOSBITS, this returns ONLY the TOS bits,
and probably should replace most of the current calls
to RT_TOS with this, after inspecting the code to
make sure it does the right thing.
d) create RT_TOSECNBITS, this returns ONLY the ECN bits
> .flowi4_scope = scope,
> .flowi4_mark = vmark ? skb->mark : 0,
> };
> --
> 2.21.3
>
>
>
> --
> "For a successful technology, reality must take precedence over public
> relations, for Mother Nature cannot be fooled" - Richard Feynman
>
> dave@taht.net <Dave T?ht> CTO, TekLibre, LLC Tel: 1-831-435-0729
> _______________________________________________
> Ecn-sane mailing list
> Ecn-sane@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/ecn-sane
>
>
--
Rod Grimes rgrimes@freebsd.org
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
2020-12-29 13:33 ` Rodney W. Grimes
@ 2020-12-29 15:21 ` Jonathan Morton
2020-12-29 20:13 ` Rodney W. Grimes
0 siblings, 1 reply; 5+ messages in thread
From: Jonathan Morton @ 2020-12-29 15:21 UTC (permalink / raw)
To: Rodney W. Grimes; +Cc: Dave Taht, ECN-Sane
> On 29 Dec, 2020, at 3:33 pm, Rodney W. Grimes <4bone@gndrsh.dnsmgr.net> wrote:
>
> A better fix might be to create some new macros, make the
> old macro emmit a compile time warning to weed out the
> bad use cases:
> a) Modify RT_TOS to assert a compile time warning to
> flag old usage.
>
> b) create RT_TOSBYTE, this returns the whole byte, used
> to replace RT_TOS when the whole byte is needed and
> dealt with properly.
>
> c) create RT_TOSTOSBITS, this returns ONLY the TOS bits,
> and probably should replace most of the current calls
> to RT_TOS with this, after inspecting the code to
> make sure it does the right thing.
>
> d) create RT_TOSECNBITS, this returns ONLY the ECN bits
Agree with the principle here, quibble with the terminology. Better names for these new macros:
RT_TOSBYTE
RT_TOS_DSFIELD
RT_TOS_ECNFIELD
This emphasises that the "TOS bits" and Precedence field have been absorbed into the Diffserv field for the past 20 years already, and that generally the talk is nowadays of a "field" or a "flag" rather than "bits".
I would seriously consider also making the DSFIELD macro return a right-justified value, matching most Diffserv documentation tables, rather than just masking out the left-justified position it occupies within the TOS byte. This requires a second macro to correctly insert a fresh right-justified value into the left-justified position, ideally without disturbing any ECN field value already present. There is some evidence that mishandling this is a significant source of bugs in user code, if not also in kernel code.
- Jonathan Morton
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
2020-12-29 15:21 ` Jonathan Morton
@ 2020-12-29 20:13 ` Rodney W. Grimes
2020-12-29 20:59 ` Jonathan Morton
0 siblings, 1 reply; 5+ messages in thread
From: Rodney W. Grimes @ 2020-12-29 20:13 UTC (permalink / raw)
To: Jonathan Morton; +Cc: Rodney W. Grimes, Dave Taht, ECN-Sane
> > On 29 Dec, 2020, at 3:33 pm, Rodney W. Grimes <4bone@gndrsh.dnsmgr.net> wrote:
> >
> > A better fix might be to create some new macros, make the
> > old macro emmit a compile time warning to weed out the
> > bad use cases:
> > a) Modify RT_TOS to assert a compile time warning to
> > flag old usage.
> >
> > b) create RT_TOSBYTE, this returns the whole byte, used
> > to replace RT_TOS when the whole byte is needed and
> > dealt with properly.
> >
> > c) create RT_TOSTOSBITS, this returns ONLY the TOS bits,
> > and probably should replace most of the current calls
> > to RT_TOS with this, after inspecting the code to
> > make sure it does the right thing.
> >
> > d) create RT_TOSECNBITS, this returns ONLY the ECN bits
>
> Agree with the principle here, quibble with the terminology. Better names for these new macros:
:-) Its, blue, I tell you that bike shed must be blue! :-)
>
> RT_TOSBYTE
> RT_TOS_DSFIELD
> RT_TOS_ECNFIELD
I agree that these are better names.
>
> This emphasises that the "TOS bits" and Precedence field have been absorbed into the Diffserv field for the past 20 years already, and that generally the talk is nowadays of a "field" or a "flag" rather than "bits".
>
> I would seriously consider also making the DSFIELD macro return a right-justified value, matching most Diffserv documentation tables, rather than just masking out the left-justified position it occupies within the TOS byte. This requires a second macro to correctly insert a fresh right-justified value into the left-justified position, ideally without disturbing any ECN field value already present. There is some evidence that mishandling this is a significant source of bugs in user code, if not also in kernel code.
That does make since but my concern is this would require more than just a drop in replacement of RT_TOS as now you would need to replace the macro and probably modify other nerby code.
If these are purely kernel macros that scope of work is well contained. Are these macro's exported from a header to userland visible code?
> - Jonathan Morton
--
Rod Grimes rgrimes@freebsd.org
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
2020-12-29 20:13 ` Rodney W. Grimes
@ 2020-12-29 20:59 ` Jonathan Morton
0 siblings, 0 replies; 5+ messages in thread
From: Jonathan Morton @ 2020-12-29 20:59 UTC (permalink / raw)
To: Rodney W. Grimes; +Cc: Dave Taht, ECN-Sane
> On 29 Dec, 2020, at 10:13 pm, Rodney W. Grimes <4bone@gndrsh.dnsmgr.net> wrote:
>
> If these are purely kernel macros that scope of work is well contained. Are these macro's exported from a header to userland visible code?
It appears RT_TOS is defined in a "uapi" header, that is, Userspace API. So yes, and this also means any bugs in it have propagated into userspace code built against it.
Actually, it looks like it is not a bug in the macro itself, but in its intended application, referring to an obsolete and very short-lived specification. It's extracting *specifically* the widened Type of Service bitfield, containing the non-orthogonal "flags" indicating Max Throughput, Min Latency, Max Reliability, and Min Cost (this last being the short-lived part of the spec, as Diffserv and ECN replaced it very quickly). The mask specifically excludes the Precedence field and thus the most-significant half of the Diffserv field.
- Jonathan Morton
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-12-29 20:59 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <49ff39b1f55c914847cd58678bae6282112db701.1608836260.git.gnault@redhat.com>
2020-12-29 3:18 ` [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() Dave Taht
2020-12-29 13:33 ` Rodney W. Grimes
2020-12-29 15:21 ` Jonathan Morton
2020-12-29 20:13 ` Rodney W. Grimes
2020-12-29 20:59 ` Jonathan Morton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox