* [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() [not found] <49ff39b1f55c914847cd58678bae6282112db701.1608836260.git.gnault@redhat.com> @ 2020-12-29 3:18 ` Dave Taht 2020-12-29 13:33 ` Rodney W. Grimes 0 siblings, 1 reply; 5+ messages in thread From: Dave Taht @ 2020-12-29 3:18 UTC (permalink / raw) To: ECN-Sane ---------- Forwarded message --------- From: Guillaume Nault <gnault@redhat.com> Date: Thu, Dec 24, 2020 at 11:05 AM Subject: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() To: David Miller <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org> Cc: <netdev@vger.kernel.org> RT_TOS() only clears one of the ECN bits. Therefore, when fib_compute_spec_dst() resorts to a fib lookup, it can return different results depending on the value of the second ECN bit. For example, ECT(0) and ECT(1) packets could be treated differently. $ ip netns add ns0 $ ip netns add ns1 $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1 $ ip -netns ns0 link set dev lo up $ ip -netns ns1 link set dev lo up $ ip -netns ns0 link set dev veth01 up $ ip -netns ns1 link set dev veth10 up $ ip -netns ns0 address add 192.0.2.10/24 dev veth01 $ ip -netns ns1 address add 192.0.2.11/24 dev veth10 $ ip -netns ns1 address add 192.0.2.21/32 dev lo $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21 $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0 With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21 (ping uses -Q to set all TOS and ECN bits): $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255 [...] 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11 because the "tos 4" route isn't matched: $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255 [...] 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms After this patch the ECN bits don't affect the result anymore: $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255 [...] 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.") Signed-off-by: Guillaume Nault <gnault@redhat.com> --- net/ipv4/fib_frontend.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index cdf6ec5aa45d..84bb707bd88d 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -292,7 +292,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb) .flowi4_iif = LOOPBACK_IFINDEX, .flowi4_oif = l3mdev_master_ifindex_rcu(dev), .daddr = ip_hdr(skb)->saddr, - .flowi4_tos = RT_TOS(ip_hdr(skb)->tos), + .flowi4_tos = ip_hdr(skb)->tos & IPTOS_RT_MASK, .flowi4_scope = scope, .flowi4_mark = vmark ? skb->mark : 0, }; -- 2.21.3 -- "For a successful technology, reality must take precedence over public relations, for Mother Nature cannot be fooled" - Richard Feynman dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() 2020-12-29 3:18 ` [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() Dave Taht @ 2020-12-29 13:33 ` Rodney W. Grimes 2020-12-29 15:21 ` Jonathan Morton 0 siblings, 1 reply; 5+ messages in thread From: Rodney W. Grimes @ 2020-12-29 13:33 UTC (permalink / raw) To: Dave Taht; +Cc: ECN-Sane Dave, Thanks for the heads up on this... Comments in line about what, IMHO, is very bad practice. Regards, Rod > ---------- Forwarded message --------- > From: Guillaume Nault <gnault@redhat.com> > Date: Thu, Dec 24, 2020 at 11:05 AM > Subject: [PATCH net] ipv4: Ignore ECN bits for fib lookups in > fib_compute_spec_dst() > To: David Miller <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org> > Cc: <netdev@vger.kernel.org> > > > RT_TOS() only clears one of the ECN bits. Therefore, when > fib_compute_spec_dst() resorts to a fib lookup, it can return > different results depending on the value of the second ECN bit. > > For example, ECT(0) and ECT(1) packets could be treated differently. > > $ ip netns add ns0 > $ ip netns add ns1 > $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1 > $ ip -netns ns0 link set dev lo up > $ ip -netns ns1 link set dev lo up > $ ip -netns ns0 link set dev veth01 up > $ ip -netns ns1 link set dev veth10 up > > $ ip -netns ns0 address add 192.0.2.10/24 dev veth01 > $ ip -netns ns1 address add 192.0.2.11/24 dev veth10 > > $ ip -netns ns1 address add 192.0.2.21/32 dev lo > $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21 > $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0 > > With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21 > (ping uses -Q to set all TOS and ECN bits): > > $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255 > [...] > 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms > > But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11 > because the "tos 4" route isn't matched: > > $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255 > [...] > 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms > > After this patch the ECN bits don't affect the result anymore: > > $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255 > [...] > 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms > > Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.") > Signed-off-by: Guillaume Nault <gnault@redhat.com> > --- > net/ipv4/fib_frontend.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c > index cdf6ec5aa45d..84bb707bd88d 100644 > --- a/net/ipv4/fib_frontend.c > +++ b/net/ipv4/fib_frontend.c > @@ -292,7 +292,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb) > .flowi4_iif = LOOPBACK_IFINDEX, > .flowi4_oif = l3mdev_master_ifindex_rcu(dev), > .daddr = ip_hdr(skb)->saddr, > - .flowi4_tos = RT_TOS(ip_hdr(skb)->tos), > + .flowi4_tos = ip_hdr(skb)->tos & IPTOS_RT_MASK, This is, IMHO, a bad way to fix this. It addresses the issue by NOT using what should be the standard method of accessing tos bits. And further if RT_TOS is every properly fixed leaves a dangling place that would need to be cleaned up later. I understand that RT_TOS is legacy and returns the whole byte, that should probably be corrected as I suspect there are other lurking cases of use of RT_TOS that is also still consuming the ECN bits. The above one line patch is the "fast and dirty" way to do it, and leads to longer term maintainance issues. I assert this Opinion from my experience in trying to clean up IN_ADDR macros in the BSD's, 90% of what needing patching was caused mostly by special cases that didnt use the macros, and instead did there own bit banging with | and &. A better fix might be to create some new macros, make the old macro emmit a compile time warning to weed out the bad use cases: a) Modify RT_TOS to assert a compile time warning to flag old usage. b) create RT_TOSBYTE, this returns the whole byte, used to replace RT_TOS when the whole byte is needed and dealt with properly. c) create RT_TOSTOSBITS, this returns ONLY the TOS bits, and probably should replace most of the current calls to RT_TOS with this, after inspecting the code to make sure it does the right thing. d) create RT_TOSECNBITS, this returns ONLY the ECN bits > .flowi4_scope = scope, > .flowi4_mark = vmark ? skb->mark : 0, > }; > -- > 2.21.3 > > > > -- > "For a successful technology, reality must take precedence over public > relations, for Mother Nature cannot be fooled" - Richard Feynman > > dave@taht.net <Dave T?ht> CTO, TekLibre, LLC Tel: 1-831-435-0729 > _______________________________________________ > Ecn-sane mailing list > Ecn-sane@lists.bufferbloat.net > https://lists.bufferbloat.net/listinfo/ecn-sane > > -- Rod Grimes rgrimes@freebsd.org ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() 2020-12-29 13:33 ` Rodney W. Grimes @ 2020-12-29 15:21 ` Jonathan Morton 2020-12-29 20:13 ` Rodney W. Grimes 0 siblings, 1 reply; 5+ messages in thread From: Jonathan Morton @ 2020-12-29 15:21 UTC (permalink / raw) To: Rodney W. Grimes; +Cc: Dave Taht, ECN-Sane > On 29 Dec, 2020, at 3:33 pm, Rodney W. Grimes <4bone@gndrsh.dnsmgr.net> wrote: > > A better fix might be to create some new macros, make the > old macro emmit a compile time warning to weed out the > bad use cases: > a) Modify RT_TOS to assert a compile time warning to > flag old usage. > > b) create RT_TOSBYTE, this returns the whole byte, used > to replace RT_TOS when the whole byte is needed and > dealt with properly. > > c) create RT_TOSTOSBITS, this returns ONLY the TOS bits, > and probably should replace most of the current calls > to RT_TOS with this, after inspecting the code to > make sure it does the right thing. > > d) create RT_TOSECNBITS, this returns ONLY the ECN bits Agree with the principle here, quibble with the terminology. Better names for these new macros: RT_TOSBYTE RT_TOS_DSFIELD RT_TOS_ECNFIELD This emphasises that the "TOS bits" and Precedence field have been absorbed into the Diffserv field for the past 20 years already, and that generally the talk is nowadays of a "field" or a "flag" rather than "bits". I would seriously consider also making the DSFIELD macro return a right-justified value, matching most Diffserv documentation tables, rather than just masking out the left-justified position it occupies within the TOS byte. This requires a second macro to correctly insert a fresh right-justified value into the left-justified position, ideally without disturbing any ECN field value already present. There is some evidence that mishandling this is a significant source of bugs in user code, if not also in kernel code. - Jonathan Morton ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() 2020-12-29 15:21 ` Jonathan Morton @ 2020-12-29 20:13 ` Rodney W. Grimes 2020-12-29 20:59 ` Jonathan Morton 0 siblings, 1 reply; 5+ messages in thread From: Rodney W. Grimes @ 2020-12-29 20:13 UTC (permalink / raw) To: Jonathan Morton; +Cc: Rodney W. Grimes, Dave Taht, ECN-Sane > > On 29 Dec, 2020, at 3:33 pm, Rodney W. Grimes <4bone@gndrsh.dnsmgr.net> wrote: > > > > A better fix might be to create some new macros, make the > > old macro emmit a compile time warning to weed out the > > bad use cases: > > a) Modify RT_TOS to assert a compile time warning to > > flag old usage. > > > > b) create RT_TOSBYTE, this returns the whole byte, used > > to replace RT_TOS when the whole byte is needed and > > dealt with properly. > > > > c) create RT_TOSTOSBITS, this returns ONLY the TOS bits, > > and probably should replace most of the current calls > > to RT_TOS with this, after inspecting the code to > > make sure it does the right thing. > > > > d) create RT_TOSECNBITS, this returns ONLY the ECN bits > > Agree with the principle here, quibble with the terminology. Better names for these new macros: :-) Its, blue, I tell you that bike shed must be blue! :-) > > RT_TOSBYTE > RT_TOS_DSFIELD > RT_TOS_ECNFIELD I agree that these are better names. > > This emphasises that the "TOS bits" and Precedence field have been absorbed into the Diffserv field for the past 20 years already, and that generally the talk is nowadays of a "field" or a "flag" rather than "bits". > > I would seriously consider also making the DSFIELD macro return a right-justified value, matching most Diffserv documentation tables, rather than just masking out the left-justified position it occupies within the TOS byte. This requires a second macro to correctly insert a fresh right-justified value into the left-justified position, ideally without disturbing any ECN field value already present. There is some evidence that mishandling this is a significant source of bugs in user code, if not also in kernel code. That does make since but my concern is this would require more than just a drop in replacement of RT_TOS as now you would need to replace the macro and probably modify other nerby code. If these are purely kernel macros that scope of work is well contained. Are these macro's exported from a header to userland visible code? > - Jonathan Morton -- Rod Grimes rgrimes@freebsd.org ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() 2020-12-29 20:13 ` Rodney W. Grimes @ 2020-12-29 20:59 ` Jonathan Morton 0 siblings, 0 replies; 5+ messages in thread From: Jonathan Morton @ 2020-12-29 20:59 UTC (permalink / raw) To: Rodney W. Grimes; +Cc: Dave Taht, ECN-Sane > On 29 Dec, 2020, at 10:13 pm, Rodney W. Grimes <4bone@gndrsh.dnsmgr.net> wrote: > > If these are purely kernel macros that scope of work is well contained. Are these macro's exported from a header to userland visible code? It appears RT_TOS is defined in a "uapi" header, that is, Userspace API. So yes, and this also means any bugs in it have propagated into userspace code built against it. Actually, it looks like it is not a bug in the macro itself, but in its intended application, referring to an obsolete and very short-lived specification. It's extracting *specifically* the widened Type of Service bitfield, containing the non-orthogonal "flags" indicating Max Throughput, Min Latency, Max Reliability, and Min Cost (this last being the short-lived part of the spec, as Diffserv and ECN replaced it very quickly). The mask specifically excludes the Precedence field and thus the most-significant half of the Diffserv field. - Jonathan Morton ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-12-29 20:59 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <49ff39b1f55c914847cd58678bae6282112db701.1608836260.git.gnault@redhat.com> 2020-12-29 3:18 ` [Ecn-sane] Fwd: [PATCH net] ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() Dave Taht 2020-12-29 13:33 ` Rodney W. Grimes 2020-12-29 15:21 ` Jonathan Morton 2020-12-29 20:13 ` Rodney W. Grimes 2020-12-29 20:59 ` Jonathan Morton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox