From mboxrd@z Thu Jan  1 00:00:00 1970
References: <20220929142447.3821638-1-mubashirmaq@gmail.com> <20220929142447.3821638-3-mubashirmaq@gmail.com>
In-Reply-To: <20220929142447.3821638-3-mubashirmaq@gmail.com>
From: Dave Taht
Date: Thu, 29 Sep 2022 08:03:59 -0700
To: ECN-Sane
Subject: [Ecn-sane] Fwd: [PATCH net-next 2/5] tcp: add PLB functionality for TCP
List-Id: Discussion of explicit congestion notification's impact on the Internet

see: https://henryhxu.github.io/share/jinbin-apnet22.pdf

---------- Forwarded message ---------
From: Mubashir Adnan Qureshi
Date: Thu, Sep 29, 2022 at 7:59 AM
Subject: [PATCH net-next 2/5] tcp: add PLB functionality for TCP
To: David Miller
Cc: , Mubashir Adnan Qureshi , Yuchung Cheng , Neal Cardwell , Eric Dumazet

From: Mubashir Adnan Qureshi

Congestion control algorithms track PLB state and cause the connection
to trigger a path change when either of the 2 conditions is satisfied:

- No packets are in flight and (# consecutive congested rounds >=
  sysctl_tcp_plb_idle_rehash_rounds)
- (# consecutive congested rounds >= sysctl_tcp_plb_rehash_rounds)

A round (RTT) is marked as congested when congestion signal
(ECN ce_ratio) over an RTT is greater than sysctl_tcp_plb_cong_thresh.
In the event of RTO, PLB (via tcp_write_timeout()) triggers a path
change and disables congestion-triggered path changes for random time
between (sysctl_tcp_plb_suspend_rto_sec, 2*sysctl_tcp_plb_suspend_rto_sec)
to avoid hopping onto the "connectivity blackhole".
RTO-triggered path changes can still happen during this cool-off period.

Signed-off-by: Mubashir Adnan Qureshi
Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
---
 include/net/tcp.h  |  28 +++++++++++++
 net/ipv4/Makefile  |   2 +-
 net/ipv4/tcp_plb.c | 102 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 131 insertions(+), 1 deletion(-)
 create mode 100644 net/ipv4/tcp_plb.c

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 27e8d378c70a..c50af63addea 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2135,6 +2135,34 @@ extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 extern void tcp_rack_reo_timeout(struct sock *sk);
 extern void tcp_rack_update_reo_wnd(struct sock *sk, struct rate_sample *rs);
 
+/* tcp_plb.c */
+
+/*
+ * Scaling factor for fractions in PLB. For example, tcp_plb_update_state
+ * expects cong_ratio which represents fraction of traffic that experienced
+ * congestion over a single RTT. In order to avoid floating point operations,
+ * this fraction should be mapped to (1 << TCP_PLB_SCALE) and passed in.
+ */
+#define TCP_PLB_SCALE 8
+
+/* State for PLB (Protective Load Balancing) for a single TCP connection. */
+struct tcp_plb_state {
+	u8	consec_cong_rounds:5, /* consecutive congested rounds */
+		unused:3;
+	u32	pause_until; /* jiffies32 when PLB can resume rerouting */
+};
+
+static inline void tcp_plb_init(const struct sock *sk,
+				struct tcp_plb_state *plb)
+{
+	plb->consec_cong_rounds = 0;
+	plb->pause_until = 0;
+}
+void tcp_plb_update_state(const struct sock *sk, struct tcp_plb_state *plb,
+			  const int cong_ratio);
+void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb);
+void tcp_plb_update_state_upon_rto(struct sock *sk, struct tcp_plb_state *plb);
+
 /* At how many usecs into the future should the RTO fire? */
 static inline s64 tcp_rto_delta_us(const struct sock *sk)
 {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index bbdd9c44f14e..af7d2cf490fb 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -10,7 +10,7 @@ obj-y := route.o inetpeer.o protocol.o \
 	     tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
 	     tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
 	     tcp_rate.o tcp_recovery.o tcp_ulp.o \
-	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
+	     tcp_offload.o tcp_plb.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
 	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
new file mode 100644
index 000000000000..26ffc5a45f53
--- /dev/null
+++ b/net/ipv4/tcp_plb.c
@@ -0,0 +1,102 @@
+/* Protective Load Balancing (PLB)
+ *
+ * PLB was designed to reduce link load imbalance across datacenter
+ * switches. PLB is a host-based optimization; it leverages congestion
+ * signals from the transport layer to randomly change the path of the
+ * connection experiencing sustained congestion. PLB prefers to repath
+ * after idle periods to minimize packet reordering. It repaths by
+ * changing the IPv6 Flow Label on the packets of a connection, which
+ * datacenter switches include as part of ECMP/WCMP hashing.
+ *
+ * PLB is described in detail in:
+ *
+ *	Mubashir Adnan Qureshi, Yuchung Cheng, Qianwen Yin, Qiaobin Fu,
+ *	Gautam Kumar, Masoud Moshref, Junhua Yan, Van Jacobson,
+ *	David Wetherall,Abdul Kabbani:
+ *	"PLB: Congestion Signals are Simple and Effective for
+ *	 Network Load Balancing"
+ *	In ACM SIGCOMM 2022, Amsterdam Netherlands.
+ *
+ */
+
+#include
+
+/* Called once per round-trip to update PLB state for a connection. */
+void tcp_plb_update_state(const struct sock *sk, struct tcp_plb_state *plb,
+			  const int cong_ratio)
+{
+	struct net *net = sock_net(sk);
+
+	if (!READ_ONCE(net->ipv4.sysctl_tcp_plb_enabled))
+		return;
+
+	if (cong_ratio >= 0) {
+		if (cong_ratio < READ_ONCE(net->ipv4.sysctl_tcp_plb_cong_thresh))
+			plb->consec_cong_rounds = 0;
+		else if (plb->consec_cong_rounds <
+			 READ_ONCE(net->ipv4.sysctl_tcp_plb_rehash_rounds))
+			plb->consec_cong_rounds++;
+	}
+}
+EXPORT_SYMBOL_GPL(tcp_plb_update_state);
+
+/* Check whether recent congestion has been persistent enough to warrant
+ * a load balancing decision that switches the connection to another path.
+ */
+void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
+{
+	struct net *net = sock_net(sk);
+	bool can_idle_rehash, can_force_rehash;
+	u32 max_suspend;
+
+	if (!READ_ONCE(net->ipv4.sysctl_tcp_plb_enabled))
+		return;
+
+	/* Note that tcp_jiffies32 can wrap; we detect wraps by checking for
+	 * cases where the max suspension end is before the actual suspension
+	 * end. We clear pause_until to 0 to indicate there is no recent
+	 * RTO event that constrains PLB rehashing.
+	 */
+	max_suspend = 2 * READ_ONCE(net->ipv4.sysctl_tcp_plb_suspend_rto_sec) * HZ;
+	if (plb->pause_until &&
+	    (!before(tcp_jiffies32, plb->pause_until) ||
+	     before(tcp_jiffies32 + max_suspend, plb->pause_until)))
+		plb->pause_until = 0;
+
+	can_idle_rehash = READ_ONCE(net->ipv4.sysctl_tcp_plb_idle_rehash_rounds) &&
+			  !tcp_sk(sk)->packets_out &&
+			  plb->consec_cong_rounds >=
+			  READ_ONCE(net->ipv4.sysctl_tcp_plb_idle_rehash_rounds);
+	can_force_rehash = plb->consec_cong_rounds >=
+			   READ_ONCE(net->ipv4.sysctl_tcp_plb_rehash_rounds);
+
+	if (!plb->pause_until && (can_idle_rehash || can_force_rehash)) {
+		sk_rethink_txhash(sk);
+		plb->consec_cong_rounds = 0;
+	}
+}
+EXPORT_SYMBOL_GPL(tcp_plb_check_rehash);
+
+/* Upon RTO, disallow load balancing for a while, to avoid having load
+ * balancing decisions switch traffic to a black-holed path that was
+ * previously avoided with a sk_rethink_txhash() call at RTO time.
+ */
+void tcp_plb_update_state_upon_rto(struct sock *sk, struct tcp_plb_state *plb)
+{
+	struct net *net = sock_net(sk);
+	u32 pause;
+
+	if (!READ_ONCE(net->ipv4.sysctl_tcp_plb_enabled))
+		return;
+
+	pause = READ_ONCE(net->ipv4.sysctl_tcp_plb_suspend_rto_sec) * HZ;
+	pause += prandom_u32_max(pause);
+	plb->pause_until = tcp_jiffies32 + pause;
+
+	/* Reset PLB state upon RTO, since an RTO causes a sk_rethink_txhash() call
+	 * that may switch this connection to a path with completely different
+	 * congestion characteristics.
+	 */
+	plb->consec_cong_rounds = 0;
+}
+EXPORT_SYMBOL_GPL(tcp_plb_update_state_upon_rto);
-- 
2.37.3.998.g577e59143f-goog


-- 
FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC
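
Addendum for readers without a kernel tree at hand: the round counting and the two rehash conditions in the patch above can be approximated in userspace. What follows is only a minimal sketch under stated assumptions, not the patch's code. The names plb_params, plb_state, plb_cong_ratio(), plb_update() and plb_should_rehash() are hypothetical, the parameter struct stands in for the per-netns sysctls, the RTO pause (pause_until) is left out entirely, and returning -1 when nothing was delivered is an assumption about how a caller might use the patch's "cong_ratio >= 0" check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PLB_SCALE 8 /* same value as TCP_PLB_SCALE in the patch */

/* Hypothetical stand-in for the sysctls read via READ_ONCE() in the patch. */
struct plb_params {
	int cong_thresh;        /* fraction scaled by (1 << PLB_SCALE) */
	int idle_rehash_rounds; /* 0 disables idle rehashing */
	int rehash_rounds;
};

/* Hypothetical reduced state: only the consecutive-round counter. */
struct plb_state {
	unsigned int consec_cong_rounds;
};

/* Map "CE-marked out of delivered" onto [0, 1 << PLB_SCALE] without floats.
 * Returning -1 for an empty round is an assumption of this sketch.
 */
static int plb_cong_ratio(uint32_t ce_marked, uint32_t delivered)
{
	if (delivered == 0)
		return -1;
	assert(ce_marked <= delivered);
	return (int)(((uint64_t)ce_marked << PLB_SCALE) / delivered);
}

/* Once per round: reset below the threshold, otherwise count up,
 * saturating at rehash_rounds as tcp_plb_update_state() does.
 */
static void plb_update(const struct plb_params *p, struct plb_state *s,
		       int cong_ratio)
{
	if (cong_ratio < 0)
		return;
	if (cong_ratio < p->cong_thresh)
		s->consec_cong_rounds = 0;
	else if (s->consec_cong_rounds < (unsigned int)p->rehash_rounds)
		s->consec_cong_rounds++;
}

/* Returns true where the patch would call sk_rethink_txhash().
 * The pause_until check after an RTO is deliberately omitted here.
 */
static bool plb_should_rehash(const struct plb_params *p, struct plb_state *s,
			      uint32_t packets_out)
{
	bool can_idle = p->idle_rehash_rounds && !packets_out &&
			s->consec_cong_rounds >=
			(unsigned int)p->idle_rehash_rounds;
	bool can_force = s->consec_cong_rounds >=
			 (unsigned int)p->rehash_rounds;

	if (can_idle || can_force) {
		s->consec_cong_rounds = 0;
		return true;
	}
	return false;
}
```

With illustrative parameters such as cong_thresh = 128 (half of 1 << 8), idle_rehash_rounds = 3 and rehash_rounds = 12, a connection that sees three congested rounds rehashes only once packets_out drops to zero, while one that stays busy has to reach twelve consecutive congested rounds first. That asymmetry is how the patch favors repathing after idle periods.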