From: Dave Taht
Date: Fri, 18 Nov 2022 16:03:34 -0800
To: Cake List
Subject: [Cake] Fwd: [PATCH net] net/sched: act_mirred: use the backlog for mirred ingress

mirred must die.

---------- Forwarded message ---------
From: Davide Caratti
Date: Tue, Oct 4, 2022 at 10:52 AM
Subject: Re: [PATCH net] net/sched: act_mirred: use the backlog for mirred ingress
To: Cong Wang
Cc: Jamal Hadi Salim, Jiri Pirko, Paolo Abeni, Marcelo Ricardo Leitner

hello Cong,

thanks for looking at this!

On Sun, Sep 25, 2022 at 11:08:48AM -0700, Cong Wang wrote:
> On Fri, Sep 23, 2022 at 05:11:12PM +0200, Davide Caratti wrote:
> > William reports kernel soft-lockups on some OVS topologies when TC mirred
> > "egress-to-ingress" action is hit by local TCP traffic.
> > Indeed, using the mirred action in egress-to-ingress can easily produce
> > a dmesg splat like:
> >
> > ============================================
> > WARNING: possible recursive locking detected
> > [...]
> > 6.0.0-rc4+ #511 Not tainted
> > --------------------------------------------
> > nc/1037 is trying to acquire lock:
> > ffff950687843cb0 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1023/0x1160
> >
> > but task is already holding lock:
> > ffff950687846cb0 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1023/0x1160

FTR, this is:

2091         sk_incoming_cpu_update(sk);
2092
2093         bh_lock_sock_nested(sk);   <--- the lock reported in the splat
2094         tcp_segs_in(tcp_sk(sk), skb);
2095         ret = 0;
2096         if (!sock_owned_by_user(sk)) {

> BTW, have you thought about solving the above lockdep warning in TCP
> layer?

yes, but that doesn't look like a trivial fix at all - and I doubt it's
worth doing just to make mirred and TCP "friends".

Please note: on current kernels this doesn't just produce a lockdep
warning - using iperf3 on an unpatched kernel it's possible to hit a
real deadlock, like:

WARNING: possible circular locking dependency detected
6.0.0-rc4+ #511 Not tainted
------------------------------------------------------
iperf3/1021 is trying to acquire lock:
ffff976005c5a630 (slock-AF_INET6/1){+...}-{2:2}, at: tcp_v4_rcv+0x1023/0x1160

but task is already holding lock:
ffff97607b06e0b0 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1023/0x1160

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (slock-AF_INET/1){+.-.}-{2:2}:
       lock_acquire+0xd5/0x310
       _raw_spin_lock_nested+0x39/0x70
       tcp_v4_rcv+0x1023/0x1160
       ip_protocol_deliver_rcu+0x4d/0x280
       ip_local_deliver_finish+0xac/0x160
       ip_local_deliver+0x71/0x220
       ip_rcv+0x5a/0x200
       __netif_receive_skb_one_core+0x89/0xa0
       netif_receive_skb+0x1c1/0x400
       tcf_mirred_act+0x2a5/0x610 [act_mirred]
       tcf_action_exec+0xb3/0x210
       fl_classify+0x1f7/0x240 [cls_flower]
       tcf_classify+0x7b/0x320
       __dev_queue_xmit+0x3a4/0x11b0
       ip_finish_output2+0x3b8/0xa10
       ip_output+0x7f/0x260
       __ip_queue_xmit+0x1ce/0x610
       __tcp_transmit_skb+0xabc/0xc80
       tcp_rcv_established+0x284/0x810
       tcp_v4_do_rcv+0x1f3/0x370
       tcp_v4_rcv+0x10bc/0x1160
       ip_protocol_deliver_rcu+0x4d/0x280
       ip_local_deliver_finish+0xac/0x160
       ip_local_deliver+0x71/0x220
       ip_rcv+0x5a/0x200
       __netif_receive_skb_one_core+0x89/0xa0
       netif_receive_skb+0x1c1/0x400
       tcf_mirred_act+0x2a5/0x610 [act_mirred]
       tcf_action_exec+0xb3/0x210
       fl_classify+0x1f7/0x240 [cls_flower]
       tcf_classify+0x7b/0x320
       __dev_queue_xmit+0x3a4/0x11b0
       ip_finish_output2+0x3b8/0xa10
       ip_output+0x7f/0x260
       __ip_queue_xmit+0x1ce/0x610
       __tcp_transmit_skb+0xabc/0xc80
       tcp_write_xmit+0x229/0x12c0
       __tcp_push_pending_frames+0x32/0xf0
       tcp_sendmsg_locked+0x297/0xe10
       tcp_sendmsg+0x27/0x40
       sock_sendmsg+0x58/0x70
       sock_write_iter+0x9a/0x100
       vfs_write+0x481/0x4f0
       ksys_write+0xc2/0xe0
       do_syscall_64+0x3a/0x90
       entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> #0 (slock-AF_INET6/1){+...}-{2:2}:
       check_prevs_add+0x185/0xf50
       __lock_acquire+0x11eb/0x1620
       lock_acquire+0xd5/0x310
       _raw_spin_lock_nested+0x39/0x70
       tcp_v4_rcv+0x1023/0x1160
       ip_protocol_deliver_rcu+0x4d/0x280
       ip_local_deliver_finish+0xac/0x160
       ip_local_deliver+0x71/0x220
       ip_rcv+0x5a/0x200
       __netif_receive_skb_one_core+0x89/0xa0
       netif_receive_skb+0x1c1/0x400
       tcf_mirred_act+0x2a5/0x610 [act_mirred]
       tcf_action_exec+0xb3/0x210
       fl_classify+0x1f7/0x240 [cls_flower]
       tcf_classify+0x7b/0x320
       __dev_queue_xmit+0x3a4/0x11b0
       ip_finish_output2+0x3b8/0xa10
       ip_output+0x7f/0x260
       __ip_queue_xmit+0x1ce/0x610
       __tcp_transmit_skb+0xabc/0xc80
       tcp_rcv_established+0x284/0x810
       tcp_v4_do_rcv+0x1f3/0x370
       tcp_v4_rcv+0x10bc/0x1160
       ip_protocol_deliver_rcu+0x4d/0x280
       ip_local_deliver_finish+0xac/0x160
       ip_local_deliver+0x71/0x220
       ip_rcv+0x5a/0x200
       __netif_receive_skb_one_core+0x89/0xa0
       netif_receive_skb+0x1c1/0x400
       tcf_mirred_act+0x2a5/0x610 [act_mirred]
       tcf_action_exec+0xb3/0x210
       fl_classify+0x1f7/0x240 [cls_flower]
       tcf_classify+0x7b/0x320
       __dev_queue_xmit+0x3a4/0x11b0
       ip_finish_output2+0x3b8/0xa10
       ip_output+0x7f/0x260
       __ip_queue_xmit+0x1ce/0x610
       __tcp_transmit_skb+0xabc/0xc80
       tcp_write_xmit+0x229/0x12c0
       __tcp_push_pending_frames+0x32/0xf0
       tcp_sendmsg_locked+0x297/0xe10
       tcp_sendmsg+0x27/0x40
       sock_sendmsg+0x42/0x70
       sock_write_iter+0x9a/0x100
       vfs_write+0x481/0x4f0
       ksys_write+0xc2/0xe0
       do_syscall_64+0x3a/0x90
       entry_SYSCALL_64_after_hwframe+0x63/0xcd

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(slock-AF_INET/1);
                               lock(slock-AF_INET6/1);
                               lock(slock-AF_INET/1);
  lock(slock-AF_INET6/1);

 *** DEADLOCK ***

12 locks held by iperf3/1021:
 #0: ffff976005c5a6c0 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_sendmsg+0x19/0x40
 #1: ffffffffbca07320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5/0x610
 #2: ffffffffbca072e0 (rcu_read_lock_bh){....}-{1:2}, at: ip_finish_output2+0xaa/0xa10
 #3: ffffffffbca072e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x72/0x11b0
 #4: ffffffffbca07320 (rcu_read_lock){....}-{1:2}, at: netif_receive_skb+0x181/0x400
 #5: ffffffffbca07320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x54/0x160
 #6: ffff97607b06e0b0 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1023/0x1160
 #7: ffffffffbca07320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5/0x610
 #8: ffffffffbca072e0 (rcu_read_lock_bh){....}-{1:2}, at: ip_finish_output2+0xaa/0xa10
 #9: ffffffffbca072e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x72/0x11b0
 #10: ffffffffbca07320 (rcu_read_lock){....}-{1:2}, at: netif_receive_skb+0x181/0x400
 #11: ffffffffbca07320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x54/0x160

[...]

kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [swapper/1:0]

Moreover, even if we improve TCP locking in order to avoid lockups on
this simple topology, I suspect that TCP would still see packet losses:
when mirred detects 4 nested calls of tcf_mirred_act(), the kernel
protects against excessive stack growth and drops the skb (which can be
a full TSO packet) - see the sketch below. The protocol can probably
recover, but performance will certainly be sub-optimal.

> Which also means we can no longer know the RX path status, right? I
> mean if we have filters on ingress, we can't know whether they drop
> this packet or not, after this patch? To me, this at least breaks
> users' expectation.

Fair point! Then maybe we don't need to change the whole TC mirred
ingress path: since the problem only affects egress-to-ingress, we can
preserve the call to netif_receive_skb() for ingress->ingress, and just
use the backlog in the egress->ingress direction - a path that has been
broken since the very beginning and got similar fixes in the past [1].
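For context, this is roughly what that recursion guard looks like in
net/sched/act_mirred.c - a simplified sketch of the ~6.0-era code with
the actual mirror/redirect work elided; treat the details as
illustrative, not a verbatim quote:

#define MIRRED_RECURSION_LIMIT  4
static DEFINE_PER_CPU(unsigned int, mirred_rec_level);

static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
                          struct tcf_result *res)
{
        unsigned int rec_level = __this_cpu_inc_return(mirred_rec_level);

        if (unlikely(rec_level > MIRRED_RECURSION_LIMIT)) {
                /* too many nested tcf_mirred_act() calls on this CPU:
                 * drop the skb rather than grow the stack any further
                 */
                net_warn_ratelimited("Packet exceeded mirred recursion limit on dev %s\n",
                                     netdev_name(skb->dev));
                __this_cpu_dec(mirred_rec_level);
                return TC_ACT_SHOT;
        }

        /* ... clone/redirect work elided; for egress->ingress this ends
         * up in netif_receive_skb(), which re-runs the whole RX path
         * synchronously on the current stack ...
         */

        __this_cpu_dec(mirred_rec_level);
        return TC_ACT_STOLEN;
}

Using the backlog instead means calling netif_rx(): the skb is only
queued on the per-CPU backlog and processed later by the NET_RX
softirq, on a fresh stack and with no TCP socket lock held, so neither
the recursion limit nor the nested bh_lock_sock_nested() above can be
hit.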
Something like:

-- >8 --
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -205,12 +205,14 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 	return err;
 }
 
-static int tcf_mirred_forward(bool want_ingress, struct sk_buff *skb)
+static int tcf_mirred_forward(bool want_ingress, bool at_ingress, struct sk_buff *skb)
 {
 	int err;
 
 	if (!want_ingress)
 		err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
+	else if (!at_ingress)
+		err = netif_rx(skb);
 	else
 		err = netif_receive_skb(skb);
 
@@ -306,7 +308,7 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
 		/* let's the caller reinsert the packet, if possible */
 		if (use_reinsert) {
 			res->ingress = want_ingress;
-			err = tcf_mirred_forward(res->ingress, skb);
+			err = tcf_mirred_forward(res->ingress, at_ingress, skb);
 			if (err)
 				tcf_action_inc_overlimit_qstats(&m->common);
 			__this_cpu_dec(mirred_rec_level);
@@ -314,7 +316,7 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
 		}
 	}
 
-	err = tcf_mirred_forward(want_ingress, skb2);
+	err = tcf_mirred_forward(want_ingress, at_ingress, skb2);
 	if (err) {
 out:
 		tcf_action_inc_overlimit_qstats(&m->common);
-- >8 --

WDYT? Any feedback appreciated, thanks!

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f799ada6bf2397c351220088b9b0980125c77280

--
davide

--
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz

Dave Täht CEO, TekLibre, LLC