From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: mail.toke.dk; spf=none smtp.mailfrom=mojatatu.com; dkim=pass header.d=mojatatu-com.20230601.gappssmtp.com; arc=pass; dmarc=none Received: from mail-pf1-x434.google.com (mail-pf1-x434.google.com [IPv6:2607:f8b0:4864:20::434]) by mail.toke.dk (Postfix) with ESMTPS id 5FD45E02A21 for ; Fri, 13 Mar 2026 20:36:41 +0100 (CET) Received: by mail-pf1-x434.google.com with SMTP id d2e1a72fcca58-823c56765fdso1520313b3a.1 for ; Fri, 13 Mar 2026 12:36:41 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1773430600; cv=none; d=google.com; s=arc-20240605; b=fxf0sKx2FyYg77oqYLRhNOaUdk2dh6ldQFkwRqmaZ2Wvq9myysSgQ8Ak/sx+iLhBCG fVJtJl6onIyOW/yzmaQZnb0g9G9lFzqi3Xje3oflcWJQBWtcUU29ogWyAxWsEkojlK4z hLlZqnQ3g4oA0c9pj9Y0HszyfPOC+7Sd5NdANSJc+6IaLELaW6ucDyWeucJ2wthrIwOl /DCEOwgrBLGh7kUDifKHozEbbEMac6vkOXHx4fhIe2Et0kIyijcsm/On+OYIegRdRt8m v/GZHpaCNtkg/I+/ZGeWaqUh0E0J6aTW8r5fQW/ajB7ST6gI7PVQnx5OwZVqtQqV9a5T zuCA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=7SZBQxP659/1Vz18iP/TaSiFqht3yaJrXilwyHpy4Uc=; fh=j+n/xQjUSXDehiyIYcyhxaVuNUvEa8BNy877vAerQU8=; b=hq28GKNSS/U1d0phdd2qvm78NbuPrHAyiTcccbow1bQyfwDoYyZSPyVTpymRYnxLuP ku8AoRgc6tfspBEXjUQ0nG7IEvcCp/bXbWH0sKDcBOlOOFTtDT8QhbejsFxCjs0Dmz41 OgMQZvpUunYBWT/rfGz4ZAw/G0ySOvD/1xBYOQ3lhd5t4UCJtAM2ZbNOsUpmggfIiDAp kcmbFi1KD5sJbwHINIzEQ/im03lMoKmj/Yc/HWNdwI7nwl6LDmzHWmfYdB9OUf1+KKRO Ra85MFuGo+0YpqzpZJ5PY2H+hTwmrRH8TGp7WVfpHcJrofSDh6qx57lyjwKCDJXrPKB6 wzQw==; darn=lists.bufferbloat.net ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mojatatu-com.20230601.gappssmtp.com; s=20230601; t=1773430600; x=1774035400; darn=lists.bufferbloat.net; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=7SZBQxP659/1Vz18iP/TaSiFqht3yaJrXilwyHpy4Uc=; b=DJzg+9xKQSh4qIPlgQIfTUkFfWoEZLRF6dk0DVd9NyEDlSXtRYWFJ+oT6aVUXoMGJu mYhgORUxW1i7VNSsER/Fa09YjI6N/Jjq7x5gnd491kzSqvaF/cSTTkLYs2cvfGCV9LwZ FtlD7Sl36RmKPsibUv6NQos84Afy50Q2SPgmuKxfAK+PoFTzxDXOkS1q36MLZ1vpZ7rh driE0iyZdweeSQSN0L9RN9hA3LXCDakrsMMl7YLTMdZq0Za8yNTRB23w+lvEBROEJInd 3bDPPU1UcPtdKo1xACI3XArXgVmvvFsQ0uHGtNA6YmF3062tSRo3uTz3m1SYjW1T/U/N 0ESA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1773430600; x=1774035400; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=7SZBQxP659/1Vz18iP/TaSiFqht3yaJrXilwyHpy4Uc=; b=DK2aG/r+t3JLsW6+24anZGMluoSQB+Ky/Xoh75YJkXwb8QGNPuk5zhKCNnz5GBClk0 0pAVZQTZv/2bbyLdXF3sGXvyIAKr2nH8duzO01WQFD9gINb7rp5aDw+p6/2ng0LSA5Cj wuYd0+VJX5zShR8/PgR0eZSSrfI+ux/oU0rjhzrEBGYVnwK+Gd3kwjI1J3DTsKBCaOey aS++8Tc0Df3gwvRk7FZsmp+3wKNS3O8rlEkx5HGrj7toLK00emJ2/9hmZFYsk3xsv6aP 8OpqtF2KGkXPI8Q3FFMbZqLZEvb8zJYJTNmwAb6ssWYbht9joVJfQDRLhEAgJqylqvZT ObKQ== X-Forwarded-Encrypted: i=1; AJvYcCVnpyQh1td2AGfFQ5Vq5XvI/sF2N+gC9fvFJbJ/T06le51gjZpRuT//6/GRZXamKkC53VbA@lists.bufferbloat.net X-Gm-Message-State: AOJu0YzYIajsV6Chspe3ByAsAViBhm+Rx1rTDN+r18YiIaWA6cuca5cY vBt5YfvpA0mcn10bpCuEkv+wXdl+06xMmM8tY1k28Nq4RAJBUeKTc8enb8gQ4rSLDTuFV82xq6b s6ZHr0BmA+zTOtc3hlUDyPjTRXBOZRbXJQogPM/59 X-Gm-Gg: ATEYQzxE73y/ql3yCHcuvMeyDYoyQL3XYZZHjTsQo9jAN3L7oyuWVPPk02ZKgA0gxdE gZNHAPLipMUQ36MmrN9kG2QfAisUVrMa1Y2WwMrNggBpXA4Obl5Ws2vaEx/686vWBgBEHksYivf m9G4PoIwHJ8Z/6gxwrfgnGQD5R0LqbUMLr0iNjYY/HF/CdCG8vq6IiUs1W4atw27m1uOcyPUxpq 6qz6g20i8+CJLUhFuCvj7bKhvjxFiBzvX4FUokHr+VJPa6FkRYS9krNiqkZNDTba+i4bIcD9Y69 JMHKojNwEXlUdCs= X-Received: by 2002:a05:6a00:1388:b0:829:7ee0:1a44 with SMTP id d2e1a72fcca58-82a197046dfmr4186054b3a.4.1773430599837; Fri, 13 Mar 2026 12:36:39 -0700 (PDT) MIME-Version: 1.0 References: <20260307212058.169511-1-jhs@mojatatu.com> <20260310184713.7e810431@kernel.org> <20260311175249.54abe1b6@kernel.org> <20260312165113.773a5f44@kernel.org> In-Reply-To: From: Jamal Hadi Salim Date: Fri, 13 Mar 2026 15:36:28 -0400 X-Gm-Features: AaiRm51FzVPEj5x8-7UQwBJ_H90URykuqOW7BBLFDfVKu8idWAvLsU3tRRb1ky8 Message-ID: To: Jakub Kicinski Cc: netdev@vger.kernel.org, davem@davemloft.net, edumazet@google.com, pabeni@redhat.com, horms@kernel.org, jiri@resnulli.us, toke@toke.dk, vinicius.gomes@intel.com, stephen@networkplumber.org, vladbu@nvidia.com, cake@lists.bufferbloat.net, bpf@vger.kernel.org, ghandatmanas@gmail.com, km.kim1503@gmail.com, security@kernel.org, Victor Nogueira Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Message-ID-Hash: ST6PFVSVZRGJH7LEAMKJJHEZBONICD4W X-Message-ID-Hash: ST6PFVSVZRGJH7LEAMKJJHEZBONICD4W X-MailFrom: jhs@mojatatu.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list Subject: [Cake] Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete List-Id: Cake - FQ_codel the next generation Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, Mar 13, 2026 at 11:56=E2=80=AFAM Jamal Hadi Salim wrote: > > On Thu, Mar 12, 2026 at 7:51=E2=80=AFPM Jakub Kicinski = wrote: > > > > On Thu, 12 Mar 2026 16:36:48 -0400 Jamal Hadi Salim wrote: > > > > > Two of the several (I think 4!) patches we had took a similar pat= h. I > > > > > am trying to remember at least one variant was bad for performanc= e and > > > > > the other was unstable. Let's see if we can revive it and take a > > > > > closer look. BTW - none were pretty, it was maybe half the lines = of > > > > > code but touched many things. > > > > > > > > FWIW / of course, we have to apply similar change to all(?) callers= of > > > > __tcf_qdisc_find in cls_api. So LOC-wise it may end up also pretty = long. > > > > And it's not going to help the already spaghetti-looking locking. B= ut > > > > even if it's more LoC I quite like the idea of containing the poopy > > > > code to where problems originate which is the lockless filter handl= ing. > > > > Fingers crossed.. > > > > > > Something like attached. > > > Unfortunately after running it for a few hours it reproduced. > > > The action code path (entered by virtue of filter code path execution= ) > > > releases the rtnl when attempting to load an action module. A paralle= l > > > qdisc operation waiting for the lock then grabs it and we hit the sam= e > > > issue... > > > > > > So now we have to be more invasive and start coordinating the action > > > code etc, which is not appealing. Thoughts? > > > > I see. Doesn't seem entirely crazy to let tcf_proto_lookup_ops() > > return -EAGAIN without actually loading the module, and have it's > > call path (of which there are only 2?) do the module loading once > > all the locks are released. The call paths handle the EAGAIN and > > retry already they just assume tcf_proto_lookup_ops() has loaded > > the module so they don't have to. > > I might not be entirely following what you are saying. The core issue > is not module loading perse. We can avoid module loading in > tcf_proto_lookup_ops while the rtnl lock is released and the qdisc > refcnt is greater than 1 (which is what the patch i sent attempted). > The new quark is when we have actions associated with the filter. > > I will send you a much lengthier description of the issue in an hour > or two to get us on the same page. Note, i am going to describe the scenario we saw with the action quark where "filter add" enters before "qdisc del"; the one described in the commit is the reverse order. We see the issue in both scenarios. Sorry, this is a lot of text ;-> Let me explain the "expected" behavior =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D time x: On cpu1: "filter add" enters increments refcnt to 2 after finding the root qdisc (see __tcf_qdisc_find) time x+1: On cpu1: "filter add" in the action code releases the rtnl_lock before calling request_module. time x+1: On cpu2: "qdisc del" enters for root qdisc, holding rtnl (example refcnt =3D 1), see tc_get_qdisc(). time x+2: On cpu2: "qdisc del" calls graft which calls qdisc_put() which decrements refcnt to 1 and cant delete qdisc because refcnt is no= t 0; "qdisc del" exits ***We are in an inconsistent state at this point - IOW, the qdisc needs to be deleted but cant because of the refcnt !=3D0. time x+3: On cpu1: "filter add" acquires the rtnl_lock again in the action code (see time x+1). time x+4: On cpu1: "filter add" eventually completes; see tcf_block_release= () invocation in tc_new_tfilter() which eventually calls qdisc_put() which manages to decrement refcnt to 0. Qdisc is deleted.. To reproduce the issue requires 3 threads - which is what the repro does. Note: sequence time x to x+2 is the same as "expected" flow above. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D time x: On cpu1: "filter add" enters increments refcnt to 2 after finding the root qdisc (see __tcf_qdisc_find) time x+1: On cpu1: "filter add" in the action code releases the rtnl_lock before calling request_module. Note: the action code path will be momentarily rescheduled because of request_module call. time x+1: On cpu2: "qdisc del" enters for root qdisc, holding rtnl (example refcnt =3D 1), see tc_get_qdisc(). time x+2: On cpu2: "qdisc del" calls graft which calls qdisc_put() which decrements refcnt to 1 and cant delete qdisc because refcnt is no= t 0; "qdisc del" exits ***We are in an inconsistent state at this point- IOW, the qdisc needs to be deleted but cant because of the refcnt !=3D0. time x+3: On cpu3: "qdisc add" comes in and finds the deleted qdisc still on the qdisc hash table since it hasnt been deleted (see inconsit= ent state in time x+2). And then shit hits the fan (as described in t= he commit, qdisc_tree_reduce_backlog() etc). In this specific example, the issue is that the classifier code path can't release the rtnl_lock while the qdisc's refcnt is bigger than 1. Does this make more sense? The reason we went with the "mark for delete" approach is at time x+3 the "qdisc add" wont be able to find this qdisc. This is the common observed pattern - for example described in the commit message where we get have a slightly different flow with "qdisc del" before "filter add". cheers, jamal