From: Toke Høiland-Jørgensen
To: Pete Heist
Cc: Cake List
Subject: Re: [Cake] lockup with multiple cake instances on 3.16.7
Date: Fri, 01 Feb 2019 00:09:25 +0100
Message-ID: <87h8doifve.fsf@toke.dk>

Pete Heist writes:

>> On Jan 31, 2019, at 3:53 PM, Toke Høiland-Jørgensen wrote:
>>
>> Well, the backtrace is definitely hanging on that lock in
>> gnet_stats_start_copy_compat(). If it's not related to the CAKE logging,
>> I guess it must be a bug in the upstream kernel; which we probably can't
>> fix from the cake side anyway.
>>
>> I don't suppose you can reproduce this on a newer kernel?
>
> So far it works fine on 3.10.107 (with mipsel EdgeOS build) and
> 4.9.0-8 amd64, haven’t tried anything in between, but…
>
> printk tells me that it’s not locking up in cake_dump_class_stats, but
> after a failure in cake_dump_stats.
> When “tc qdisc” is run after
> adding the fifth cake instance, line 2974 is failing:
>
>         PUT_TSTAT_U32(TARGET_US,
>                       ktime_to_us(ns_to_ktime(b->cparams.target)));
>
> So the call to nla_put_u32 returns nonzero. Then it ends up at
> nla_put_failure where nla_nest_cancel is called. The function returns,
> but the lock is not being released by the kernel in the failure case.
> The following patch “fixes it”:

Ah, good find!

> diff --git a/sch_cake.c b/sch_cake.c
> index 3a26db0..ae3e16c 100644
> --- a/sch_cake.c
> +++ b/sch_cake.c
> @@ -3010,6 +3010,7 @@ static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
>
>  nla_put_failure:
>          nla_nest_cancel(d->skb, stats);
> +        sch_tree_unlock(sch);
>          return -1;
>  }
>
> Two questions:
>
> 1) Why is nla_put_u32 suddenly failing for TARGET_US after adding five
> cake instances?

Probably because it's running out of kernel memory? How much system
memory do you have on the system you are testing this on?

> 2) Is calling sch_tree_unlock the right thing to do in the failure
> case, or am I working around a kernel bug, and doing something that
> would fail in other kernels?

Yes, I think you are working around a kernel bug. See
https://elixir.bootlin.com/linux/v3.16.7/source/net/sched/sch_api.c#L1330

The lock is taken in gnet_stats_start_copy_compat() and released in
gnet_stats_finish_copy(). The latter is skipped in the failure path.

It seems this bug is present all the way up to Eric's change to remove
the locking entirely (which went into 4.8). So I guess you could get a
patch accepted for the stable trees in 3.16 and 4.4; not that this would
help you much if you are stuck on 3.16.7...

-Toke