[Cake] lockup with multiple cake instances on 3.16.7

Toke Høiland-Jørgensen toke at redhat.com
Thu Jan 31 18:09:25 EST 2019


Pete Heist <pete at heistp.net> writes:

>> On Jan 31, 2019, at 3:53 PM, Toke Høiland-Jørgensen <toke at redhat.com> wrote:
>> 
>> Well, the backtrace is definitely hanging on that lock in
>> gnet_stats_start_copy_compat(). If it's not related to the CAKE logging,
>> I guess it must be a bug in the upstream kernel; which we probably can't
>> fix from the cake side anyway.
>> 
>> I don't suppose you can reproduce this on a newer kernel?
>
> So far it works fine on 3.10.107 (with mipsel EdgeOS build) and
> 4.9.0-8 amd64, haven’t tried anything in between, but…
>
> printk tells me that it’s not locking up in cake_dump_class_stats, but
> after a failure in cake_dump_stats. When “tc qdisc” is run after
> adding the fifth cake instance, line 2974 is failing:
>
>                 PUT_TSTAT_U32(TARGET_US,
>                               ktime_to_us(ns_to_ktime(b->cparams.target)));
>
> So the call to nla_put_u32 returns nonzero. Then it ends up at
> nla_put_failure where nla_nest_cancel is called. The function returns,
> but the lock is not being released by the kernel in the failure case.
> The following patch “fixes it”:

Ah, good find!

> diff --git a/sch_cake.c b/sch_cake.c
> index 3a26db0..ae3e16c 100644
> --- a/sch_cake.c
> +++ b/sch_cake.c
> @@ -3010,6 +3010,7 @@ static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
>  
>  nla_put_failure:
>         nla_nest_cancel(d->skb, stats);
> +       sch_tree_unlock(sch);
>         return -1;
>  }
>
> Two questions:
>
> 1) Why is nla_put_u32 suddenly failing for TARGET_US after adding five
> cake instances?

Probably because it's running out of kernel memory? How much system
memory do you have on the system you are testing this on?

> 2) Is calling sch_tree_unlock the right thing to do in the failure
> case, or am I working around a kernel bug, and doing something that
> would fail in other kernels?

Yes, I think you are working around a kernel bug. See
https://elixir.bootlin.com/linux/v3.16.7/source/net/sched/sch_api.c#L1330

The lock is taken in gnet_stats_start_copy_compat() and released in
gnet_stats_finish_copy(). The latter is skipped in the failure path. It
seems this bug is present all the way up to Eric's change to remove the
locking entirely (which went into 4.8). So I guess you could get a patch
accepted for the stable trees in 3.16 and 4.4; not that this would help
you much if you are stuck on 3.16.7...

-Toke


More information about the Cake mailing list