On Feb 1, 2019, at 12:09 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:

>> 1) Why is nla_put_u32 suddenly failing for TARGET_US after adding five
>> cake instances?
>
> Probably because it's running out of kernel memory? How much system
> memory do you have on the system you are testing this on?

Plenty of memory (used 131308, free 1911900). My guess is that this is by design, with earlier kernels allocating a smaller initial size for tail space, but that’s only a guess, as I haven’t found where that’s done.

>> 2) Is calling sch_tree_unlock the right thing to do in the failure
>> case, or am I working around a kernel bug, and doing something that
>> would fail in other kernels?
>
> Yes, I think you are working around a kernel bug. See
> https://elixir.bootlin.com/linux/v3.16.7/source/net/sched/sch_api.c#L1330
>
> The lock is taken in gnet_stats_start_copy_compat() and released in
> gnet_stats_finish_copy(). The latter is skipped in the failure path. It
> seems this bug is present all the way up to Eric's change to remove the
> locking entirely (which went into 4.8). So I guess you could get a patch
> accepted for the stable trees in 3.16 and 4.4; not that this would help
> you much if you are stuck on 3.16.7…

Hehe, “crossing the streams” here. :) That’s what I gathered after looking at that code for a while, but I’m glad to be sure about it.
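
To make sure we are talking about the same thing, here is a small userspace sketch of the imbalance as I understand it. It is not kernel code: the lock is just a depth counter, and every fake_* name (and leaked_locks) is a hypothetical stand-in I made up for illustration. It only models the shape of the call sequence described above, where the lock is taken at the start of the stats copy, released at the finish, and the finish is skipped when the qdisc's dump_stats callback fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the qdisc tree lock: a depth counter,
 * not a real spinlock. */
static int lock_depth;

static void fake_lock(void)   { lock_depth++; }
static void fake_unlock(void) { assert(lock_depth > 0); lock_depth--; }

/* Models gnet_stats_start_copy_compat(): takes the lock. */
static int fake_start_copy(void)  { fake_lock(); return 0; }

/* Models gnet_stats_finish_copy(): releases the lock. */
static int fake_finish_copy(void) { fake_unlock(); return 0; }

/* Models a dump_stats callback such as cake_dump_stats.
 * put_fails simulates nla_put_u32() failing; apply_workaround
 * simulates unlocking before returning the error. */
static int fake_dump_stats(bool put_fails, bool apply_workaround)
{
    if (put_fails) {
        if (apply_workaround)
            fake_unlock();
        return -1;
    }
    return 0;
}

/* Models the caller: on failure it returns without ever reaching
 * the finish step, so nothing there releases the lock. */
static int fake_fill_qdisc(bool put_fails, bool apply_workaround)
{
    if (fake_start_copy() < 0)
        return -1;
    if (fake_dump_stats(put_fails, apply_workaround) < 0)
        return -1; /* failure path: finish_copy is skipped */
    return fake_finish_copy();
}

/* Returns how many locks are still held after one simulated dump. */
int leaked_locks(bool put_fails, bool apply_workaround)
{
    lock_depth = 0;
    fake_fill_qdisc(put_fails, apply_workaround);
    return lock_depth;
}
```

In this model, leaked_locks(true, false) leaves the lock held, while leaked_locks(true, true) does not, which is the effect I am after with the workaround. It also shows why the workaround would be wrong on a kernel where the caller already unlocks on failure, or where the locking is gone entirely.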

Would you accept my workaround in cake_dump_stats, or rather not?