[net-next PATCH 00/14] lockless qdisc series

I'm forwarding this sort of stuff 'cause I keep hoping to find more
optimizations for cake, and it really seems like cheap multicores have
grown very common.

> This series adds support for building lockless qdiscs. This is
> the result of noticing the qdisc lock is a common hot-spot in
> perf analysis of the Linux network stack, especially when testing
> with high packet-per-second rates. However, nothing is free, and
> most qdiscs rely on the qdisc lock to protect their data structures,
> so each qdisc must be converted on a case-by-case basis. In this
> series, to kick things off, we make pfifo_fast, mq, and mqprio
> lockless. Follow-up series can address additional qdiscs as needed;
> for example, sch_tbf might be useful. To allow this, the lockless
> design is opt-in via a flag. In some future utopia we convert all
> qdiscs and we get to drop this case analysis, but in order to
> make progress we live in the real world.
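
A minimal userspace sketch of the opt-in idea, not the kernel code: the transmit path takes the root lock only when the qdisc has not set a "no lock" flag. The patch titles mention TCQ_F_NOLOCK; everything named toy_* or TOY_* here is hypothetical, and the spin lock built on atomic_flag only stands in for the real qdisc lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag value; it only mirrors the idea behind TCQ_F_NOLOCK. */
#define TOY_F_NOLOCK 0x1u

struct toy_qdisc {
    unsigned int flags;       /* TOY_F_NOLOCK set: qdisc synchronizes itself */
    atomic_flag root_lock;    /* stand-in for the qdisc root lock */
    int (*enqueue)(struct toy_qdisc *q, int pkt);
    int lock_acquisitions;    /* instrumentation for this example only */
    int qlen;
};

/*
 * Trivial enqueue that only counts packets. It is NOT safe for concurrent
 * callers; a real lockless qdisc needs a lockless data structure here.
 */
static int toy_count_enqueue(struct toy_qdisc *q, int pkt)
{
    (void)pkt;
    q->qlen++;
    return 0;
}

static void toy_qdisc_init(struct toy_qdisc *q, unsigned int flags)
{
    assert(q != NULL);
    q->flags = flags;
    atomic_flag_clear(&q->root_lock);
    q->enqueue = toy_count_enqueue;
    q->lock_acquisitions = 0;
    q->qlen = 0;
}

/* Transmit path: only qdiscs that did not opt in pay for the root lock. */
static int toy_xmit(struct toy_qdisc *q, int pkt)
{
    int rc;

    assert(q != NULL && q->enqueue != NULL);
    if (q->flags & TOY_F_NOLOCK)
        return q->enqueue(q, pkt);    /* qdisc handles its own locking */

    while (atomic_flag_test_and_set_explicit(&q->root_lock,
                                             memory_order_acquire))
        ;                             /* spin until the lock is free */
    q->lock_acquisitions++;
    rc = q->enqueue(q, pkt);
    atomic_flag_clear_explicit(&q->root_lock, memory_order_release);
    return rc;
}
```

Because the flag is checked per qdisc, converted and unconverted qdiscs can coexist, which matches the case-by-case conversion described above.
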
> There are also a handful of optimizations I have behind this
> series and a few code cleanups that I couldn't figure out how
> to fit neatly into this series without increasing the patch
> count. Once this is in, additional patches can address them. The
> most notable is in skb_dequeue, where we can push the consumer lock
> out a bit and consume multiple skbs off the skb_array in
> pfifo_fast per iteration. Ideally we could push arrays of packets at
> drivers as well, but we would need the infrastructure for this.
> The other notable improvement is to do less locking in the
> overrun cases where the bad tx queue list and gso_skb are being
> hit. Although nice in theory, in practice this is the error
> case and I haven't found a benchmark where this matters yet.
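
The batching idea above can be sketched in userspace C, under loud assumptions: this is a toy pointer ring, not skb_array, and all toy_* names are hypothetical. A NULL slot means "empty", and the consumer drains up to `max` entries per lock acquisition instead of locking once per packet. There are no memory barriers on the slots, so the sketch only illustrates the locking pattern, not a correct concurrent ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_RING_SIZE 8u    /* hypothetical capacity */

struct toy_ring {
    void *slots[TOY_RING_SIZE];   /* NULL means the slot is empty */
    size_t producer;              /* next slot to fill */
    size_t consumer;              /* next slot to drain */
    atomic_flag consumer_lock;
    unsigned int consumer_lock_taken;  /* instrumentation only */
};

static void toy_ring_init(struct toy_ring *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < TOY_RING_SIZE; i++)
        r->slots[i] = NULL;
    r->producer = 0;
    r->consumer = 0;
    atomic_flag_clear(&r->consumer_lock);
    r->consumer_lock_taken = 0;
}

/* Returns 0 on success, -1 if the ring is full. */
static int toy_ring_produce(struct toy_ring *r, void *pkt)
{
    assert(r != NULL && pkt != NULL);
    if (r->slots[r->producer] != NULL)
        return -1;
    r->slots[r->producer] = pkt;
    r->producer = (r->producer + 1) % TOY_RING_SIZE;
    return 0;
}

/* Drain up to `max` packets while holding the consumer lock once. */
static size_t toy_ring_consume_batch(struct toy_ring *r, void **out,
                                     size_t max)
{
    size_t n = 0;

    assert(r != NULL && out != NULL);
    while (atomic_flag_test_and_set_explicit(&r->consumer_lock,
                                             memory_order_acquire))
        ;                         /* spin until the lock is free */
    r->consumer_lock_taken++;
    while (n < max && r->slots[r->consumer] != NULL) {
        out[n++] = r->slots[r->consumer];
        r->slots[r->consumer] = NULL;
        r->consumer = (r->consumer + 1) % TOY_RING_SIZE;
    }
    atomic_flag_clear_explicit(&r->consumer_lock, memory_order_release);
    return n;
}
```

With five packets queued and a batch size of four, the consumer lock is taken twice rather than five times; the lock cost is amortized over the batch.
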
> For testing...
>
> My first test case uses multiple containers (via cilium) where
> multiple client containers use 'wrk' to benchmark connections with
> a server container running lighttpd, which is configured
> to use multiple threads, one per core. Additionally, this test has
> a proxy agent running, so all traffic takes an extra hop through a
> proxy container. In these cases each TCP packet traverses the egress
> qdisc layer at least four times and the ingress qdisc layer an
> additional four times. This makes for a good stress test IMO; perf
> details below.
>
> The other micro-benchmark I run is injecting packets directly into
> the qdisc layer using pktgen. This uses the benchmark script,
>  ./pktgen_bench_xmit_mode_queue_xmit.sh
>
> Benchmarks were taken in two cases: "base", running the latest
> net-next with no changes to the qdisc layer, and "qdisc", run with the
> lockless qdisc updates. Numbers are reported in req/sec. All virtual
> 'veth' devices run with pfifo_fast in the qdisc test case.
> `wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"`
> conns      16     32     64   1024
> ----------------------------------
> base:   18831  20201  21393  29151
> qdisc:  19309  21063  23899  29265
> Notice that in all cases we see a performance improvement when
> running with the lockless qdisc.
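
To put the wrk table in relative terms, here is a small sketch that computes the percentage change of "qdisc" over "base" for each connection count. The req/sec figures are copied from the table; the function and array names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `test` over `base`, in percent. */
static double rel_gain_pct(double base, double test)
{
    assert(base > 0.0);
    return (test - base) / base * 100.0;
}

/* req/sec figures copied from the wrk table in the quoted mail. */
static const int    wrk_conns[] = { 16, 32, 64, 1024 };
static const double wrk_base[]  = { 18831, 20201, 21393, 29151 };
static const double wrk_qdisc[] = { 19309, 21063, 23899, 29265 };

/* Print one line per connection count with the relative gain. */
static void print_wrk_gains(void)
{
    for (int i = 0; i < 4; i++)
        printf("conns=%d gain=%.2f%%\n", wrk_conns[i],
               rel_gain_pct(wrk_base[i], wrk_qdisc[i]));
}
```

The gains work out to roughly 2.5%, 4.3%, 11.7%, and 0.4% for 16, 32, 64, and 1024 connections, so the improvement is largest at 64 connections and nearly vanishes at 1024.
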
> Microbenchmarks using pktgen are as follows,
> `pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 20000000`
> base(mq):          2.1Mpps
> base(pfifo_fast):  2.1Mpps
> qdisc(mq):         2.6Mpps
> qdisc(pfifo_fast): 2.6Mpps
> Notice the numbers are the same for mq and pfifo_fast because we are
> only testing a single thread here. In both tests we see a nice bump
> in performance. The key with 'mq' is that it is already per
> txq ring, so contention is minimal in the above cases. Qdiscs
> such as tbf or htb, which have more contention, will likely show
> larger gains when/if lockless versions are implemented.
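
One way to read the pktgen numbers is as an average time budget per packet. The sketch below converts a rate in Mpps to nanoseconds per packet; the helper names are hypothetical and the inputs are the 2.1 and 2.6 Mpps figures quoted above.

```c
#include <assert.h>

/* Average time per packet in nanoseconds at a rate given in Mpps. */
static double ns_per_packet(double mpps)
{
    assert(mpps > 0.0);
    /* 1e9 ns per second divided by (mpps * 1e6) packets per second */
    return 1000.0 / mpps;
}

/* Nanoseconds saved per packet when going from base_mpps to new_mpps. */
static double ns_saved_per_packet(double base_mpps, double new_mpps)
{
    return ns_per_packet(base_mpps) - ns_per_packet(new_mpps);
}
```

For the quoted figures this is about 476 ns per packet at 2.1 Mpps versus about 385 ns at 2.6 Mpps, so roughly 92 ns less per packet, or a throughput gain of about 24%.
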
> Thanks to everyone who helped with this work, especially Daniel
> Borkmann, Eric Dumazet and Willem de Bruijn for discussing the
> design and reviewing versions of the code.
>
> Changes from the RFC: dropped a couple of patches off the end,
> fixed a bug with skb_queue_walk_safe not unlinking the skb in all
> cases, fixed a lockdep splat with pfifo_fast_destroy not calling the
> *_bh lock variant, and addressed _most_ of Willem's comments. There
> was also a bug in the bulk locking (final patch) of the RFC series.
>
> @Willem, I left out lockdep annotation for a follow-on series
> to add lockdep more completely, rather than just in code I
> touched.
>
> Comments and feedback welcome.
> ---
> John Fastabend (14):
>       net: sched: cleanup qdisc_run and __qdisc_run semantics
>       net: sched: allow qdiscs to handle locking
>       net: sched: remove remaining uses for qdisc_qlen in xmit path
>       net: sched: provide per cpu qstat helpers
>       net: sched: a dflt qdisc may be used with per cpu stats
>       net: sched: explicit locking in gso_cpu fallback
>       net: sched: drop qdisc_reset from dev_graft_qdisc
>       net: sched: use skb list for skb_bad_tx
>       net: sched: check for frozen queue before skb_bad_txq check
>       net: sched: helpers to sum qlen and qlen for per cpu logic
>       net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
>       net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
>       net: skb_array: expose peek API
>       net: sched: pfifo_fast use skb_array
>  include/linux/skb_array.h |    5 +
>  include/net/gen_stats.h   |    3
>  include/net/pkt_sched.h   |   10 +
>  include/net/sch_generic.h |   79 +++++++-
>  net/core/dev.c            |   31 +++
>  net/core/gen_stats.c      |    9 +
>  net/sched/sch_api.c       |    8 +
>  net/sched/sch_generic.c   |  440 ++++++++++++++++++++++++++++++++-------------
>  net/sched/sch_mq.c        |   34 +++
>  net/sched/sch_mqprio.c    |   69 +++++--
>  10 files changed, 512 insertions(+), 176 deletions(-)
