Cake - FQ_codel the next generation
* [Cake] Fwd: [net-next PATCH 00/14] lockless qdisc series
       [not found] <20171207173500.5771.41198.stgit@john-Precision-Tower-5810>
@ 2017-12-07 18:53 ` Dave Taht
       [not found]   ` <CAJq5cE22iXQsrwT3o0sv_=ydrk6wr6nqtoN3k5Tf1mxdzJC04A@mail.gmail.com>
  2017-12-07 19:07   ` [Cake] " Dave Taht
  0 siblings, 2 replies; 5+ messages in thread
From: Dave Taht @ 2017-12-07 18:53 UTC (permalink / raw)
  To: Cake List

---------- Forwarded message ----------
From: John Fastabend <john.fastabend@gmail.com>
Date: Thu, Dec 7, 2017 at 9:53 AM
Subject: [net-next PATCH 00/14] lockless qdisc series
To: willemdebruijn.kernel@gmail.com, daniel@iogearbox.net,
eric.dumazet@gmail.com, davem@davemloft.net
Cc: netdev@vger.kernel.org, jiri@resnulli.us, xiyou.wangcong@gmail.com


This series adds support for building lockless qdiscs. It is
the result of noticing that the qdisc lock is a common hot spot in
perf analysis of the Linux network stack, especially when testing
at high packet-per-second rates. However, nothing is free: most
qdiscs rely on the qdisc lock to protect their data structures, so
each qdisc must be converted on a case-by-case basis. In this
series, to kick things off, we make pfifo_fast, mq, and mqprio
lockless. Follow-up series can address additional qdiscs as needed;
sch_tbf, for example, might be useful. To allow this, the lockless
design is an opt-in flag. In some future utopia we convert all
qdiscs and get to drop this case analysis, but in order to make
progress we live in the real world.
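
To make the opt-in concrete, here is a minimal sketch of how a transmit
path can branch on such a flag. It is illustrative only: TCQ_F_NOLOCK is
the flag added by this series, but sketch_xmit() below is a made-up
stand-in for the real __dev_xmit_skb() changes, not code from the patches.

  #include <net/pkt_sched.h>
  #include <net/sch_generic.h>

  /* Minimal sketch of the opt-in: a qdisc that sets TCQ_F_NOLOCK is
   * driven without the per-qdisc root lock; everything else keeps the
   * old serialized path.  Not the actual __dev_xmit_skb() changes.
   */
  static int sketch_xmit(struct sk_buff *skb, struct Qdisc *q)
  {
      struct sk_buff *to_free = NULL;
      int rc;

      if (q->flags & TCQ_F_NOLOCK) {
          /* converted qdiscs (pfifo_fast, mq, mqprio here) protect
           * their own data structures, e.g. via skb_array
           */
          rc = q->enqueue(skb, q, &to_free);
          qdisc_run(q);
      } else {
          spin_lock(qdisc_lock(q));   /* legacy: serialize on root lock */
          rc = q->enqueue(skb, q, &to_free);
          qdisc_run(q);
          spin_unlock(qdisc_lock(q));
      }

      if (unlikely(to_free))
          kfree_skb_list(to_free);
      return rc;
  }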

There are also a handful of optimizations I have queued behind this
series, plus a few code cleanups that I couldn't figure out how
to fit neatly into it without increasing the patch count. Once this
is in, additional patches can address them. The most notable is
that in skb_dequeue we can push the consumer lock out a bit and
consume multiple skbs off the skb_array in pfifo_fast per
iteration. Ideally we could push arrays of packets at drivers as
well, but we would need the infrastructure for that. The other
notable improvement is to do less locking in the overrun cases
where the bad tx queue list and gso_skb are being hit. Although
nice in theory, in practice this is the error case, and I haven't
found a benchmark where it matters yet.
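
As a sketch of that batching idea (assuming the ptr_ring that backs
skb_array; this is not one of the 14 patches), something along these
lines would take the consumer lock once per pass instead of once per
packet:

  #include <linux/skb_array.h>
  #include <linux/skbuff.h>

  /* Take the skb_array consumer lock once and pull several skbs per
   * dequeue pass, instead of paying one lock round-trip per packet.
   * Uses the ptr_ring that backs skb_array; illustrative only.
   */
  static unsigned int sketch_consume_batch(struct skb_array *a,
                                           struct sk_buff **batch,
                                           unsigned int budget)
  {
      struct sk_buff *skb;
      unsigned int n = 0;

      spin_lock(&a->ring.consumer_lock);        /* one acquisition... */
      while (n < budget) {
          skb = __ptr_ring_consume(&a->ring);   /* ...many packets */
          if (!skb)
              break;
          batch[n++] = skb;
      }
      spin_unlock(&a->ring.consumer_lock);

      return n;   /* caller pushes batch[0..n-1] toward the driver */
  }

The win comes from amortizing the lock/unlock pair, and its cache-line
bouncing, over the whole batch.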

For testing...

My first test case uses multiple containers (via cilium), where
multiple client containers use 'wrk' to benchmark connections against
a server container running lighttpd, with lighttpd configured to
use multiple threads, one per core. Additionally, this test has a
proxy agent running, so all traffic takes an extra hop through a
proxy container. In these cases each TCP packet traverses the egress
qdisc layer at least four times and the ingress qdisc layer an
additional four times. This makes for a good stress test IMO; perf
details below.

The other micro-benchmark I run injects packets directly into the
qdisc layer using pktgen. This uses the benchmark script,

 ./pktgen_bench_xmit_mode_queue_xmit.sh

Benchmarks were taken in two cases: "base", running the latest
net-next with no changes to the qdisc layer, and "qdisc", run with
the lockless qdisc updates. The wrk numbers are reported in req/sec.
All virtual 'veth' devices use pfifo_fast in the qdisc test case.

`wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"`

conns    16      32     64   1024
-----------------------------------------------
base:   18831  20201  21393  29151
qdisc:  19309  21063  23899  29265

Notice that in all cases we see a performance improvement when
running the qdisc case.

Microbenchmarks using pktgen are as follows,

`pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 20000000`

base(mq):          2.1Mpps
base(pfifo_fast):  2.1Mpps
qdisc(mq):         2.6Mpps
qdisc(pfifo_fast): 2.6Mpps

Notice the numbers are the same for mq and pfifo_fast because only
a single thread is being tested here. In both tests we see a nice
bump in performance. The key with 'mq' is that it is already per
txq ring, so contention is minimal in the above cases. Qdiscs
such as tbf or htb, which have more contention, will likely show
larger gains when/if lockless versions are implemented.
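
For reference, the reason mq sees so little contention is structural:
it only instantiates one independent child qdisc per TX queue. A rough
sketch, loosely modelled on sch_mq.c rather than taken from these
patches:

  #include <linux/netdevice.h>
  #include <linux/rtnetlink.h>
  #include <net/sch_generic.h>

  /* 'mq' is only a container: it creates one independent child qdisc
   * per hardware TX queue, so each queue (and usually each CPU) mostly
   * touches its own qdisc.  Loosely modelled on sch_mq.c; not the
   * patch code itself.
   */
  static int sketch_mq_init(struct net_device *dev, struct Qdisc **children,
                            u32 parent_handle)
  {
      unsigned int ntx;

      for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
          struct netdev_queue *txq = netdev_get_tx_queue(dev, ntx);

          /* one pfifo_fast (lockless after this series) per txq */
          children[ntx] = qdisc_create_dflt(txq, &pfifo_fast_ops,
                                            TC_H_MAKE(TC_H_MAJ(parent_handle),
                                                      TC_H_MIN(ntx + 1)));
          if (!children[ntx])
              return -ENOMEM;
      }
      return 0;
  }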

Thanks to everyone who helped with this work especially Daniel
Borkmann, Eric Dumazet and Willem de Bruijn for discussing the
design and reviewing versions of the code.

Changes from the RFC: dropped a couple of patches off the end,
fixed a bug where skb_queue_walk_safe did not unlink the skb in
all cases, fixed a lockdep splat caused by pfifo_fast_destroy not
calling the *_bh lock variant, and addressed _most_ of Willem's
comments; there was also a bug in the bulk locking (final patch)
of the RFC series.

@Willem, I left the lockdep annotation out for a follow-on series
that adds lockdep more completely, rather than just in the code I
touched.

Comments and feedback welcome.

Thanks,
John

---

John Fastabend (14):
      net: sched: cleanup qdisc_run and __qdisc_run semantics
      net: sched: allow qdiscs to handle locking
      net: sched: remove remaining uses for qdisc_qlen in xmit path
      net: sched: provide per cpu qstat helpers
      net: sched: a dflt qdisc may be used with per cpu stats
      net: sched: explicit locking in gso_cpu fallback
      net: sched: drop qdisc_reset from dev_graft_qdisc
      net: sched: use skb list for skb_bad_tx
      net: sched: check for frozen queue before skb_bad_txq check
      net: sched: helpers to sum qlen and qlen for per cpu logic
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
      net: skb_array: expose peek API
      net: sched: pfifo_fast use skb_array


 include/linux/skb_array.h |    5 +
 include/net/gen_stats.h   |    3
 include/net/pkt_sched.h   |   10 +
 include/net/sch_generic.h |   79 +++++++-
 net/core/dev.c            |   31 +++
 net/core/gen_stats.c      |    9 +
 net/sched/sch_api.c       |    8 +
 net/sched/sch_generic.c   |  440 ++++++++++++++++++++++++++++++++-------------
 net/sched/sch_mq.c        |   34 +++
 net/sched/sch_mqprio.c    |   69 +++++--
 10 files changed, 512 insertions(+), 176 deletions(-)

--
Signature


-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


* Re: [Cake] Fwd: [net-next PATCH 00/14] lockless qdisc series
       [not found]     ` <CAJq5cE3ESNejMEPDJORtDtHZpTTQKROs3-e1xbejajrpciYD2A@mail.gmail.com>
@ 2017-12-07 19:05       ` Jonathan Morton
  0 siblings, 0 replies; 5+ messages in thread
From: Jonathan Morton @ 2017-12-07 19:05 UTC (permalink / raw)
  To: Dave Taht; +Cc: Cake List


Yeah, that's likely to help Cake on weak multicores; even if we can't
actually make individual Cake instances concurrent, we should benefit from
running ingress and egress in parallel with each other.

And there's every chance we *can* make Cake run concurrently to some
extent, even if it's just enqueue with dequeue.  So you can then bring four
cores to bear on simple bidirectional traffic.
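
A purely hypothetical sketch of that enqueue/dequeue split, using the
same skb_array primitive this patch series builds on (none of these
names exist in sch_cake):

  #include <linux/skb_array.h>

  /* Hand packets from the enqueue side to the dequeue side through an
   * skb_array, so the two halves only contend on the ring's producer
   * and consumer locks rather than on one big qdisc lock.  Nothing
   * like this exists in sch_cake; all names below are made up.
   */
  struct cake_handoff {
      struct skb_array ring;
  };

  static int handoff_init(struct cake_handoff *h, int depth)
  {
      return skb_array_init(&h->ring, depth, GFP_KERNEL);
  }

  static int handoff_enqueue(struct cake_handoff *h, struct sk_buff *skb)
  {
      /* producer side: runs on whichever CPU is enqueuing */
      return skb_array_produce(&h->ring, skb);   /* -ENOSPC when full */
  }

  static struct sk_buff *handoff_dequeue(struct cake_handoff *h)
  {
      /* consumer side: can run concurrently on another CPU */
      return skb_array_consume(&h->ring);
  }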

- Jonathan Morton



* Re: [Cake] [net-next PATCH 00/14] lockless qdisc series
  2017-12-07 18:53 ` [Cake] Fwd: [net-next PATCH 00/14] lockless qdisc series Dave Taht
       [not found]   ` <CAJq5cE22iXQsrwT3o0sv_=ydrk6wr6nqtoN3k5Tf1mxdzJC04A@mail.gmail.com>
@ 2017-12-07 19:07   ` Dave Taht
  2017-12-07 21:57     ` Pete Heist
  1 sibling, 1 reply; 5+ messages in thread
From: Dave Taht @ 2017-12-07 19:07 UTC (permalink / raw)
  To: Cake List

I'm forwarding this sort of stuff 'cause I keep hoping to find more
optimizations for cake, and it really seems like cheap multicores have
grown very common.




On Thu, Dec 7, 2017 at 10:53 AM, Dave Taht <dave.taht@gmail.com> wrote:
> ---------- Forwarded message ----------
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Thu, Dec 7, 2017 at 9:53 AM
> Subject: [net-next PATCH 00/14] lockless qdisc series
> To: willemdebruijn.kernel@gmail.com, daniel@iogearbox.net,
> eric.dumazet@gmail.com, davem@davemloft.net
> Cc: netdev@vger.kernel.org, jiri@resnulli.us, xiyou.wangcong@gmail.com



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


* Re: [Cake] [net-next PATCH 00/14] lockless qdisc series
  2017-12-07 19:07   ` [Cake] " Dave Taht
@ 2017-12-07 21:57     ` Pete Heist
  2017-12-13  7:07       ` Pete Heist
  0 siblings, 1 reply; 5+ messages in thread
From: Pete Heist @ 2017-12-07 21:57 UTC (permalink / raw)
  To: Dave Taht; +Cc: Cake List


> On Dec 7, 2017, at 8:07 PM, Dave Taht <dave.taht@gmail.com> wrote:
> 
> I'm forwarding this sort of stuff 'cause I keep hoping to find more
> optimizations for cake, and it really seems like cheap multicores have
> grown very common.

Great, that sounds like it could be a real improvement. I’m thinking again of ER-Xs, which are weak quad-cores that can’t run cake at more than around 100Mbps aggregate bandwidth, and do better at significantly less.

Speaking of those, Ubiquiti hasn’t responded yet, but at least the new kernel request has more Kudos than a number of other requests that have already been accepted, for what that’s worth. In case anyone hasn’t added their vote yet: https://community.ubnt.com/t5/EdgeMAX-Feature-Requests/Upgrade-Linux-kernel-to-at-least-4-4/idi-p/2140663


* Re: [Cake] [net-next PATCH 00/14] lockless qdisc series
  2017-12-07 21:57     ` Pete Heist
@ 2017-12-13  7:07       ` Pete Heist
  0 siblings, 0 replies; 5+ messages in thread
From: Pete Heist @ 2017-12-13  7:07 UTC (permalink / raw)
  To: Dave Taht; +Cc: Cake List


> On Dec 7, 2017, at 10:57 PM, Pete Heist <peteheist@gmail.com> wrote:
> 
> Great, that sounds like it could be a real improvement. I’m thinking again of ER-Xs, which are weak quad-cores that can’t run cake at more than around 100Mbps aggregate bandwidth, and do better at significantly less.
> 
> Speaking of those, Ubiquiti hasn’t responded yet, but at least the new kernel request has more Kudos than a number of other requests that have already been accepted, for what that’s worth. In case anyone hasn’t added their vote yet: https://community.ubnt.com/t5/EdgeMAX-Feature-Requests/Upgrade-Linux-kernel-to-at-least-4-4/idi-p/2140663

It’s looking like a >= 4.4 kernel will appear in EdgeOS 2.0, whenever that arrives…

