[Cerowrt-devel] Linux network is damn fast, need more use XDP (Was: [Bloat] DC behaviors today)
Jesper Dangaard Brouer
brouer at redhat.com
Mon Dec 4 05:56:51 EST 2017
On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave at taht.net> wrote:
> Changing the topic, adding bloat.
Adding netdev, and also adjusting the topic to be a rant about how the
Linux kernel network stack is actually damn fast, and if you need
something faster, then XDP can solve your needs...
> Joel Wirāmu Pauling <joel at aenertia.net> writes:
> > Just from a Telco/Industry perspective slant.
> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
> > Mellanox X5 cards are the current hotness, and their offload
> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
> > OVS flow rules programming into the card. We have a lot of customers
> > chomping at the bit for that feature (disclaimer I work for Nuage
> > Networks, and we are working on enhanced OVS to do just that) for NFV
> > workloads.
> What Jesper's been working on for ages has been to try and get linux's
> PPS up for small packets, which last I heard was hovering at about
> 4Gbit/s.
I hope you made a typo here, Dave: the normal Linux kernel is definitely
way beyond 4Gbit/s. You must have misunderstood something; maybe you
meant 40Gbit/s? (Which is also too low.)
Scaling up to more CPUs and more TCP streams, Tariq and I have shown that
the Linux kernel network stack scales to 94Gbit/s (line rate minus overhead).
But when the driver's page recycler fails, we hit bottlenecks in the
page allocator that cause negative scaling, down to around 43Gbit/s.
Linux has for a _long_ time been doing 10Gbit/s TCP streams easily, on
a SINGLE CPU. That is mostly thanks to TSO/GRO aggregating packets,
but over the last couple of years the network stack has been optimized
(driven by UDP workloads), and as a result we can now do 10G without
TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.
It is important to NOTICE that I'm mostly talking about SINGLE-CPU
performance. The Linux kernel scales very well to more CPUs, so you
can scale this up, although we are starting to hit scalability issues
in MM-land (memory management).
I've also demonstrated that the netdev community has optimized the
kernel's per-CPU processing power to around 2Mpps. What does this
really mean? Well, with MTU-size packets 812Kpps was 10Gbit/s, thus
25Gbit/s should be around 2Mpps... That implies Linux can do 25Gbit/s
on a single CPU without GRO (MTU-size frames). Do you need more, I ask?
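As a back-of-envelope check (my own arithmetic, not from any of the
benchmarks above), both numbers fall straight out of the on-wire size
of an MTU-1500 Ethernet frame:

#include <stdio.h>

int main(void)
{
        /* On-wire size of one MTU-1500 Ethernet frame:
         * 1500 payload + 14 header + 4 FCS + 8 preamble + 12 IFG = 1538B
         */
        const double wire_bits = 1538.0 * 8;    /* 12304 bits */

        printf("10G: %.0f pps\n", 10e9 / wire_bits); /* ~812,744   */
        printf("25G: %.0f pps\n", 25e9 / wire_bits); /* ~2,031,860 */
        return 0;
}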
> The route table lookup is also really expensive on the main cpu.
Well, it used to be very expensive. Vincent Bernat wrote some excellent
blog posts on the recent improvements across kernel versions, and
gave due credit to the people involved.
He measured a cost of around 25 to 35 nanoseconds per route lookup. My
own recent measurements put the cost of fib_table_lookup() at 36.9 ns.
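To put that number in perspective, here is my own back-of-envelope
(illustrative only, not part of the measurements above) of how much of
a single CPU's per-packet time budget a ~36.9 ns lookup consumes:

#include <stdio.h>

int main(void)
{
        const double lookup_ns = 36.9;  /* measured fib_table_lookup() */
        /* per-packet budget in ns = 1e9 / pps */
        const double budget_mtu = 1e9 /   812744.0; /* 10G @ 1500B: ~1230 ns */
        const double budget_64B = 1e9 / 14880952.0; /* 10G @   64B: ~67.2 ns */

        printf("1500B frames: lookup is %.1f%% of budget\n",
               100.0 * lookup_ns / budget_mtu);
        printf("  64B frames: lookup is %.1f%% of budget\n",
               100.0 * lookup_ns / budget_64B);
        return 0;
}

At MTU sizes the lookup is noise (~3%), but at 64-byte wire rate it
eats more than half the budget, which is why per-lookup nanoseconds
matter for small-packet forwarding.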
> Does this stuff offload the route table lookup also?
In case you have not heard, the netdev community has been working on
something called XDP (eXpress Data Path). This is a new layer in the
network stack that basically operates at the same "layer"/level as
DPDK. Thus, no surprise, we get the same performance numbers as DPDK.
E.g. I can do 13.4 Mpps forwarding with ixgbe on a single CPU (more
CPUs: 14.6 Mpps).
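For those who have not seen it: an XDP program is a BPF program that
the driver runs on every received frame, before any SKB is allocated,
and its return value is the verdict. A minimal sketch (my example,
written against current libbpf conventions, not code from this thread):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Runs in driver context for every received frame, before SKB
 * allocation. XDP_PASS hands the packet to the normal stack;
 * XDP_DROP, XDP_TX and XDP_REDIRECT are the fast-path verdicts.
 */
SEC("xdp")
int xdp_prog(struct xdp_md *ctx)
{
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";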
We can actually use XDP for (software) offloading of the Linux routing
table. There are two methods we are experimenting with (a sketch of the
first follows below):

(1) Externally monitor route changes from userspace and update BPF maps
to reflect them. That approach is already accepted upstream. I'm
measuring 9,513,746 pps per CPU with that approach.

(2) Add a BPF helper to simply call fib_table_lookup() from the XDP
hook. These are still experimental patches (credit to David Ahern),
and I've measured 9,350,160 pps with this approach on a single CPU.
Using more CPUs we hit 14.6 Mpps (only 3 CPUs were used in that test).
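To make approach (1) concrete, here is a rough sketch of mine (not the
upstream sample; the map layout and all names are made up) of an XDP
program doing longest-prefix-match lookups in a BPF LPM-trie map that
a userspace daemon keeps in sync with the kernel routing table:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct lpm_key {
        __u32 prefixlen; /* LPM-trie keys must start with the prefix length */
        __u32 addr;      /* IPv4 destination, network byte order */
};

struct route_val {
        __u32 ifindex;   /* egress interface chosen by userspace */
};

struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 100000);
        __type(key, struct lpm_key);
        __type(value, struct route_val);
        __uint(map_flags, BPF_F_NO_PREALLOC); /* required for LPM tries */
} routes SEC(".maps");

SEC("xdp")
int xdp_router(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
                return XDP_DROP;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
                return XDP_DROP;

        struct lpm_key key = { .prefixlen = 32, .addr = iph->daddr };
        struct route_val *rt = bpf_map_lookup_elem(&routes, &key);
        if (!rt)
                return XDP_PASS;  /* no entry: fall back to the stack */

        /* A real forwarder would also rewrite the MAC addresses and
         * decrement the IP TTL before transmitting. */
        return bpf_redirect(rt->ifindex, 0);
}

char _license[] SEC("license") = "GPL";

The userspace half would listen for RTM_NEWROUTE/RTM_DELROUTE rtnetlink
messages and mirror them into the map with bpf_map_update_elem() and
bpf_map_delete_elem().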
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat