[Cerowrt-devel] Linux network is damn fast, need more use XDP (Was: [Bloat] DC behaviors today)
Jesper Dangaard Brouer
brouer at redhat.com
Mon Dec 4 05:56:51 EST 2017
On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave at taht.net> wrote:
> Changing the topic, adding bloat.
Adding netdev, and also adjusting the topic to be a rant on how the Linux
kernel network stack is actually damn fast, and if you need something
faster then XDP can solve your needs...
> Joel Wirāmu Pauling <joel at aenertia.net> writes:
>
> > Just from a Telco/Industry perspective slant.
> >
> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
> > Mellanox X5 cards are the current hotness, and their offload
> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
> > OVS flow rules programming into the card. We have a lot of customers
> > chomping at the bit for that feature (disclaimer I work for Nuage
> > Networks, and we are working on enhanced OVS to do just that) for NFV
> > workloads.
>
> What Jesper's been working on for ages has been to try and get linux's
> PPS up for small packets, which last I heard was hovering at about
> 4Gbits.
I hope you made a typo here Dave, the normal Linux kernel is definitely
way beyond 4Gbit/s, you must have misunderstood something, maybe you
meant 40Gbit/s? (which is also too low)
Scaling up to more CPUs and TCP-streams, Tariq[1] and I have shown the
Linux kernel network stack scales to 94Gbit/s (linerate minus overhead).
But when the driver's page-recycler fails, we hit bottlenecks in the
page-allocator, which cause negative scaling to around 43Gbit/s.
[1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
Linux has for a _long_ time been doing 10Gbit/s TCP-stream easily, on
a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
but over the last couple of years the network stack has been optimized
(with UDP workloads), and as a result we can do 10G without TSO/GRO on a
single-CPU. This is "only" 812Kpps with MTU size frames.
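The 812Kpps figure can be reproduced with a little arithmetic. The sketch below assumes a 1500-byte MTU and the standard Ethernet per-frame overhead (header, FCS, preamble and inter-frame gap); the function names are made up for illustration.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet figures).
ETH_HEADER = 14      # dst MAC + src MAC + EtherType
ETH_FCS = 4          # frame check sequence
PREAMBLE_SFD = 8     # preamble + start-of-frame delimiter
INTERFRAME_GAP = 12  # mandatory idle time between frames


def wire_bytes(payload_bytes: int) -> int:
    """Bytes one frame occupies on the wire for a given L3 payload (MTU) size."""
    return payload_bytes + ETH_HEADER + ETH_FCS + PREAMBLE_SFD + INTERFRAME_GAP


def packets_per_second(link_bits_per_sec: float, payload_bytes: int) -> float:
    """Frame rate needed to fill a link with frames of this payload size."""
    return link_bits_per_sec / (wire_bytes(payload_bytes) * 8)


pps_10g = packets_per_second(10e9, 1500)
print(f"{wire_bytes(1500)} bytes on the wire per frame")
print(f"{pps_10g:,.0f} packets/sec at 10 Gbit/s")
```

With 1538 bytes on the wire per frame, 10 Gbit/s works out to roughly 812 thousand frames per second.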
It is important to NOTICE that I'm mostly talking about SINGLE-CPU
performance. But the Linux kernel scales very well to more CPUs, and
you can scale this up, although we are starting to hit scalability
issues in MM-land[1].
I've also demonstrated that the netdev-community has optimized the kernel's
per-CPU processing power to around 2Mpps. What does this really
mean... well with MTU size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
should be around 2Mpps.... That implies Linux can do 25Gbit/s on a
single CPU without GRO (MTU size frames). Do you need more I ask?
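Going the other way, a rough sketch of what 2Mpps per CPU buys in link rate and in time per packet. It assumes a 1500-byte MTU and standard Ethernet wire overhead; the helper names are illustrative only.

```python
WIRE_OVERHEAD = 14 + 4 + 8 + 12  # Ethernet header, FCS, preamble+SFD, inter-frame gap


def gbit_per_sec(pps: float, mtu_bytes: int = 1500) -> float:
    """Link rate (Gbit/s, counted on the wire) filled by `pps` MTU-size frames."""
    return pps * (mtu_bytes + WIRE_OVERHEAD) * 8 / 1e9


def ns_per_packet(pps: float) -> float:
    """Average time budget per packet, in nanoseconds, at a given packet rate."""
    return 1e9 / pps


per_cpu_pps = 2e6  # the "around 2Mpps" per-CPU figure quoted above
print(f"{gbit_per_sec(per_cpu_pps):.1f} Gbit/s with MTU-size frames")
print(f"{ns_per_packet(per_cpu_pps):.0f} ns budget per packet")
```

Under these assumptions 2Mpps corresponds to a bit under 25 Gbit/s on the wire, which matches the "around 2Mpps" estimate, and leaves about 500 ns of CPU time per packet.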
> The route table lookup also really expensive on the main cpu.
Well, it used-to-be very expensive. Vincent Bernat wrote some excellent
blogposts[2][3] on the recent improvements over kernel versions, and
gave due credit to people involved.
[2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
[3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux
He measured a cost of around 25 to 35 nanoseconds per route lookup. My own
recent measurements showed a cost of 36.9 ns for fib_table_lookup.
> Does this stuff offload the route table lookup also?
If you have not heard, the netdev-community has worked on something
called XDP (eXpress Data Path). This is a new layer in the network
stack, that basically operates at the same "layer"/level as DPDK.
Thus, surprise, we get the same performance numbers as DPDK. E.g. I can
do 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs=14.6Mpps).
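To put those Mpps numbers in context, here is a small sketch comparing them with the theoretical 10 Gbit/s Ethernet packet rate. It assumes the tests used minimum-size 64-byte frames on a 10GbE ixgbe port; the frame size is not stated in this mail, so treat that as an assumption.

```python
MIN_FRAME = 64             # minimum Ethernet frame, header and FCS included
PREAMBLE_AND_GAP = 8 + 12  # preamble+SFD and inter-frame gap, in bytes


def line_rate_pps(link_bits_per_sec: float, frame_bytes: int = MIN_FRAME) -> float:
    """Maximum frames per second a link can carry at a given frame size."""
    return link_bits_per_sec / ((frame_bytes + PREAMBLE_AND_GAP) * 8)


line_rate = line_rate_pps(10e9)  # assumes a 10 Gbit/s port
print(f"10G line rate with 64-byte frames: {line_rate / 1e6:.2f} Mpps")

# Figures quoted above, as a share of that theoretical maximum.
for label, mpps in [("single CPU", 13.4), ("more CPUs", 14.6)]:
    share = mpps * 1e6 / line_rate
    print(f"{label}: {mpps} Mpps is {share:.0%} of line rate")
```

If the frame-size assumption holds, the single-CPU number is about 90% of line rate and the multi-CPU number is close to the limit of the link itself.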
We can actually use XDP for (software) offloading the Linux routing
table. There are two methods we are experimenting with:
(1) externally monitor route changes from userspace and update BPF-maps
to reflect this. That approach is already accepted upstream[4][5]. I'm
measuring 9,513,746 pps per CPU with that approach.
(2) add a bpf helper to simply call fib_table_lookup() from the XDP hook.
These are still experimental patches (credit to David Ahern), and I've
measured 9,350,160 pps with this approach on a single CPU. Using more
CPUs we hit 14.6Mpps (only used 3 CPUs in that test).
[4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
[5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c
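To illustrate the idea behind method (1), here is a plain-Python sketch of a userspace-maintained table that answers longest-prefix-match lookups. It is only a stand-in for the concept: the class and method names are hypothetical, it does not use BPF or netlink, and it is not how the samples in [4][5] are written.

```python
import ipaddress


class RouteMirror:
    """Userspace-side copy of an IPv4 routing table, keyed by prefix length.

    Hypothetical stand-in for the lookup table that a route monitor would
    keep in sync; none of these names come from the kernel samples.
    """

    def __init__(self):
        # prefix length -> {network address as int: next hop}
        self._tables = {}

    def add_route(self, cidr, next_hop):
        """Call when a route-add notification is seen."""
        net = ipaddress.IPv4Network(cidr)
        self._tables.setdefault(net.prefixlen, {})[int(net.network_address)] = next_hop

    def del_route(self, cidr):
        """Call when a route-delete notification is seen."""
        net = ipaddress.IPv4Network(cidr)
        self._tables.get(net.prefixlen, {}).pop(int(net.network_address), None)

    def lookup(self, addr):
        """Longest-prefix match: try the most specific prefix lengths first."""
        ip = int(ipaddress.IPv4Address(addr))
        for plen in sorted(self._tables, reverse=True):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            hop = self._tables[plen].get(ip & mask)
            if hop is not None:
                return hop
        return None


routes = RouteMirror()
routes.add_route("0.0.0.0/0", "192.0.2.1")
routes.add_route("10.0.0.0/8", "192.0.2.2")
routes.add_route("10.1.0.0/16", "192.0.2.3")
first = routes.lookup("10.1.2.3")   # most specific match is 10.1.0.0/16
routes.del_route("10.1.0.0/16")     # e.g. after a route-delete notification
second = routes.lookup("10.1.2.3")  # falls back to 10.0.0.0/8
print(first, second)
```

The point of the split is that the slow part (tracking route changes) stays in userspace, while the per-packet path only does a table lookup.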
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer