[Cerowrt-devel] [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)

Dave Taht dave.taht at gmail.com
Mon Dec 4 12:00:41 EST 2017


I tend to deal with netdev by itself and never cross-post there: the
bufferbloat.net servers mandate starttls (primarily to combat spam) and
vger doesn't support it at all, which leads to raising davem's blood
pressure, which I'd rather not do.

But moving on...

On Mon, Dec 4, 2017 at 2:56 AM, Jesper Dangaard Brouer
<brouer at redhat.com> wrote:
> On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave at taht.net> wrote:
>> Changing the topic, adding bloat.
> Adding netdev, and also adjusting the topic to be a rant that the Linux
> kernel network stack is actually damn fast, and if you need something
> faster then XDP can solve your needs...
>> Joel Wirāmu Pauling <joel at aenertia.net> writes:
>> > Just from a Telco/Industry perspective slant.
>> >
>> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
>> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
>> > Mellanox X5 cards are the current hotness, and their offload
>> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
>> > OVS flow rules programming into the card. We have a lot of customers
>> > chomping at the bit for that feature (disclaimer I work for Nuage
>> > Networks, and we are working on enhanced OVS to do just that) for NFV
>> > workloads.
>> What Jesper's been working on for ages has been to try and get linux's
>> PPS up for small packets, which last I heard was hovering at about
>> 4Gbits.
> I hope you made a typo here Dave, the normal Linux kernel is definitely
> way beyond 4Gbit/s, you must have misunderstood something, maybe you
> meant 40Gbit/s? (which is also too low)

The context here was PPS for *non-gro'd* tcp ack packets, in the further
context of the increasingly epic "benefits of ack filtering" thread on the
bloat list, where for 50x1 end-user asymmetry we were seeing 90% fewer acks
with the new sch_cake ack-filter code, and double the throughput...

The kind of return traffic you see from data sent outside the DC, with
tons of flows.

What's that number?

> Scaling up to more CPUs and TCP-streams, Tariq[1] and I have shown the
> Linux kernel network stack scales to 94Gbit/s (linerate minus overhead).
> But when the driver's page-recycler fails, we hit bottlenecks in the
> page-allocator, which cause negative scaling to around 43Gbit/s.

So I divide 94 by 22 and get roughly 4gbit for acks. Or I look at PPS * 66. Or?
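
One way to read the arithmetic above: a full-size data frame is roughly 22
times larger than a bare ACK, so the same packet rate carrying only ACKs moves
about 1/22 of the bits. The sketch below works through both routes. The
66-byte ACK (14 Ethernet + 20 IPv4 + 20 TCP + 12 timestamp option) and the
1514-byte data frame are assumptions, not figures stated in this thread, and
the function names are made up for illustration.

```python
# Back-of-envelope only: frame sizes are assumed, not measured.
ACK_BYTES = 14 + 20 + 20 + 12   # Ethernet + IPv4 + TCP + timestamp option = 66
DATA_BYTES = 14 + 1500          # Ethernet header + 1500-byte MTU payload = 1514


def size_ratio(data_bytes=DATA_BYTES, ack_bytes=ACK_BYTES):
    """How many times larger a full-size data frame is than a bare ACK."""
    return data_bytes / ack_bytes


def ack_gbps_from_stream(stream_gbps, ratio=22):
    """Bit rate if the same packet rate carried ACK-sized frames instead."""
    return stream_gbps / ratio


def ack_gbps_from_pps(pps, ack_bytes=ACK_BYTES):
    """The 'PPS * 66' route: packets per second times ACK size, in Gbit/s."""
    return pps * ack_bytes * 8 / 1e9


if __name__ == "__main__":
    print(f"size ratio: {size_ratio():.1f}")
    print(f"94 Gbit/s stream as ACKs: {ack_gbps_from_stream(94):.2f} Gbit/s")
    print(f"2 Mpps of ACKs: {ack_gbps_from_pps(2_000_000):.2f} Gbit/s")
```

Neither route says what the stack actually sustains for non-GRO'd ACKs; that
still needs a measurement.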

> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
> Linux has for a _long_ time been doing 10Gbit/s TCP-stream easily, on
> a SINGLE CPU.  This is mostly thanks to TSO/GRO aggregating packets,
> but in the last couple of years the network stack has been optimized
> (with UDP workloads), and as a result we can do 10G without TSO/GRO on a
> single CPU.  This is "only" 812Kpps with MTU size frames.


> It is important to NOTICE that I'm mostly talking about SINGLE-CPU
> performance.  But the Linux kernel scales very well to more CPUs, and
> you can scale this up, although we are starting to hit scalability
> issues in MM-land[1].
> I've also demonstrated that the netdev-community has optimized the kernel's
> per-CPU processing power to around 2Mpps.  What does this really
> mean... well with MTU size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
> should be around 2Mpps.... That implies Linux can do 25Gbit/s on a
> single CPU without GRO (MTU size frames).  Do you need more I ask?
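
The 812Kpps and 2Mpps figures can be checked from the link rate and the
on-wire frame size. A minimal sketch, assuming each 1500-byte MTU frame
occupies 1538 bytes on the wire (14 Ethernet header + 4 FCS + 8 preamble +
12 inter-frame gap). That breakdown is standard Ethernet framing but is not
spelled out in the thread, and the function name is made up.

```python
WIRE_OVERHEAD = 14 + 4 + 8 + 12   # Ethernet header + FCS + preamble + inter-frame gap


def line_rate_pps(link_gbps, mtu=1500, overhead=WIRE_OVERHEAD):
    """Packets per second needed to fill a link with MTU-size frames."""
    bits_per_frame = (mtu + overhead) * 8
    return link_gbps * 1e9 / bits_per_frame


if __name__ == "__main__":
    for gbps in (10, 25):
        print(f"{gbps} Gbit/s -> {line_rate_pps(gbps) / 1e3:.0f} Kpps")
```

Under these assumptions 10Gbit/s comes out just under 813Kpps and 25Gbit/s
just over 2Mpps, which matches the numbers quoted above.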

The benchmark I had in mind was, say, 100k flows going out over the internet,
and the characteristics of the ack flows on the return path.

>> The route table lookup is also really expensive on the main cpu.

To clarify the context here, I was asking specifically if the Mellanox X5 card
did routing table offload or only switching.

> Well, it used-to-be very expensive. Vincent Bernat wrote some excellent
> blogposts[2][3] on the recent improvements over kernel versions, and
> gave due credit to people involved.
> [2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
> [3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux
> He measured around 25 to 35 nanosec cost of route lookups.  My own
> recent measurements were 36.9 ns cost of fib_table_lookup.

On intel hw.
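
To put the lookup cost in proportion, it helps to turn a packet rate into a
per-packet time budget. A rough sketch, taking the 36.9 ns fib_table_lookup
figure quoted above and the roughly 2Mpps per-CPU rate mentioned earlier in
this thread. The function names are illustrative, and the calculation ignores
cache and batching effects.

```python
def ns_per_packet(pps):
    """Time available per packet, in nanoseconds, at a given packet rate."""
    return 1e9 / pps


def lookup_share(lookup_ns, pps):
    """Fraction of the per-packet budget spent in one route lookup."""
    return lookup_ns / ns_per_packet(pps)


if __name__ == "__main__":
    budget = ns_per_packet(2_000_000)
    share = lookup_share(36.9, 2_000_000)
    print(f"budget at 2 Mpps: {budget:.0f} ns, lookup share: {share:.1%}")
```

At 2Mpps the budget is 500 ns per packet, so a 36.9 ns lookup is well under a
tenth of it. On slower hardware both numbers would need to be re-measured.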

>> Does this stuff offload the route table lookup also?
> If you have not heard, the netdev-community has worked on something
> called XDP (eXpress Data Path).  This is a new layer in the network
> stack, that basically operates at the same "layer"/level as DPDK.
> Thus, surprise we get the same performance numbers as DPDK. E.g. I can
> do 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs=14.6Mpps).
> We can actually use XDP for (software) offloading the Linux routing
> table.  There are two methods we are experimenting with:
> (1) externally monitor route changes from userspace and update BPF-maps
> to reflect this. That approach is already accepted upstream[4][5].  I'm
> measuring 9,513,746 pps per CPU with that approach.
> (2) add a bpf helper to simply call fib_table_lookup() from the XDP hook.
> These are still experimental patches (credit to David Ahern), and I've
> measured 9,350,160 pps with this approach on a single CPU.  Using more
> CPUs we hit 14.6Mpps (only used 3 CPUs in that test)
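
The idea behind approach (1) can be modelled in a few lines: something in
userspace listens for route add/delete events and mirrors them into a lookup
table that the fast path consults. The sketch below is a toy model in plain
Python, not the upstream sample and not a real BPF API. The class and method
names are hypothetical, and the route events are fed in by hand rather than
read from the kernel.

```python
import ipaddress


class RouteMirror:
    """Toy stand-in for a BPF map kept in sync with the kernel routing table."""

    def __init__(self):
        self.routes = {}   # ip_network -> next hop label

    def on_route_add(self, prefix, next_hop):
        # Called by the (hypothetical) userspace monitor on a route-add event.
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def on_route_del(self, prefix):
        # Called on a route-delete event; unknown prefixes are ignored.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, addr):
        # Longest-prefix match: the most specific covering prefix wins.
        ip = ipaddress.ip_address(addr)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


if __name__ == "__main__":
    fib = RouteMirror()
    fib.on_route_add("0.0.0.0/0", "gw")
    fib.on_route_add("10.0.0.0/8", "eth1")
    fib.on_route_add("10.1.0.0/16", "eth2")
    print(fib.lookup("10.1.2.3"))      # most specific route applies
    fib.on_route_del("10.1.0.0/16")
    print(fib.lookup("10.1.2.3"))      # falls back to the /8
```

The linear scan is only for readability. A data-path version would want a
proper longest-prefix-match structure, and the cost of keeping the mirror
consistent with the kernel table is the main trade-off against approach (2).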

Neat. Perhaps trying xdp on the itty bitty routers I usually work on
would be a win. Quad arm cores are increasingly common there.

> [4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
> [5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c

thx very much for the update.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat


Dave Täht
CEO, TekLibre, LLC
Tel: 1-669-226-2619
