CoDel AQM discussions
From: "David P. Reed" <dpreed@deepplum.com>
To: "Frantisek Borsik" <frantisek.borsik@gmail.com>
Cc: "Tom Herbert" <tom@herbertland.com>,
	stephen@networkplumber.org,
	"BeckW--- via Bloat" <bloat@lists.bufferbloat.net>,
	beckw@telekom.de, cake@lists.bufferbloat.net,
	codel@lists.bufferbloat.net, rpm@lists.bufferbloat.net
Subject: [Codel] Re: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
Date: Mon, 15 Sep 2025 19:16:24 -0400 (EDT)
Message-ID: <1757978184.931224085@apps.rackspace.com>
In-Reply-To: <CAJUtOOh81gJBcrA8C=C9AHQhuRRDy5KaoLptKZ9_y4+SxW2T_w@mail.gmail.com>



On Monday, September 15, 2025 18:26, "Frantisek Borsik" <frantisek.borsik@gmail.com> said:

> Fresh from Tom's oven:
> https://medium.com/@tom_84912/programming-a-parser-in-xdp2-is-as-easy-as-pie-8f26c8b3e704

Nice enough. I've always wondered why no one did a table-driven packet parser. Surely Wireshark did this?
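
For concreteness, here's a minimal sketch of the table-driven idea (the EtherType and IP protocol numbers are real; the dispatch structure itself is just illustrative, not Wireshark's or XDP2's actual design): each table knows how to extract its demux key from the current header, and each entry maps a key to a header length and a next table, so parsing becomes a loop over tables instead of a nest of hand-written branches.

    #include <stdint.h>
    #include <stddef.h>

    struct entry { uint32_t key; uint32_t hdrlen; int next; };

    struct table {
        size_t min_len;                       /* bytes needed to read the key */
        uint32_t (*key_of)(const uint8_t *);  /* pull the demux key from a header */
        const struct entry *entries;          /* key -> (header length, next table) */
    };

    static uint32_t eth_key(const uint8_t *h) { return (uint32_t)(h[12] << 8 | h[13]); }
    static uint32_t ip4_key(const uint8_t *h) { return h[9]; }

    static const struct entry eth_entries[] = {
        { 0x0800, 14, 1 },  /* EtherType IPv4: skip 14-byte Ethernet hdr, go to table 1 */
        { 0, 0, 0 }         /* sentinel: unknown EtherType */
    };
    static const struct entry ip4_entries[] = {
        { 6,  20, -1 },     /* IP proto TCP: skip 20-byte IPv4 hdr (no options), done */
        { 17, 20, -1 },     /* IP proto UDP */
        { 0, 0, 0 }
    };
    static const struct table tables[] = {
        { 14, eth_key, eth_entries },
        { 20, ip4_key, ip4_entries },
    };

    /* Walk the tables; return the offset past the last parsed header, or -1. */
    static int parse(const uint8_t *pkt, size_t len)
    {
        size_t off = 0;
        for (int t = 0; t >= 0; ) {
            if (off + tables[t].min_len > len)
                return -1;
            uint32_t key = tables[t].key_of(pkt + off);
            const struct entry *e = tables[t].entries;
            while (e->key && e->key != key)
                e++;
            if (!e->key)
                return -1;      /* unknown protocol at this layer */
            off += e->hdrlen;
            t = e->next;        /* -1 terminates the loop */
        }
        return (int)off;
    }

New protocols become new table rows rather than new branch code, which is what makes this style amenable to being generated from a description.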

However, the spaghetti mess in practice partially arises from skbuff (Linux) or mbuf (BSD Unix) structuring. I can testify to this because I've had to add a low-level protocol on top of Ethernet in a version of the FreeBSD kernel, and that protocol had to (for performance reasons) modify headers "in place" to generate response packets without copying or reallocating.
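
(A sketch of what "in place" means here, not the actual FreeBSD code: turn the received frame into its reply by rewriting addresses in the same buffer, so nothing is reallocated and the payload is never copied.)

    #include <stdint.h>
    #include <string.h>

    /* Swap Ethernet dst/src MACs in the receive buffer itself, then hand
     * the same buffer back to the NIC for transmit. No mbuf reallocation,
     * no payload copy. */
    static void eth_reply_in_place(uint8_t *frame)
    {
        uint8_t tmp[6];
        memcpy(tmp, frame, 6);         /* save dst MAC */
        memcpy(frame, frame + 6, 6);   /* dst <- src */
        memcpy(frame + 6, tmp, 6);     /* src <- saved dst */
        /* any protocol-specific fields after the 14-byte Ethernet header
         * would be updated in place here as well */
    }

The pain point is that skbuff/mbuf chains assume headers get prepended and stripped, not rewritten, so code like this fights the buffer abstraction.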

This is an additional reason why Linux's network stack is a mess: skbuff management is generally not compatible with the packet formats produced by NICs whose acceleration features parse or verify parts of the packets.

Admittedly, I've focused on minimizing end-to-end latency in protocols where the minimum speed is 10 Gb/sec Ethernet, and where the hardware functionality (including pipelining the packet processing in the NIC controller, possibly with FPGAs) is essential.

While header parsing is a chunk of the performance issue, specialized memory management is another.
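
By "specialized memory management" I mean things like preallocated fixed-size packet pools with O(1) recycle, instead of a general-purpose allocator on the datapath. A toy sketch of the idea (DPDK's mempool and the kernel's page pool are the industrial-strength versions):

    #include <stddef.h>
    #include <stdint.h>

    #define POOL_BUFS 1024
    #define BUF_SIZE  2048

    struct pkt_buf {
        struct pkt_buf *next;           /* free-list link */
        uint8_t data[BUF_SIZE];
    };

    static struct pkt_buf pool[POOL_BUFS];
    static struct pkt_buf *free_list;

    static void pool_init(void)
    {
        for (size_t i = 0; i < POOL_BUFS; i++) {
            pool[i].next = free_list;   /* push every buffer onto the list */
            free_list = &pool[i];
        }
    }

    /* O(1), no syscalls, no locks (assumes a single datapath thread or
     * per-core pools, as fast-path stacks typically arrange). */
    static struct pkt_buf *pkt_alloc(void)
    {
        struct pkt_buf *b = free_list;
        if (b)
            free_list = b->next;
        return b;
    }

    static void pkt_free(struct pkt_buf *b)
    {
        b->next = free_list;
        free_list = b;
    }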

Hoping to see what replaces skbuff in XDP2.

> 
> 
> All the best,
> 
> Frank
> 
> Frantisek (Frank) Borsik
> 
> 
> In loving memory of Dave Täht: 1965-2025
> 
> https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
> 
> 
> https://www.linkedin.com/in/frantisekborsik
> 
> Signal, Telegram, WhatsApp: +421919416714
> 
> iMessage, mobile: +420775230885
> 
> Skype: casioa5302ca
> 
> frantisek.borsik@gmail.com
> 
> 
> On Mon, Sep 15, 2025 at 8:35 PM Tom Herbert <tom@herbertland.com> wrote:
> 
>> On Mon, Sep 15, 2025 at 11:07 AM Frantisek Borsik
>> <frantisek.borsik@gmail.com> wrote:
>> >
>> >
>> > "There were a few NIC's that offloaded eBPF but they never really went
>> mainstream."
>> >
>> > And even then, they were doing only 40 Gbps, like https://netronome.com
>> and didn't even support full eBPF...
>> >
>> > They only support a pretty small subset of eBPF (in particular they
>> don't support the LPM map type, which was our biggest performance pain
>> point), and have a pretty cool user-replaceable firmware system. They also
>> don't have the higher speeds - above 40 Gbps - where the offloading would
>> be most useful."
>>
>> Yeah, the attempts at offloading eBPF were doomed to fail. It's a
>> restricted model, lacks parallelism, doesn't support inline
>> accelerators, and requires the eBPF VM, which makes it a non-starter.
>> DPDK would fail as well. The kernel/host environment and hardware
>> environments are quite different. If we try to force the hardware to
>> look like the host to make eBPF or DPDK portable, then we'll lose the
>> performance advantages of running in the hardware. We need a model
>> that allows the software to adapt to the HW, not the other way around
>> (of course, in a perfect world we'd do software/hardware codesign from
>> the get-go).
>>
>> >
>> > Btw, Tom will be at FLOSS Weekly tomorrow (Tuesday), 12:20 EDT / 11:20
>> CDT / 10:20 MDT / 9:20 PDT
>>
>> Can't wait!
>>
>> >
>> > https://www.youtube.com/live/OBW5twvmHOI
>> >
>> >
>> > All the best,
>> >
>> > Frank
>> >
>> > Frantisek (Frank) Borsik
>> >
>> >
>> > In loving memory of Dave Täht: 1965-2025
>> >
>> > https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
>> >
>> >
>> > https://www.linkedin.com/in/frantisekborsik
>> >
>> > Signal, Telegram, WhatsApp: +421919416714
>> >
>> > iMessage, mobile: +420775230885
>> >
>> > Skype: casioa5302ca
>> >
>> > frantisek.borsik@gmail.com
>> >
>> >
>> >
>> > On Mon, Sep 15, 2025 at 5:16 PM Stephen Hemminger <
>> stephen@networkplumber.org> wrote:
>> >>
>> >> On Mon, 15 Sep 2025 08:39:48 +0000
>> >> BeckW--- via Bloat <bloat@lists.bufferbloat.net> wrote:
>> >>
>> >> > Programming networking hardware is a bit like programming 8-bit
>> computers in the 1980s: the hardware is often too limited and varied to
>> support useful abstractions. This is also true for CPU-based networking
>> once you get into the >10 Gbps realm, when caching and pipelining
>> architectures become relevant. Writing a network protocol compiler that
>> produces efficient code for different NICs and different CPUs is a daunting
>> task. And unlike with 8-bit computers, there are no simple metrics ('you
>> need at least 32 KB RAM to run this code' vs 'this NIC supports 4k queues
>> with PIE, CoDel', 'this CPU has 20 MB of Intel Smart Cache').
>> >>
>> >> The Linux kernel still lacks an easy way to set up many features in
>> >> SmartNICs. DPDK has rte_flow, which allows direct access to hardware
>> >> flow processing. But DPDK lacks any reasonable form of shaper control.
>> >>
>> >> > eBPF is very close to what was described in this 1995 exokernel paper
>> (https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf).
>> The idea of the exokernel was to have easily loadable, verified code in the
>> kernel -- e.g. the security-critical task of assigning a packet to a
>> session of a user -- and leave the rest of the protocol -- e.g. TCP
>> retransmissions -- to user space. AFAIK few people use eBPF like this, but
>> it should be possible.
>> >> >
>> >> > eBPF manages the abstraction part well, but sacrifices a lot of
>> performance -- e.g. it lacks the aggressive batching that vpp/fd.io does.
>> With DPDK, you often find out that your NIC's hardware or driver doesn't
>> support the function that you hoped to use, and you end up optimizing for
>> particular hardware. Even if driver and hardware support a functionality,
>> it may very well be that hardware resources are too limited for your
>> particular use case. The abstraction is there, but your code is still
>> hardware specific.
>> >>
>> >> There were a few NICs that offloaded eBPF but they never really went
>> mainstream.
>> >>
>> >
>> >
>> >>
>> >> > -----Original Message-----
>> >> > From: David P. Reed <dpreed@deepplum.com>
>> >> > Sent: Saturday, 13 September 2025 22:33
>> >> > To: Tom Herbert <tom@herbertland.com>
>> >> > Cc: Frantisek Borsik <frantisek.borsik@gmail.com>; Cake List <
>> cake@lists.bufferbloat.net>; codel@lists.bufferbloat.net; bloat <
>> bloat@lists.bufferbloat.net>; Jeremy Austin via Rpm <
>> rpm@lists.bufferbloat.net>
>> >> > Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom
>> Herbert (almost to the date, 10 years after XDP was released)
>> >> >
>> >> >
>> >> > Tom -
>> >> >
>> >> > An architecture-independent network framework, free of the OS
>> kernel's peculiarities, seems within reach (though a fair bit of work), and
>> I think it would be a GOOD THING indeed. IMHO the Linux networking stack in
>> the kernel is a horrific mess, and it doesn't have to be.
>> >> >
>> >> > The reason it doesn't have to be is that there should be no reason it
>> cannot run in ring3/userland, just like DPDK. And it should be built using
>> "real-time" userland programming techniques (avoiding the generic Linux
>> scheduler). The ONLY reason for involving the scheduler would be that there
>> aren't enough cores. Linux was designed to be a uniprocessor Unix, and that
>> just is no longer true at all. With hyperthreading, too, one need never
>> abandon a processor's context in userspace to run some "userland"
>> application.
>> >> >
>> >> > This would rip a huge amount of kernel code out of the kernel (at
>> least 50%, and probably more). The security issues of all those 3rd-party
>> network drivers would go away.
>> >> >
>> >> > And the performance would be much higher for networking. (Running in
>> ring 3, especially if you don't do system calls, carries no performance
>> penalty, and interprocessor communication using shared memory is much lower
>> latency than Linux IPC or mutexes.)
>> >> >
>> >> > I like the idea of a compilation-based network stack, at a slightly
>> higher level than C. eBPF is NOT what I have in mind - it's an interpreter
>> with high overhead. The language should support high-performance
>> co-routining - shared memory, ideally. I don't think GC is a good thing.
>> Rust might be a good starting point because its memory management is safe.
>> >> > To me, some of what the base of DPDK is like is good stuff. However,
>> it isn't architecturally neutral.
>> >> >
>> >> > To me, the network stack should not be entangled with interrupt
>> handling at all. "Polling" is far more performant under load. The only use
>> for interrupts is when the network stack is completely idle. That would be,
>> in userland, a "wait for interrupt" call (not a poll) - ideally, on recent
>> Intel machines, a userspace version of MONITOR/MWAIT.
>> >> >
>> >> > Now I know that Linus and his crew are really NOT gonna like this.
>> Linus is still thinking like MINIX, a uniprocessor time-sharing system with
>> rich OS functions in the kernel and doing "file" reads and writes to
>> communicate with the kernel state. But it is a much more modern way to
>> think of real-time IO in a modern operating system. (Windows and macOS are
>> also Unix-like, uniprocessor monolithic kernel designs).
>> >> >
>> >> > So, if XDP2 got away from the Linux kernel, it could be great.
>> >> > BTW, io_uring, etc. are half-measures. They address getting away from
>> interrupts toward polling, but they still make the mistake of keeping huge
>> drivers in the kernel.
>> >>
>> >> DPDK already supports use of XDP as a way to do userspace networking.
>> >> It is a good generic way to get packets in/out, but the dedicated
>> >> userspace drivers allow for more access to the hardware. The XDP
>> >> abstraction gets in the way of little things like programming VLANs,
>> >> etc.
>> >>
>> >> The tradeoff is that userspace networking works great for
>> >> infrastructure (routers, switches, firewalls, etc.), but userspace
>> >> networking for application-facing network stacks is hard to do, and it
>> >> loses the isolation that the kernel provides.
>> >>
>> >> > > I think it is interesting as a concept. A project I am advising has
>> >> > > been using DPDK very effectively to get rid of the huge path and
>> >> > > locking delays in the current Linux network stack. XDP2 could be
>> >> > > supported in a ring3 (user) address space, achieving a similar
>> >> > > result.
>> >> > Hi David,
>> >> > The idea is you could write the code in XDP2 and it would be compiled
>> >> > to DPDK or eBPF, and the compiler would handle the optimizations.
>> >> > > But I don't think XDP2 is going that direction - so it may be stuck
>> >> > > in the mess of kernel-space networking. Adding eBPF only has made
>> >> > > this more of a mess, by the way (and adding a new "compiler" that
>> >> > > needs to be verified as safe for the kernel).
>> >> > Think of XDP2 as the generalization of XDP to go beyond just the
>> kernel. The idea is that the user writes their datapath code once and
>> compiles it to run on whatever targets they have -- DPDK, P4, other
>> programmable hardware, and yes, XDP/eBPF. It's really not limited to kernel
>> networking.
>> >> > As for the name XDP2: when we created XDP, the eXpress Data Path, my
>> vision was that it would be implementation agnostic. eBPF was the first
>> instantiation for practicality, but now, ten years later, I think we can
>> realize the initial vision.
>> >> > Tom
>> >>
>> >>
>> >> At this point, different network architectures get focused on different
>> >> use cases.
>> >> The days of the one-size-fits-all networking of BSD Unix are dead.
>>
> 
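
To make the polling point from my earlier message (quoted above) concrete: a stripped-down RX loop that spins while traffic is flowing and, only when the ring goes idle, parks the core with user-mode MONITOR/MWAIT. The descriptor-ring layout here is hypothetical; the _umonitor/_umwait intrinsics are the real WAITPKG ones (compile with -mwaitpkg on GCC/Clang).

    #include <stdint.h>
    #include <x86intrin.h>   /* _umonitor/_umwait/__rdtsc; needs -mwaitpkg */

    /* Hypothetical RX ring: the NIC sets 'ready' in the descriptor via DMA. */
    struct rx_desc {
        volatile uint32_t ready;
        uint64_t addr;
        uint32_t len;
    };

    static void rx_loop(struct rx_desc *ring, unsigned n,
                        void (*handle)(struct rx_desc *))
    {
        unsigned head = 0;
        for (;;) {
            struct rx_desc *d = &ring[head];
            if (d->ready) {                      /* hot path: pure polling */
                handle(d);
                d->ready = 0;
                head = (head + 1) % n;
                continue;
            }
            /* Idle: arm a monitor on the descriptor's cache line, re-check
             * to close the race, then sleep until the NIC's DMA write (or a
             * TSC deadline) wakes the core -- no interrupt, no scheduler. */
            _umonitor((void *)d);
            if (!d->ready)
                _umwait(0, __rdtsc() + 100000);  /* short, bounded nap */
        }
    }

NAPI makes a similar interrupts-only-when-idle tradeoff inside the kernel; the point is that userland can do it too.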




Thread overview: 20+ messages
2025-09-09 10:32 [Codel] " Frantisek Borsik
2025-09-09 20:25 ` [Codel] Re: [Cake] " David P. Reed
2025-09-09 21:02   ` Frantisek Borsik
2025-09-09 21:36     ` [Codel] Re: [Cake] " Tom Herbert
2025-09-10  8:54       ` [Codel] Re: [Bloat] " BeckW
2025-09-10 13:59         ` Tom Herbert
2025-09-10 14:06           ` Tom Herbert
2025-09-13 18:33       ` [Codel] " David P. Reed
2025-09-13 20:58         ` Tom Herbert
2025-09-14 18:00           ` David P. Reed
2025-09-14 18:38             ` Tom Herbert
2025-09-15  8:39         ` [Codel] Re: [Bloat] " BeckW
2025-09-15 15:16           ` Stephen Hemminger
2025-09-15 18:07             ` Frantisek Borsik
2025-09-15 18:35               ` Tom Herbert
2025-09-15 22:26                 ` Frantisek Borsik
2025-09-15 23:16                   ` David P. Reed [this message]
2025-09-16  0:05                     ` Tom Herbert
     [not found]       ` <FR2PPFEFD18174CA00474D0DC8DBDA3EE00DC0EA@FR2PPFEFD18174C.DEUP281.PROD.OUTLOOK.COM>
     [not found]         ` <FR2PPFEFD18174CA00474D0DC8DBDA3EE00DC0EA@FR2PPFEFD18174C.DEUP281.PROD.OUTLOOK.COM>
2025-09-13 18:35           ` David P. Reed
2025-09-15 23:16   ` [Codel] Re: [Rpm] Re: [Cake] " Robert McMahon
