From: Stephen Hemminger <stephen@networkplumber.org>
To: BeckW--- via Bloat <bloat@lists.bufferbloat.net>
Cc: BeckW@telekom.de, <dpreed@deepplum.com>, <tom@herbertland.com>,
<frantisek.borsik@gmail.com>, <cake@lists.bufferbloat.net>,
<codel@lists.bufferbloat.net>, <rpm@lists.bufferbloat.net>
Subject: [Codel] Re: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
Date: Mon, 15 Sep 2025 08:16:37 -0700
Message-ID: <20250915081637.2cd0d07c@hermes.local>
In-Reply-To: <FR2PPFEFD18174C4925B861C3070972199CDC15A@FR2PPFEFD18174C.DEUP281.PROD.OUTLOOK.COM>
On Mon, 15 Sep 2025 08:39:48 +0000
BeckW--- via Bloat <bloat@lists.bufferbloat.net> wrote:
> Programming networking hardware is a bit like programming 8-bit computers in the 1980s: the hardware is often too limited and varied to support useful abstractions. This is also true for CPU-based networking once you get into the >10 Gbps realm, when caching and pipelining architectures become relevant. Writing a network protocol compiler that produces efficient code for different NICs and different CPUs is a daunting task. And unlike with 8-bit computers, there are no simple metrics ('you need at least 32 KB of RAM to run this code' vs 'this NIC supports 4k queues with PIE, Codel', 'this CPU has 20 MB of Intel SmartCache').
The Linux kernel still lacks an easy way to set up many features in SmartNICs. DPDK has rte_flow, which allows direct
access to hardware flow processing, but DPDK lacks any reasonable form of shaper control.
> eBPF is very close to what was described in this 1995 exokernel paper (https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf). The idea of the exokernel was to have easily loadable, verified code in the kernel -- e.g. the security-critical task of assigning a packet to a session of a user -- and leave the rest of the protocol -- e.g. TCP retransmissions -- to user space. AFAIK few people use eBPF like this, but it should be possible.
>
> eBPF manages the abstraction part well, but sacrifices a lot of performance -- e.g. it lacks the aggressive batching that VPP / fd.io does. With DPDK, you often find out that your NIC's hardware or driver doesn't support the function that you hoped to use, and you end up optimizing for a particular piece of hardware. Even if driver and hardware support a functionality, it may very well be that hardware resources are too limited for your particular use case. The abstraction is there, but your code is still hardware specific.
There were a few NICs that offloaded eBPF, but they never really went mainstream.
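To make the batching point quoted above concrete, here is a minimal sketch of why vector processing amortizes fixed overhead. It is a toy cost model, not code from VPP, DPDK, or eBPF: the constants FIXED_COST and PER_PACKET_COST and all function names are assumptions chosen for illustration.

```python
from typing import Iterator, List

# Assumed, illustrative cost units (not measurements of any real stack).
FIXED_COST = 100       # paid once per invocation of the processing stage
PER_PACKET_COST = 10   # paid for every packet handled


def cost_per_packet(n_packets: int) -> int:
    """Total cost when each packet triggers its own invocation."""
    return n_packets * (FIXED_COST + PER_PACKET_COST)


def cost_batched(n_packets: int, batch_size: int) -> int:
    """Total cost when packets are handled in vectors of up to batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    n_batches = -(-n_packets // batch_size)  # ceiling division
    return n_batches * FIXED_COST + n_packets * PER_PACKET_COST


def batches(packets: List[bytes], batch_size: int) -> Iterator[List[bytes]]:
    """Split a receive burst into consecutive vectors of up to batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(packets), batch_size):
        yield packets[start:start + batch_size]
```

Under these assumed constants, 256 packets cost 28160 units one at a time and 3360 units in vectors of 32. The gap shrinks as the per-packet work grows relative to the fixed overhead.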
> -----Original Message-----
> From: David P. Reed <dpreed@deepplum.com>
> Sent: Saturday, 13 September 2025 22:33
> To: Tom Herbert <tom@herbertland.com>
> Cc: Frantisek Borsik <frantisek.borsik@gmail.com>; Cake List <cake@lists.bufferbloat.net>; codel@lists.bufferbloat.net; bloat <bloat@lists.bufferbloat.net>; Jeremy Austin via Rpm <rpm@lists.bufferbloat.net>
> Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
>
>
> Tom -
>
> An architecture-independent network framework independent of the OS kernel's peculiarities seems within reach (though a fair bit of work), and I think it would be a GOOD THING indeed. IMHO the Linux networking stack in the kernel is a horrific mess, and it doesn't have to be.
>
> The reason it doesn't have to be is that there should be no reason it cannot run in ring3/userland, just like DPDK. And it should be built using "real-time" userland programming techniques (avoiding the generic Linux scheduler). The ONLY reason for involving the scheduler would be because there aren't enough cores. Linux was designed to be a uniprocessor Unix, and that just is no longer true at all. With hyperthreading, too, one need never abandon a processor's context in userspace to run some "userland" application.
>
> This would rip a huge amount of code out of the kernel (at least 50%, and probably more). The security issues of all those 3rd party network drivers would go away.
>
> And the performance would be much higher for networking (running in ring 3, especially if you don't do system calls, is no performance penalty, and interprocessor communication using shared memory has much lower latency than Linux IPC or mutexes).
>
> I like the idea of a compilation-based network stack, at a slightly higher level than C. eBPF is NOT what I have in mind - it's an interpreter with high overhead. The language should support high-performance co-routining - shared memory, ideally. I don't think GC is a good thing. Rust might be a good starting point because its memory management is safe.
> To me, some of what the base of DPDK is like is good stuff. However, it isn't architecturally neutral.
>
> To me, the network stack should not be entangled with interrupt handling at all. "Polling" is far more performant under load. The only use for interrupts is when the network stack is completely idle. That would be, in userland, a "wait for interrupt" call (not a poll). Ideally, on recent Intel machines, a userspace version of MONITOR/MWAIT.
>
> Now I know that Linus and his crew are really NOT gonna like this. Linus is still thinking like MINIX, a uniprocessor time-sharing system with rich OS functions in the kernel and doing "file" reads and writes to communicate with the kernel state. But it is a much more modern way to think of real-time IO in a modern operating system. (Windows and macOS are also Unix-like, uniprocessor monolithic kernel designs).
>
> So, if XDP2 got away from the Linux kernel, it could be great.
> BTW, io_uring, etc. are half-measures. They address getting away from interrupts toward polling, but they still make the mistake of keeping huge drivers in the kernel.
DPDK already supports the use of XDP as a way to do userspace networking.
It is a good generic way to get packets in and out, but the dedicated userspace drivers allow
for more access to hardware. The XDP abstraction gets in the way of little things like programming
VLANs, etc.
The tradeoff is that userspace networking works great for infrastructure (routers, switches, firewalls, etc.),
but userspace networking for network stacks serving applications is hard to do, and loses the isolation
that the kernel provides.
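The "poll under load, wait only when idle" receive pattern David describes in the quoted text above can be sketched as follows. This is a toy model under stated assumptions: a Python queue.Queue stands in for a NIC receive ring, a blocking get() stands in for a "wait for interrupt" call, and receive_loop, spin_limit, and the None stop sentinel are hypothetical names, not part of DPDK, XDP, or XDP2.

```python
import queue
from typing import Callable, Dict, Optional


def receive_loop(
    rx: "queue.Queue[Optional[bytes]]",
    handle: Callable[[bytes], None],
    spin_limit: int = 1000,
) -> Dict[str, int]:
    """Busy-poll rx; block only after spin_limit consecutive empty polls.

    A None item is an assumed stop sentinel. The returned counters record how
    many items (sentinel included) were obtained by polling and by blocking.
    """
    stats = {"polled": 0, "blocked": 0}
    idle_spins = 0
    while True:
        try:
            pkt = rx.get_nowait()        # non-blocking poll of the "ring"
            stats["polled"] += 1
        except queue.Empty:
            idle_spins += 1
            if idle_spins < spin_limit:
                continue                 # keep spinning while recently busy
            pkt = rx.get()               # idle: sleep until something arrives
            stats["blocked"] += 1
        idle_spins = 0
        if pkt is None:
            return stats
        handle(pkt)
```

Under sustained load the loop never reaches the blocking call, so no wakeup latency is paid. The choice of spin_limit trades wasted CPU while idle against the cost of going to sleep and waking up.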
> > I think it is interesting as a concept. A project I am advising has been using DPDK very effectively to get rid of the huge path and locking delays in the current Linux network stack. XDP2 could be supported in a ring3 (user) address space, achieving a similar result.
> Hi David,
> The idea is you could write the code in XDP2 and it would be compiled to DPDK or eBPF and the compiler would handle the optimizations.
> >
> >
> >
> > But I don't think XDP2 is going that direction - so it may be stuck in the mess of kernel space networking. Adding eBPF only has made this more of a mess, by the way (and adding a new "compiler" that needs to be verified as safe for the kernel).
> Think of XDP2 as the generalization of XDP to go beyond just the kernel. The idea is that the user writes their datapath code once and they compile it to run in whatever targets they have-- DPDK, P4, other programmable hardware, and yes XDP/eBPF. It's really not limited to kernel networking.
> As for the name XDP2, when we created XDP, eXpress DataPath, my vision was that it would be implementation agnostic. eBPF was the first instantiation for practicality, but now ten years later I think we can realize the initial vision.
> Tom
At this point, different network architectures are focused on different use cases.
The days of the one-size-fits-all networking of BSD Unix are over.