CoDel AQM discussions
From: Tom Herbert <tom@herbertland.com>
To: "David P. Reed" <dpreed@deepplum.com>
Cc: Frantisek Borsik <frantisek.borsik@gmail.com>,
	Cake List <cake@lists.bufferbloat.net>,
	codel@lists.bufferbloat.net, bloat <bloat@lists.bufferbloat.net>,
	Jeremy Austin via Rpm <rpm@lists.bufferbloat.net>
Subject: [Codel] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
Date: Sat, 13 Sep 2025 22:58:53
Message-ID: <CALx6S34SYbYhNVHgGJP6+aGegiABy3KM4Ugx3yTLiye3hbAtrQ@mail.gmail.com>
In-Reply-To: <1757795591.523513612@apps.rackspace.com>

On Sat, Sep 13, 2025 at 1:33 PM David P. Reed <dpreed@deepplum.com> wrote:
>
> Tom -
>
>
>
> An architecture-independent network framework independent of the OS
> kernel's peculiarities seems within reach (though a fair bit of work), and
> I think it would be a GOOD THING indeed. IMHO the Linux networking stack in
> the kernel is a horrific mess, and it doesn't have to be.

Hi David,

Agreed. But I want to encompass programmable HW in the solution scope.

>
>
>
> The reason it doesn't have to be is that there should be no reason it
> cannot run in ring3/userland, just like DPDK. And it should be built using
> "real-time" userland programming techniques (avoiding the generic Linux
> scheduler). The ONLY reason for involving the scheduler would be because
> there aren't enough cores. Linux was designed to be a uniprocessor Unix,
> and that just is no longer true at all. With hyperthreading, too, one need
> never abandon a processor's context in userspace to run some "userland"
> application.

XDP/eBPF gets us most of the way to that. I like the idea that eBPF is a
modern day take on micro kernels.

>
> This would rip a huge amount of kernel code out of the kernel (at least
> 50%, and probably more). The security issues of all those 3rd party network
> drivers would go away.

That's exactly the direction I believe the kernel should go. Rip out kernel
code and replace it with eBPF. The result is a malleable kernel and pieces
of it become sub-programs that can be independently run in userspace or in
programmable hardware. That's also the segue to finally solving the kernel
offloads mess that we've had for twenty years (with a couple of exceptions,
all the efforts for kernel offload have been flops).

>
> And the performance would be much higher for networking. (Running in
> ring 3, especially if you don't do system calls, is no performance penalty,
> and interprocessor communication using shared memory is much lower latency
> than Linux IPC or mutexes.)

Yes, performance improves when code lives directly on top of the queue.
It's even higher performance running in HW.
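
For readers following the shared-memory point, here is a minimal sketch
(not code from XDP2, DPDK, or this thread) of the single-producer /
single-consumer ring that such designs typically place between a queue and
the code consuming it. The class name SpscRing and its methods are
hypothetical. Python is used only to show the index discipline; a real
implementation in C or Rust would need atomic loads and stores with
acquire/release ordering on the two indices.

```python
class SpscRing:
    """Fixed-size ring: one producer calls push(), one consumer calls pop().

    Hypothetical illustration only. The producer is the sole writer of
    _tail and the consumer is the sole writer of _head, which is what lets
    native implementations avoid locks.
    """

    def __init__(self, capacity):
        # A power-of-two capacity lets us wrap with a mask instead of modulo.
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # count of items consumed so far
        self._tail = 0  # count of items produced so far

    def push(self, item):
        """Producer side: return False instead of blocking when full."""
        if self._tail - self._head == len(self._slots):
            return False
        self._slots[self._tail & self._mask] = item
        self._tail += 1  # publish only after the slot is written
        return True

    def pop(self):
        """Consumer side: return None instead of blocking when empty."""
        if self._head == self._tail:
            return None
        index = self._head & self._mask
        item = self._slots[index]
        self._slots[index] = None  # drop the reference so it can be freed
        self._head += 1
        return item
```

Neither side ever waits on the other: a full ring makes push() return
False and an empty ring makes pop() return None, so the caller decides
whether to drop, retry, or back off.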
>
>
>
> I like the idea of a compilation based network stack, at a slightly
> higher level than C. eBPF is NOT what I have in mind - it's an interpreter
> with high overhead. The language should support high-performance
> co-routining - shared memory, ideally. I don't think GC is a good thing.
> Rust might be a good starting point because its memory management is safe.

IMO, we should let the user pick the language they want to use. It's
feasible as long as the programming model is supported.

> To me, some of what the base of DPDK is like is good stuff. However, it
> isn't architecturally neutral.

Yes, there are some good things in DPDK to adopt. Some nice things from P4 as
well. XDP2 unifies them and takes the best ideas from each.

>
> To me, the network stack should not be entangled with interrupt handling
> at all. "polling" is far more performant under load. The only use for
> interrupts is when the network stack is completely idle. That would be, in
> userland, a "wait for interrupt" call (not a poll; ideally, on recent
> Intel machines, a userspace version of MONITOR/MWAIT).
>

That's part of the reason why high performance networking in user space is
so hard. We have spent inordinate amounts of time worrying about isolation
or APIs to HW. All that goes away when we run the stack on bare metal (what
we do in CPU-in-the-datapath).
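
To make the polling-versus-interrupt distinction concrete, here is a
minimal sketch (not from XDP2 or any driver) of a receive loop that
busy-polls while traffic is flowing and falls back to a blocking wait only
after a run of empty polls. The function name drain and the parameters
spin_budget and idle_timeout are hypothetical, and a blocking
queue.Queue.get() stands in for the "wait for interrupt" call.

```python
import queue


def drain(rx, handle, spin_budget=1000, idle_timeout=0.01):
    """Process items from rx until a None sentinel arrives.

    Polls without blocking while the queue is busy. After spin_budget
    consecutive empty polls it treats the queue as idle and switches to a
    blocking wait, which stands in for sleeping until an interrupt.
    Returns (polled, woken): items taken by polling vs. by blocking wait.
    """
    polled = woken = 0
    empty_polls = 0
    while True:
        idle = empty_polls >= spin_budget
        try:
            if idle:
                item = rx.get(timeout=idle_timeout)  # idle: sleep until data
            else:
                item = rx.get_nowait()  # busy: poll, never block
        except queue.Empty:
            empty_polls += 1
            continue
        empty_polls = 0  # traffic seen, so go back to polling
        if item is None:
            return polled, woken
        handle(item)
        if idle:
            woken += 1
        else:
            polled += 1
```

For example, with a backlog already queued, every item is picked up by the
polling path and the blocking path is never taken:

```python
rx = queue.Queue()
for pkt in (b"a", b"b", b"c", None):
    rx.put(pkt)
seen = []
counts = drain(rx, seen.append)  # counts == (3, 0)
```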

> Now I know that Linus and his crew are really NOT gonna like this. Linus
> is still thinking like MINIX, a uniprocessor time-sharing system with rich
> OS functions in the kernel and doing "file" reads and writes to communicate
> with the kernel state. But it is a much more modern way to think of
> real-time IO in a modern operating system. (Windows and macOS are also
> Unix-like, uniprocessor monolithic kernel designs).

Just hide everything behind eBPF when in the kernel and they'll be happy.
Outside of the kernel they won't care.

>
> So, if XDP2 got away from the Linux kernel, it could be great.

Yep, we need to go beyond the kernel.

Tom


> BTW, io_uring, etc. are half-measures. They address getting away from
> interrupts toward polling, but they still make the mistake of keeping huge
> drivers in the kernel.
>
>
>
>
>
> On Tuesday, September 9, 2025 17:36, "Tom Herbert" <tom@herbertland.com> said:
>
>
>
> On Tue, Sep 9, 2025, 5:03 PM Frantisek Borsik <frantisek.borsik@gmail.com> wrote:
>>
>> Thanks a lot, David.
>>
>> I have asked Tom if he wants to join us and he should be here to chat with
>> us now.
>>
>> All the best,
>>
>> Frank
>>
>> Frantisek (Frank) Borsik
>>
>>
>> *In loving memory of Dave Täht: *1965-2025
>>
>> https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
>>
>>
>> https://www.linkedin.com/in/frantisekborsik
>>
>> Signal, Telegram, WhatsApp: +421919416714
>>
>> iMessage, mobile: +420775230885
>>
>> Skype: casioa5302ca
>>
>> frantisek.borsik@gmail.com
>>
>>
>> On Tue, Sep 9, 2025 at 10:25 PM David P. Reed <dpreed@deepplum.com> wrote:
>>
>> > Hi Frank -
>> >
>> >
>> >
>> > I think it is interesting as a concept. A project I am advising has been
>> > using DPDK very effectively to get rid of the huge path and locking delays
>> > in the current Linux network stack. XDP2 could be supported in a ring3
>> > (user) address space, achieving a similar result.
>
> Hi David,
> The idea is you could write the code in XDP2 and it would be compiled to
> DPDK or eBPF and the compiler would handle the optimizations.
>
>>
>> >
>> >
>> >
>> > But I don't think XDP2 is going that direction - so it may be stuck in
>> > the mess of kernel space networking. Adding eBPF only has made this more of
>> > a mess, by the way (and adding a new "compiler" that needs to be verified
>> > as safe for the kernel).
>
> Think of XDP2 as the generalization of XDP to go beyond just the kernel.
> The idea is that the user writes their datapath code once and they compile
> it to run in whatever targets they have-- DPDK, P4, other programmable
> hardware, and yes XDP/eBPF. It's really not limited to kernel networking.
> As for the name XDP2, when we created XDP, eXpress DataPath, my vision
> was that it would be implementation agnostic. eBPF was the first
> instantiation for practicality, but now ten years later I think we can
> realize the initial vision.
> Tom
>>
>> >
>> > I will be watching how this evolves.
>> >
>> >
>> >
>> > David
>> >
>> >
>> >
>> > On Tuesday, September 9, 2025 06:32, "Frantisek Borsik" <
>> > frantisek.borsik@gmail.com> said:
>> >
>> > > Hello to all,
>> > >
>> > > Looks interesting:
>> > >
>> > > https://medium.com/@tom_84912/xdp2-this-changes-everything-at-least-for-ai-ml-infrastructure-850c1ba82771
>> > >
>> > >
>> > > All the best,
>> > >
>> > > Frank
>> > > _______________________________________________
>> > > Cake mailing list -- cake@lists.bufferbloat.net
>> > > To unsubscribe send an email to cake-leave@lists.bufferbloat.net
>> > >
>> >
