From: Tom Herbert <tom@herbertland.com>
To: "David P. Reed" <dpreed@deepplum.com>
Cc: Frantisek Borsik <frantisek.borsik@gmail.com>,
Stephen Hemminger <stephen@networkplumber.org>,
BeckW--- via Bloat <bloat@lists.bufferbloat.net>,
beckw@telekom.de, Cake List <cake@lists.bufferbloat.net>,
codel@lists.bufferbloat.net,
Jeremy Austin via Rpm <rpm@lists.bufferbloat.net>
Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
Date: Mon, 15 Sep 2025 17:05:11 -0700 [thread overview]
Message-ID: <CALx6S36YW0pCDCeHT_mYkKcnnqY4ZdmYOC9+j0r1M5r6=w2Pjg@mail.gmail.com> (raw)
In-Reply-To: <1757978184.931224085@apps.rackspace.com>
On Mon, Sep 15, 2025, 4:16 PM David P. Reed <dpreed@deepplum.com> wrote:
>
>
> On Monday, September 15, 2025 18:26, "Frantisek Borsik" <
> frantisek.borsik@gmail.com> said:
>
> > Fresh from Tom's oven:
> >
> https://medium.com/@tom_84912/programming-a-parser-in-xdp2-is-as-easy-as-pie-8f26c8b3e704
>
> Nice enough. I've always wondered why no one did a table driven packet
> parser. Surely wireshark did this?
>
> However, the spaghetti mess in practice partially arises from skbuff
> (Linux) or mbuf (unix) structuring. I can testify to this because I've had
> to add a low level protocol on top of Ethernet in a version of the FreeBSD
> kernel, and that protocol had to (for performance reasons) modify headers
> "in place" to generate response packets without copying or reallocating.
>
> This is an additional reason why Linux's network stack is a mess - skbuff
> management is generally not compatible with packet formats in NICs with
> acceleration features that parse or verify parts of the packets.
>
> Admittedly, I've focused on minimizing end-to-end latency in protocols
> where the minimum speed is 10 Gb/sec Ethernet, where the hardware
> functionality (including pipelining the packet processing in the NIC
> controller, possibly with FPGAs, are essential.
>
> While the header parsing is a chunk of the performance issue, specialized
> memory management is also.
>
> Hoping to see what replaces skbuff in EDK2.
>
PVbufs :-)
https://medium.com/@tom_84912/pvbufs-hardware-friendly-metadata-structure-for-scatter-gather-6fce8d42f75f
> >
> >
> > All the best,
> >
> > Frank
> >
> > Frantisek (Frank) Borsik
> >
> >
> > *In loving memory of Dave Täht: *1965-2025
> >
> > https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
> >
> >
> > https://www.linkedin.com/in/frantisekborsik
> >
> > Signal, Telegram, WhatsApp: +421919416714
> >
> > iMessage, mobile: +420775230885
> >
> > Skype: casioa5302ca
> >
> > frantisek.borsik@gmail.com
> >
> >
> > On Mon, Sep 15, 2025 at 8:35 PM Tom Herbert <tom@herbertland.com> wrote:
> >
> >> On Mon, Sep 15, 2025 at 11:07 AM Frantisek Borsik
> >> <frantisek.borsik@gmail.com> wrote:
> >> >
> >> >
> >> > "There were a few NIC's that offloaded eBPF but they never really went
> >> mainstream."
> >> >
> >> > And even then, they were doing only 40 Gbps, like
> https://netronome.com
> >> and didn't even supported full eBPF...
> >> >
> >> > They only support a pretty small subset of eBPF (in particular they
> >> don't support the LPM map type, which was our biggest performance pain
> >> point), and have a pretty cool user replaceable firmware system. They
> also
> >> don't have the higher speeds - above 40 Gbps - where the offloading
> would
> >> be most useful."
> >>
> >> Yeah, the attempts at offloading eBPF were doomed to fail. It's a
> >> restricted model, lacks parallelism, doesn't support inline
> >> accelerators, and requires the eBPF VM to make it no-staters. DPDK
> >> would fail as well. The kernel/host environment and hardware
> >> environments are quite different. If we try to force the hardware to
> >> look like the host to make eBPF or DPDK portable then we'll lose the
> >> performance advantages of running in the hardware. We need a model
> >> that allows the software to adapt to HW, not the other way around (of
> >> course, in a perfect world we'd do software/hardware codesign from the
> >> get-go).
> >>
> >> >
> >> > Btw, Tom will be at FLOSS Weekly tomorrow (Tuesday), 12:20 EDT / 11:20
> >> CDT / 10:20 MDT / 9:20 PDT
> >>
> >> Can't wait!
> >>
> >> >
> >> > https://www.youtube.com/live/OBW5twvmHOI
> >> >
> >> >
> >> > All the best,
> >> >
> >> > Frank
> >> >
> >> > Frantisek (Frank) Borsik
> >> >
> >> >
> >> > In loving memory of Dave Täht: 1965-2025
> >> >
> >> > https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
> >> >
> >> >
> >> > https://www.linkedin.com/in/frantisekborsik
> >> >
> >> > Signal, Telegram, WhatsApp: +421919416714
> >> >
> >> > iMessage, mobile: +420775230885
> >> >
> >> > Skype: casioa5302ca
> >> >
> >> > frantisek.borsik@gmail.com
> >> >
> >> >
> >> >
> >> > On Mon, Sep 15, 2025 at 5:16 PM Stephen Hemminger <
> >> stephen@networkplumber.org> wrote:
> >> >>
> >> >> On Mon, 15 Sep 2025 08:39:48 +0000
> >> >> BeckW--- via Bloat <bloat@lists.bufferbloat.net> wrote:
> >> >>
> >> >> > Programming networking hardware is a bit like programming 8 bit
> >> computers int the 1980s, the hardware is often too limited and varied to
> >> support useful abstractions. This is also true for CPU-based networking
> >> once you get into the >10 Gbps realm, when caching and pipelining
> >> architectures become relevant. Writing a network protocol compiler that
> >> produces efficient code for different NICs and different CPUs is a
> daunting
> >> task. And unlike with 8 bit computers, there are no simple metrics ('you
> >> need at least 32kb RAM to run this code' vs 'this NIC supports 4k queues
> >> with PIE, Codel', 'this CPU has 20 Mbyte of Intel SmartCache').
> >> >>
> >> >> Linux kernel still lacks an easy way to setup many features in Smart
> >> NIC's. DPDK has rte_flow which allows direct
> >> >> access to hardware flow processing. But DPDK lacks any reasonable
> form
> >> of shaper control.
> >> >>
> >> >> > Ebpf is very close to what was described in this 1995 exokernel
> >> paper(
> >> https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf).
> >> The idea of the exokernel was to have easily loadable, verified code in
> the
> >> kernel -- eg the security-critical task of assigning a packet to a
> session
> >> of a user -- and leave the rest of the protocol -- eg tcp
> retransmissions
> >> -- to the user space. AFAIK few people use ebpf like this, but it
> should be
> >> possible.
> >> >> >
> >> >> > Ebpf manages the abstraction part well, but sacrifices a lot of
> >> performance -- eg lack of aggressive batching like vpp / fd.io does.
> With
> >> DPDK, you often find out that your nic's hardware or driver doesn't
> >> support the function that you hoped to use and end up optimizing for a
> >> particular hardware. Even if driver and hardware support a
> functionality,
> >> it may very well be that hardware resources are too limited for your
> >> particular use case. The abstraction is there, but your code is still
> >> hardware specific.
> >> >>
> >> >> There were a few NIC's that offloaded eBPF but they never really went
> >> mainstream.
> >> >>
> >> >
> >> >
> >> >>
> >> >> > -----Ursprüngliche Nachricht-----
> >> >> > Von: David P. Reed <dpreed@deepplum.com>
> >> >> > Gesendet: Samstag, 13. September 2025 22:33
> >> >> > An: Tom Herbert <tom@herbertland.com>
> >> >> > Cc: Frantisek Borsik <frantisek.borsik@gmail.com>; Cake List <
> >> cake@lists.bufferbloat.net>; codel@lists.bufferbloat.net; bloat <
> >> bloat@lists.bufferbloat.net>; Jeremy Austin via Rpm <
> >> rpm@lists.bufferbloat.net>
> >> >> > Betreff: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only
> Tom
> >> Herbert (almost to the date, 10 years after XDP was released)
> >> >> >
> >> >> >
> >> >> > Tom -
> >> >> >
> >> >> > An architecture-independent network framework independent of the OS
> >> kernel's peculiarities seems within reach (though a fair bit of work),
> and
> >> I think it would be a GOOD THING indeed. IMHO the Linux networking
> stack in
> >> the kernel is a horrific mess, and it doesn't have to be.
> >> >> >
> >> >> > The reason it doesn't have to be is that there should be no reason
> it
> >> cannot run in ring3/userland, just like DPDK. And it should be built
> using
> >> "real-time" userland programming techniques. (avoiding the generic linux
> >> scheduler). The ONLY reason for involving the scheduler would be because
> >> there aren't enough cores. Linux was designed to be a uniprocessor Unix,
> >> and that just is no longer true at all. With hyperthreading, too, one
> need
> >> never abandon a processor's context in userspace to run some "userland"
> >> application.
> >> >> >
> >> >> > This would rip a huge amount of kernel code out of the kernel. (at
> >> least 50%, and probably more). THe security issues of all those 3rd
> party
> >> network drivers would go away.
> >> >> >
> >> >> > And the performance would be much higher for networking. (running
> in
> >> ring 3, especially if you don't do system calls, is no performance
> penalty,
> >> and interprocessor communications using shared memory is much lower
> latency
> >> than Linux IPC or mutexes).
> >> >> >
> >> >> > I like the idea of a compilation based network stack, at a slightly
> >> higher level than C. eBPF is NOT what I have in mind - it's an
> interpreter
> >> with high overhead. The language should support high-performance
> >> co-routining - shared memory, ideally. I don't thing GC is a good thing.
> >> Rust might be a good starting point because its memory management is
> safe.
> >> >> > To me, some of what the base of DPDK is like is good stuff.
> However,
> >> it isn't architecturally neutral.
> >> >> >
> >> >> > To me, the network stack should not be entangled with interrupt
> >> handling at all. "polling" is far more performant under load. The only
> use
> >> for interrupts is when the network stack is completely idle. That would
> be,
> >> in userland, a "wait for interrupt" call (not a poll). Ideally, on
> recent
> >> Intel machines, a userspace version of MONITOR/MWAIT).
> >> >> >
> >> >> > Now I know that Linus and his crew are really NOT gonna like this.
> >> Linus is still thinking like MINIX, a uniprocessor time-sharing system
> with
> >> rich OS functions in the kernel and doing "file" reads and writes to
> >> communicate with the kernel state. But it is a much more modern way to
> >> think of real-time IO in a modern operating system. (Windows and macOS
> are
> >> also Unix-like, uniprocessor monolithic kernel designs).
> >> >> >
> >> >> > So, if XDP2 got away from the Linux kernel, it could be great.
> >> >> > BTW, io_uring, etc. are half-measures. They address getting away
> from
> >> interrupts toward polling, but they still make the mistake of keeping
> huge
> >> drivers in the kernel.
> >> >>
> >> >> DPDK already supports use of XDP as a way to do userspace networking.
> >> >> It is good generic way to get packets in/out but the dedicated
> >> userspace drivers allow
> >> >> for more access to hardware. The XDP abstraction gets in the way of
> >> little things like programming
> >> >> VLAN's, etc.
> >> >>
> >> >> The tradeoff is userspace networking works great for infrastructure,
> >> routers, switches, firewalls etc;
> >> >> but userspace networking for network stacks to applications is hard
> to
> >> do, and loses the isolation
> >> >> that the kernel provides.
> >> >>
> >> >> > > I think it is interesting as a concept. A project I am advising
> >> has been > using DPDK very effectively to get rid of the huge path and
> >> locking delays > in the current Linux network stack. XDP2 could be
> >> supported in a ring3 > (user) address space, achieving a similar
> result.
> >> >> > HI David,
> >> >> > The idea is you could write the code in XDP2 and it would be
> compiled
> >> to DPDK or eBPF and the compiler would handle the optimizations.
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > But I don't think XDP2 is going that direction - so it may be
> >> stuckinto > the mess of kernel space networking. Adding eBPF only has
> made
> >> this more of > a mess, by the way (and adding a new "compiler" that
> needs
> >> to be veriried > as safe for the kernel).
> >> >> > Think of XDP2 as the generalization of XDP to go beyond just the
> >> kernel. The idea is that the user writes their datapath code once and
> they
> >> compile it to run in whatever targets they have-- DPDK, P4, other
> >> programmable hardware, and yes XDP/eBPF. It's really not limited to
> kernel
> >> networking.
> >> >> > As for the name XDP2, when we created XDP, eXpress DataPath, my
> >> vision was that it would be implementation agnostic. eBPF was the first
> >> instantiation for practicality, but now ten years later I think we can
> >> realize the initial vision.
> >> >> > Tom
> >> >>
> >> >>
> >> >> At this point, different network architectures get focused at
> different
> >> use cases.
> >> >> The days of the one-size-fits-all networking of BSD Unix is dead.
> >>
> >
>
>
>
next prev parent reply other threads:[~2025-09-16 0:05 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-09 10:32 [Bloat] " Frantisek Borsik
2025-09-09 20:25 ` [Bloat] Re: [Cake] " David P. Reed
2025-09-09 21:02 ` Frantisek Borsik
2025-09-09 21:36 ` [Bloat] Re: [Cake] " Tom Herbert
2025-09-10 8:54 ` BeckW
2025-09-10 13:59 ` Tom Herbert
2025-09-10 14:06 ` Tom Herbert
2025-09-13 20:33 ` David P. Reed
2025-09-13 20:58 ` Tom Herbert
2025-09-14 18:00 ` David P. Reed
2025-09-14 18:18 ` David Collier-Brown
2025-09-14 18:38 ` Tom Herbert
2025-09-15 8:39 ` BeckW
2025-09-15 15:16 ` Stephen Hemminger
2025-09-15 18:07 ` Frantisek Borsik
2025-09-15 18:35 ` Tom Herbert
2025-09-15 22:26 ` Frantisek Borsik
2025-09-15 23:16 ` David P. Reed
2025-09-16 0:05 ` Tom Herbert [this message]
[not found] ` <FR2PPFEFD18174CA00474D0DC8DBDA3EE00DC0EA@FR2PPFEFD18174C.DEUP281.PROD.OUT LOOK.COM>
[not found] ` <FR2PPFEFD18174CA00474D0DC8DBDA3EE00DC0EA@FR2PPFEFD18174C.DEUP281.PROD.OUTLOO K.COM>
2025-09-13 20:35 ` David P. Reed
2025-09-15 23:16 ` [Bloat] Re: [Rpm] Re: [Cake] " Robert McMahon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://lists.bufferbloat.net/postorius/lists/bloat.lists.bufferbloat.net/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CALx6S36YW0pCDCeHT_mYkKcnnqY4ZdmYOC9+j0r1M5r6=w2Pjg@mail.gmail.com' \
--to=tom@herbertland.com \
--cc=beckw@telekom.de \
--cc=bloat@lists.bufferbloat.net \
--cc=cake@lists.bufferbloat.net \
--cc=codel@lists.bufferbloat.net \
--cc=dpreed@deepplum.com \
--cc=frantisek.borsik@gmail.com \
--cc=rpm@lists.bufferbloat.net \
--cc=stephen@networkplumber.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox