From: Frantisek Borsik <frantisek.borsik@gmail.com>
To: Tom Herbert <tom@herbertland.com>
Cc: "David P. Reed" <dpreed@deepplum.com>,
Stephen Hemminger <stephen@networkplumber.org>,
BeckW--- via Bloat <bloat@lists.bufferbloat.net>,
beckw@telekom.de, Cake List <cake@lists.bufferbloat.net>,
codel@lists.bufferbloat.net,
Jeremy Austin via Rpm <rpm@lists.bufferbloat.net>
Subject: [Cake] Re: [Bloat] Re: Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
Date: Wed, 17 Sep 2025 21:29:29 +0200 [thread overview]
Message-ID: <CAJUtOOguw4-0+_Lv0htK6LTtjqTOyFneJb9ORQ+w3Q1rgL-N1w@mail.gmail.com> (raw)
In-Reply-To: <CALx6S36YW0pCDCeHT_mYkKcnnqY4ZdmYOC9+j0r1M5r6=w2Pjg@mail.gmail.com>
And it's up:
https://hackaday.com/2025/09/17/floss-weekly-episode-847-this-is-networking/
- watch, listen.
Let's continue the discussion...
All the best,
Frank
Frantisek (Frank) Borsik
In loving memory of Dave Täht: 1965-2025
https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
https://www.linkedin.com/in/frantisekborsik
Signal, Telegram, WhatsApp: +421919416714
iMessage, mobile: +420775230885
Skype: casioa5302ca
frantisek.borsik@gmail.com
On Tue, Sep 16, 2025 at 2:05 AM Tom Herbert <tom@herbertland.com> wrote:
>
>
> On Mon, Sep 15, 2025, 4:16 PM David P. Reed <dpreed@deepplum.com> wrote:
>
>>
>>
>> On Monday, September 15, 2025 18:26, "Frantisek Borsik" <
>> frantisek.borsik@gmail.com> said:
>>
>> > Fresh from Tom's oven:
>> >
>> https://medium.com/@tom_84912/programming-a-parser-in-xdp2-is-as-easy-as-pie-8f26c8b3e704
>>
>> Nice enough. I've always wondered why no one did a table driven packet
>> parser. Surely wireshark did this?
>>
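The table-driven parser David wonders about can be sketched in a few lines of C: a flat table maps each EtherType to the next protocol state, and a generic loop does the dispatch. The table layout and names below are invented for illustration; they are not XDP2's (or Wireshark's) actual structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a table-driven first parse step:
 * EtherType -> next-protocol, driven by data rather than branchy code. */

enum proto { PROTO_UNKNOWN, PROTO_IPV4, PROTO_IPV6, PROTO_ARP };

struct parse_entry {
    uint16_t  ethertype;  /* host byte order, for simplicity */
    enum proto next;
};

static const struct parse_entry eth_table[] = {
    { 0x0800, PROTO_IPV4 },
    { 0x86DD, PROTO_IPV6 },
    { 0x0806, PROTO_ARP  },
};

static enum proto classify(uint16_t ethertype)
{
    /* Linear scan is fine for a tiny table; a real parser would use a
     * perfect hash or jump table generated at compile time. */
    for (size_t i = 0; i < sizeof(eth_table) / sizeof(eth_table[0]); i++)
        if (eth_table[i].ethertype == ethertype)
            return eth_table[i].next;
    return PROTO_UNKNOWN;
}
```

The point of the data-driven form is that adding a protocol is a table entry, not new control flow, which is also what makes it amenable to compilation onto parser hardware.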
>> However, the spaghetti mess in practice partially arises from skbuff
>> (Linux) or mbuf (BSD) structuring. I can testify to this because I've had
>> to add a low-level protocol on top of Ethernet in a version of the FreeBSD
>> kernel, and that protocol had to (for performance reasons) modify headers
>> "in place" to generate response packets without copying or reallocating.
>>
>> This is an additional reason why Linux's network stack is a mess - skbuff
>> management is generally not compatible with packet formats in NICs with
>> acceleration features that parse or verify parts of the packets.
>>
>> Admittedly, I've focused on minimizing end-to-end latency in protocols
>> where the minimum speed is 10 Gb/sec Ethernet, and where the hardware
>> functionality (including pipelining the packet processing in the NIC
>> controller, possibly with FPGAs) is essential.
>>
>> While the header parsing is a chunk of the performance issue, specialized
>> memory management is also.
>>
>> Hoping to see what replaces skbuff in XDP2.
>>
>
> PVbufs :-)
>
>
> https://medium.com/@tom_84912/pvbufs-hardware-friendly-metadata-structure-for-scatter-gather-6fce8d42f75f
>
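The general shape of a hardware-friendly scatter/gather descriptor - a flat, fixed-size vector of (address, length) segments instead of a linked skbuff chain - can be sketched as below. This layout is only an illustration of the idea; the actual PVbuf encoding is described in the linked article.

```c
#include <stddef.h>

/* Illustrative scatter/gather metadata: a packet is a small inline vector
 * of segments, so hardware can walk it without chasing pointers. */

struct sg_seg {
    const void *addr;  /* start of this segment */
    size_t      len;   /* bytes in this segment */
};

struct sg_vec {
    size_t        nsegs;
    struct sg_seg seg[8];  /* fixed inline array keeps metadata compact */
};

static size_t sg_total_len(const struct sg_vec *v)
{
    size_t total = 0;
    for (size_t i = 0; i < v->nsegs; i++)
        total += v->seg[i].len;
    return total;
}
```

A fixed inline array trades some flexibility for predictable size and locality, which is exactly the property DMA engines and parser pipelines want.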
>
>> >
>> >
>> > All the best,
>> >
>> > Frank
>> >
>> >
>> > On Mon, Sep 15, 2025 at 8:35 PM Tom Herbert <tom@herbertland.com>
>> wrote:
>> >
>> >> On Mon, Sep 15, 2025 at 11:07 AM Frantisek Borsik
>> >> <frantisek.borsik@gmail.com> wrote:
>> >> >
>> >> >
>> >> > "There were a few NICs that offloaded eBPF, but they never really
>> >> > went mainstream."
>> >> >
>> >> > And even then, they were doing only 40 Gbps, like
>> >> > https://netronome.com, and didn't even support full eBPF...
>> >> >
>> >> > "They only support a pretty small subset of eBPF (in particular they
>> >> don't support the LPM map type, which was our biggest performance pain
>> >> point), and have a pretty cool user replaceable firmware system. They
>> also
>> >> don't have the higher speeds - above 40 Gbps - where the offloading
>> would
>> >> be most useful."
>> >>
>> >> Yeah, the attempts at offloading eBPF were doomed to fail. It's a
>> >> restricted model, lacks parallelism, doesn't support inline
>> >> accelerators, and requiring the eBPF VM makes it a non-starter. DPDK
>> >> would fail as well. The kernel/host environment and hardware
>> >> environments are quite different. If we try to force the hardware to
>> >> look like the host to make eBPF or DPDK portable then we'll lose the
>> >> performance advantages of running in the hardware. We need a model
>> >> that allows the software to adapt to HW, not the other way around (of
>> >> course, in a perfect world we'd do software/hardware codesign from the
>> >> get-go).
>> >>
>> >> >
>> >> > Btw, Tom will be at FLOSS Weekly tomorrow (Tuesday), 12:20 EDT /
>> 11:20
>> >> CDT / 10:20 MDT / 9:20 PDT
>> >>
>> >> Can't wait!
>> >>
>> >> >
>> >> > https://www.youtube.com/live/OBW5twvmHOI
>> >> >
>> >> >
>> >> > All the best,
>> >> >
>> >> > Frank
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Sep 15, 2025 at 5:16 PM Stephen Hemminger <
>> >> stephen@networkplumber.org> wrote:
>> >> >>
>> >> >> On Mon, 15 Sep 2025 08:39:48 +0000
>> >> >> BeckW--- via Bloat <bloat@lists.bufferbloat.net> wrote:
>> >> >>
>> >> >> > Programming networking hardware is a bit like programming 8-bit
>> >> computers in the 1980s: the hardware is often too limited and varied
>> to
>> >> support useful abstractions. This is also true for CPU-based networking
>> >> once you get into the >10 Gbps realm, when caching and pipelining
>> >> architectures become relevant. Writing a network protocol compiler that
>> >> produces efficient code for different NICs and different CPUs is a
>> daunting
>> >> task. And unlike with 8 bit computers, there are no simple metrics
>> ('you
>> >> need at least 32kb RAM to run this code' vs 'this NIC supports 4k
>> queues
>> >> with PIE, Codel', 'this CPU has 20 Mbyte of Intel SmartCache').
>> >> >>
>> >> >> The Linux kernel still lacks an easy way to set up many features in
>> >> >> SmartNICs. DPDK has rte_flow, which allows direct access to hardware
>> >> >> flow processing. But DPDK lacks any reasonable form of shaper control.
>> >> >>
>> >> >> > eBPF is very close to what was described in this 1995 exokernel
>> >> paper(
>> >> https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf).
>> >> The idea of the exokernel was to have easily loadable, verified code
>> in the
>> >> kernel -- eg the security-critical task of assigning a packet to a
>> session
>> >> of a user -- and leave the rest of the protocol -- eg tcp
>> retransmissions
>> >> -- to the user space. AFAIK few people use eBPF like this, but it
>> should be
>> >> possible.
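The exokernel-style split described above - keeping only the small, verifiable step of mapping a packet's flow key to an owning user-space session in the kernel - can be sketched like this. All names and the fixed-size table are invented for illustration; a real implementation would use an eBPF map.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical demux: the only "trusted" step is flow-key -> session.
 * Everything else (retransmission, reassembly, ...) lives in user space. */

struct flow_key {
    uint32_t saddr, daddr;  /* IPv4 source/destination address */
    uint16_t sport, dport;  /* transport ports */
};

struct session { int owner_pid; };

#define MAX_SESSIONS 16

static struct {
    struct flow_key key;
    struct session  s;
    int             used;
} table[MAX_SESSIONS];

/* Returns the owning pid, or -1 when no session matches (drop or punt). */
static int demux(const struct flow_key *k)
{
    for (size_t i = 0; i < MAX_SESSIONS; i++)
        if (table[i].used &&
            table[i].key.saddr == k->saddr && table[i].key.daddr == k->daddr &&
            table[i].key.sport == k->sport && table[i].key.dport == k->dport)
            return table[i].s.owner_pid;
    return -1;
}

static void bind_session(const struct flow_key *k, int pid)
{
    for (size_t i = 0; i < MAX_SESSIONS; i++)
        if (!table[i].used) {
            table[i].key = *k;
            table[i].s.owner_pid = pid;
            table[i].used = 1;
            return;
        }
}
```

Because the in-kernel piece is just a bounded table lookup, it is exactly the kind of code an eBPF verifier (or an exokernel's checker) can prove safe.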
>> >> >> >
>> >> >> > eBPF manages the abstraction part well, but sacrifices a lot of
>> >> performance -- eg lack of aggressive batching like vpp / fd.io does.
>> With
>> >> DPDK, you often find out that your NIC's hardware or driver doesn't
>> >> support the function that you hoped to use and end up optimizing for a
>> >> particular hardware. Even if driver and hardware support a
>> functionality,
>> >> it may very well be that hardware resources are too limited for your
>> >> particular use case. The abstraction is there, but your code is still
>> >> hardware specific.
>> >> >>
>> >> >> There were a few NICs that offloaded eBPF, but they never really
>> >> >> went mainstream.
>> >> >>
>> >> >
>> >> >
>> >> >>
>> >> >> > -----Original Message-----
>> >> >> > From: David P. Reed <dpreed@deepplum.com>
>> >> >> > Sent: Saturday, September 13, 2025 22:33
>> >> >> > To: Tom Herbert <tom@herbertland.com>
>> >> >> > Cc: Frantisek Borsik <frantisek.borsik@gmail.com>; Cake List <
>> >> cake@lists.bufferbloat.net>; codel@lists.bufferbloat.net; bloat <
>> >> bloat@lists.bufferbloat.net>; Jeremy Austin via Rpm <
>> >> rpm@lists.bufferbloat.net>
>> >> >> > Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only
>> >> Tom Herbert (almost to the date, 10 years after XDP was released)
>> >> >> >
>> >> >> >
>> >> >> > Tom -
>> >> >> >
>> >> >> > An architecture-independent network framework independent of the
>> OS
>> >> kernel's peculiarities seems within reach (though a fair bit of work),
>> and
>> >> I think it would be a GOOD THING indeed. IMHO the Linux networking
>> stack in
>> >> the kernel is a horrific mess, and it doesn't have to be.
>> >> >> >
>> >> >> > The reason it doesn't have to be is that there should be no
>> reason it
>> >> cannot run in ring3/userland, just like DPDK. And it should be built
>> using
>> >> "real-time" userland programming techniques. (avoiding the generic
>> linux
>> >> scheduler). The ONLY reason for involving the scheduler would be
>> because
>> >> there aren't enough cores. Linux was designed to be a uniprocessor
>> Unix,
>> >> and that just is no longer true at all. With hyperthreading, too, one
>> need
>> >> never abandon a processor's context in userspace to run some "userland"
>> >> application.
>> >> >> >
>> >> >> > This would rip a huge amount of code out of the kernel (at least
>> >> 50%, and probably more). The security issues of all those third-party
>> >> network drivers would go away.
>> >> >> >
>> >> >> > And the performance would be much higher for networking. (Running
>> >> in ring 3, especially if you don't make system calls, carries no
>> >> performance penalty, and interprocessor communication using shared
>> >> memory has much lower latency than Linux IPC or mutexes.)
>> >> >> >
>> >> >> > I like the idea of a compilation based network stack, at a
>> slightly
>> >> higher level than C. eBPF is NOT what I have in mind - it's an
>> interpreter
>> >> with high overhead. The language should support high-performance
>> >> co-routining - shared memory, ideally. I don't think GC is a good
>> thing.
>> >> Rust might be a good starting point because its memory management is
>> safe.
>> >> >> > To me, some of what the base of DPDK is like is good stuff.
>> However,
>> >> it isn't architecturally neutral.
>> >> >> >
>> >> >> > To me, the network stack should not be entangled with interrupt
>> >> handling at all. "polling" is far more performant under load. The only
>> use
>> >> for interrupts is when the network stack is completely idle. That
>> would be,
>> >> in userland, a "wait for interrupt" call (not a poll). Ideally, on
>> recent
>> >> Intel machines, a userspace version of MONITOR/MWAIT.
>> >> >> >
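The poll-first, wait-only-when-idle receive loop David describes can be sketched as below. Here ring_pop() and wait_for_irq() are hypothetical stand-ins for a driver's RX descriptor ring and a kernel "wait for interrupt" (or MWAIT-style) call; the tiny in-memory ring exists only to make the sketch self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

#define IDLE_SPINS 1024  /* spin this many empty polls before sleeping */

/* Stand-in RX ring: -1 marks an empty descriptor. */
static int    ring[4] = { 7, 8, 9, -1 };
static size_t ring_head;

static bool ring_pop(int *pkt)
{
    if (ring[ring_head] < 0)
        return false;
    *pkt = ring[ring_head++];
    return true;
}

static int packets_seen;

static void handle(int pkt) { (void)pkt; packets_seen++; }

/* Would block until the NIC raises an interrupt; no-op in this sketch. */
static void wait_for_irq(void) { }

static void rx_loop(int max_iters)
{
    int idle = 0;
    while (max_iters-- > 0) {
        int pkt;
        if (ring_pop(&pkt)) {          /* hot path: pure polling */
            handle(pkt);
            idle = 0;
        } else if (++idle >= IDLE_SPINS) {
            wait_for_irq();            /* interrupts only when truly idle */
            idle = 0;
        }
    }
}
```

Under load the loop never takes an interrupt at all; the IDLE_SPINS threshold is the knob that trades wake-up latency against wasted cycles when traffic stops.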
>> >> >> > Now I know that Linus and his crew are really NOT gonna like this.
>> >> Linus is still thinking like MINIX, a uniprocessor time-sharing system
>> with
>> >> rich OS functions in the kernel and doing "file" reads and writes to
>> >> communicate with the kernel state. But it is a much more modern way to
>> >> think of real-time IO in a modern operating system. (Windows and macOS
>> are
>> >> also Unix-like, uniprocessor monolithic kernel designs).
>> >> >> >
>> >> >> > So, if XDP2 got away from the Linux kernel, it could be great.
>> >> >> > BTW, io_uring, etc. are half-measures. They address getting away
>> from
>> >> interrupts toward polling, but they still make the mistake of keeping
>> huge
>> >> drivers in the kernel.
>> >> >>
>> >> >> DPDK already supports use of XDP as a way to do userspace
>> networking.
>> >> >> It is a good generic way to get packets in/out, but the dedicated
>> >> >> userspace drivers allow for more access to hardware. The XDP
>> >> >> abstraction gets in the way of little things like programming
>> >> >> VLANs, etc.
>> >> >>
>> >> >> The tradeoff is that userspace networking works great for
>> >> >> infrastructure - routers, switches, firewalls, etc. - but userspace
>> >> >> networking for application network stacks is hard to do, and loses
>> >> >> the isolation that the kernel provides.
>> >> >>
>> >> >> > > I think it is interesting as a concept. A project I am advising
>> >> >> > > has been using DPDK very effectively to get rid of the huge path
>> >> >> > > and locking delays in the current Linux network stack. XDP2 could
>> >> >> > > be supported in a ring3 (user) address space, achieving a similar
>> >> >> > > result.
>> >> >> > Hi David,
>> >> >> > The idea is that you could write the code in XDP2 and it would be
>> >> >> > compiled to DPDK or eBPF, and the compiler would handle the
>> >> >> > optimizations.
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > But I don't think XDP2 is going that direction - so it may be
>> >> >> > > stuck in the mess of kernel-space networking. Adding eBPF has
>> >> >> > > only made this more of a mess, by the way (and adding a new
>> >> >> > > "compiler" that needs to be verified as safe for the kernel).
>> >> >> > Think of XDP2 as the generalization of XDP to go beyond just the
>> >> kernel. The idea is that the user writes their datapath code once and
>> they
>> >> compile it to run on whatever targets they have -- DPDK, P4, other
>> >> programmable hardware, and yes XDP/eBPF. It's really not limited to
>> kernel
>> >> networking.
>> >> >> > As for the name XDP2, when we created XDP, eXpress DataPath, my
>> >> vision was that it would be implementation agnostic. eBPF was the first
>> >> instantiation for practicality, but now ten years later I think we can
>> >> realize the initial vision.
>> >> >> > Tom
>> >> >>
>> >> >>
>> >> >> At this point, different network architectures are focused on
>> >> >> different use cases.
>> >> >> The days of the one-size-fits-all networking of BSD Unix are over.
>> >>
>> >
>>
>>
>>
Thread overview: 21+ messages [~2025-09-17 19:28 UTC]
2025-09-09 10:32 [Cake] " Frantisek Borsik
2025-09-09 20:25 ` [Cake] " David P. Reed
2025-09-09 21:02 ` Frantisek Borsik
2025-09-09 21:36 ` Tom Herbert
2025-09-10 8:54 ` [Cake] Re: [Bloat] " BeckW
2025-09-10 13:59 ` Tom Herbert
2025-09-10 14:06 ` Tom Herbert
2025-09-13 20:33 ` [Cake] " David P. Reed
2025-09-13 22:57 ` Tom Herbert
2025-09-14 18:00 ` David P. Reed
2025-09-14 18:38 ` Tom Herbert
2025-09-15 8:39 ` [Cake] Re: [Bloat] " BeckW
2025-09-15 15:16 ` Stephen Hemminger
2025-09-15 18:07 ` Frantisek Borsik
2025-09-15 18:35 ` Tom Herbert
2025-09-15 22:26 ` Frantisek Borsik
2025-09-15 23:16 ` David P. Reed
2025-09-16 0:05 ` Tom Herbert
2025-09-17 19:29 ` Frantisek Borsik [this message]
[not found] ` <FR2PPFEFD18174CA00474D0DC8DBDA3EE00DC0EA@FR2PPFEFD18174C.DEUP281.PROD.OUTLOOK.COM>
[not found] ` <FR2PPFEFD18174CA00474D0DC8DBDA3EE00DC0EA@FR2PPFEFD18174C.DEUP281.PROD.OUTLOOK.COM>
2025-09-13 20:35 ` David P. Reed
2025-09-15 23:16 ` [Cake] Re: [Rpm] " Robert McMahon
List information: https://lists.bufferbloat.net/postorius/lists/cake.lists.bufferbloat.net/