Date: Mon, 15 Sep 2025 19:16:24 -0400 (EDT)
From: "David P. Reed" <dpreed@deepplum.com>
To: "Frantisek Borsik" <frantisek.borsik@gmail.com>
Cc: "Tom Herbert", stephen@networkplumber.org, "BeckW--- via Bloat", beckw@telekom.de, cake@lists.bufferbloat.net, codel@lists.bufferbloat.net, rpm@lists.bufferbloat.net
Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)

On Monday, September 15, 2025 18:26, "Frantisek Borsik" <frantisek.borsik@gmail.com> said:

> Fresh from Tom's oven:
> https://medium.com/@tom_84912/programming-a-parser-in-xdp2-is-as-easy-as-pie-8f26c8b3e704

Nice enough. I've always wondered why no one did a table-driven packet parser. Surely Wireshark did this?
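
To be concrete about what I mean by "table-driven": something like the sketch below, where each protocol layer is a table entry rather than another branch of hand-written parsing code. Purely illustrative - the types and names are invented here, not XDP2's (or Wireshark's) actual API.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

struct proto_node;

struct proto_edge {                    /* "if next-protocol == key, go to node" */
    uint16_t key;
    const struct proto_node *node;
};

struct proto_node {
    uint16_t hdr_len;                  /* bytes occupied by this header */
    uint16_t key_off, key_len;         /* where the next-protocol field lives */
    const struct proto_edge *edges;    /* terminated by .node == NULL; NULL = leaf */
};

/* Walk the tables instead of hand-writing nested if/else per protocol.
 * Returns total header bytes parsed, or -1 on truncation / unknown protocol. */
static int parse(const uint8_t *pkt, size_t len, const struct proto_node *n)
{
    size_t off = 0;

    while (n) {
        if (off + n->hdr_len > len)
            return -1;                              /* truncated packet */
        if (!n->edges)
            return (int)(off + n->hdr_len);         /* leaf header: done */

        uint16_t key;
        if (n->key_len == 2) {
            memcpy(&key, pkt + off + n->key_off, 2);
            key = ntohs(key);
        } else {
            key = pkt[off + n->key_off];
        }
        off += n->hdr_len;

        const struct proto_node *next = NULL;
        for (const struct proto_edge *e = n->edges; e->node; e++) {
            if (e->key == key) {
                next = e->node;
                break;
            }
        }
        if (!next)
            return -1;                              /* no table entry for this protocol */
        n = next;
    }
    return (int)off;
}

An Ethernet node, for instance, would be { .hdr_len = 14, .key_off = 12, .key_len = 2 } with edges keyed by EtherType. Adding a protocol means adding a table entry, not touching the parse loop - which is presumably the point of doing it declaratively the way XDP2 does.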
However, the spaghetti mess in practice partially arises from skbuff (Linux) or mbuf (BSD) structuring. I can testify to this because I've had to add a low-level protocol on top of Ethernet in a version of the FreeBSD kernel, and that protocol had to (for performance reasons) modify headers "in place" to generate response packets without copying or reallocating (a toy sketch of that pattern is below).

This is an additional reason why Linux's network stack is a mess - skbuff management is generally not compatible with packet formats in NICs with acceleration features that parse or verify parts of the packets.

Admittedly, I've focused on minimizing end-to-end latency in protocols where the minimum speed is 10 Gb/sec Ethernet, and where the hardware functionality (including pipelining the packet processing in the NIC controller, possibly with FPGAs) is essential.

While header parsing is a chunk of the performance issue, specialized memory management is also.

Hoping to see what replaces skbuff in XDP2.
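
For anyone who hasn't had to do it, the "modify in place" pattern is roughly the toy below - not the actual FreeBSD code, just its shape: the reply is built by patching the frame that arrived, so nothing is copied or reallocated on the fast path.

#include <stdint.h>
#include <string.h>

struct ether_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;                 /* already in network byte order */
};

/* Turn a received request frame into a reply, in the same buffer. */
static void make_reply_in_place(uint8_t *frame, const uint8_t my_mac[6])
{
    struct ether_hdr *eth = (struct ether_hdr *)frame;

    memcpy(eth->dst, eth->src, 6);      /* reply goes back to the sender */
    memcpy(eth->src, my_mac, 6);
    /* ethertype and payload stay exactly where they are; the protocol's
     * own fields get patched the same way. No buffer is copied, split,
     * or reallocated - it goes straight back out the transmit path. */
}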
>
>
> All the best,
>
> Frank
>
> Frantisek (Frank) Borsik
>
> In loving memory of Dave Täht: 1965-2025
>
> https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
>
> https://www.linkedin.com/in/frantisekborsik
>
> Signal, Telegram, WhatsApp: +421919416714
>
> iMessage, mobile: +420775230885
>
> Skype: casioa5302ca
>
> frantisek.borsik@gmail.com
>
>
> On Mon, Sep 15, 2025 at 8:35 PM Tom Herbert wrote:
>
>> On Mon, Sep 15, 2025 at 11:07 AM Frantisek Borsik wrote:
>> >
>> > "There were a few NIC's that offloaded eBPF but they never really went mainstream."
>> >
>> > And even then, they were doing only 40 Gbps, like https://netronome.com and didn't even support full eBPF...
>> >
>> > "They only support a pretty small subset of eBPF (in particular they don't support the LPM map type, which was our biggest performance pain point), and have a pretty cool user-replaceable firmware system. They also don't have the higher speeds - above 40 Gbps - where the offloading would be most useful."
>>
>> Yeah, the attempts at offloading eBPF were doomed to fail. It's a restricted model, lacks parallelism, doesn't support inline accelerators, and requires the eBPF VM, which makes it a non-starter. DPDK would fail as well. The kernel/host environment and hardware environments are quite different. If we try to force the hardware to look like the host to make eBPF or DPDK portable then we'll lose the performance advantages of running in the hardware. We need a model that allows the software to adapt to HW, not the other way around (of course, in a perfect world we'd do software/hardware codesign from the get-go).
>>
>> >
>> > Btw, Tom will be at FLOSS Weekly tomorrow (Tuesday), 12:20 EDT / 11:20 CDT / 10:20 MDT / 9:20 PDT
>>
>> Can't wait!
>>
>> >
>> > https://www.youtube.com/live/OBW5twvmHOI
>> >
>> > All the best,
>> >
>> > Frank
>> >
>> > Frantisek (Frank) Borsik
>> >
>> > In loving memory of Dave Täht: 1965-2025
>> >
>> > https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
>> >
>> > https://www.linkedin.com/in/frantisekborsik
>> >
>> > Signal, Telegram, WhatsApp: +421919416714
>> >
>> > iMessage, mobile: +420775230885
>> >
>> > Skype: casioa5302ca
>> >
>> > frantisek.borsik@gmail.com
>> >
>> >
>> > On Mon, Sep 15, 2025 at 5:16 PM Stephen Hemminger <stephen@networkplumber.org> wrote:
>> >>
>> >> On Mon, 15 Sep 2025 08:39:48 +0000 BeckW--- via Bloat <bloat@lists.bufferbloat.net> wrote:
>> >>
>> >> > Programming networking hardware is a bit like programming 8-bit computers in the 1980s: the hardware is often too limited and varied to support useful abstractions. This is also true for CPU-based networking once you get into the >10 Gbps realm, when caching and pipelining architectures become relevant. Writing a network protocol compiler that produces efficient code for different NICs and different CPUs is a daunting task. And unlike with 8-bit computers, there are no simple metrics ('you need at least 32 kB RAM to run this code' vs 'this NIC supports 4k queues with PIE, Codel', 'this CPU has 20 Mbyte of Intel SmartCache').
>> >>
>> >> The Linux kernel still lacks an easy way to set up many features in SmartNICs. DPDK has rte_flow, which allows direct access to hardware flow processing. But DPDK lacks any reasonable form of shaper control.
>> >>
>> >> > Ebpf is very close to what was described in this 1995 exokernel paper (https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf). The idea of the exokernel was to have easily loadable, verified code in the kernel -- e.g. the security-critical task of assigning a packet to a session of a user -- and leave the rest of the protocol -- e.g. TCP retransmissions -- to user space. AFAIK few people use ebpf like this, but it should be possible.
>> >> >
>> >> > Ebpf manages the abstraction part well, but sacrifices a lot of performance -- e.g. lack of aggressive batching like vpp / fd.io does. With DPDK, you often find out that your NIC's hardware or driver doesn't support the function that you hoped to use, and you end up optimizing for a particular hardware. Even if driver and hardware support a functionality, it may very well be that hardware resources are too limited for your particular use case. The abstraction is there, but your code is still hardware specific.
>> >>
>> >> There were a few NIC's that offloaded eBPF but they never really went mainstream.
>> >>
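
A side note on the exokernel comparison above: the split BeckW describes - verified in-kernel code does nothing but assign a packet to a session, and userspace does the rest of the protocol - is roughly expressible today with XDP plus AF_XDP. A sketch; the port number and map size are made up, and VLAN/IPv6/IP-option handling is omitted:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* One AF_XDP socket per RX queue; userspace owns the protocol. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks SEC(".maps");

SEC("xdp")
int classify(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *end  = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)(ip + 1);       /* assumes no IP options */
    if ((void *)(udp + 1) > end)
        return XDP_PASS;

    /* "Our" session: hand the frame to the userspace stack for this queue. */
    if (udp->dest == bpf_htons(7777))
        return bpf_redirect_map(&xsks, ctx->rx_queue_index, XDP_PASS);

    return XDP_PASS;            /* everything else: normal kernel stack */
}

char _license[] SEC("license") = "GPL";

Whether that counts as "using ebpf like an exokernel" is debatable, but it does keep the in-kernel piece small enough to verify.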
>> >> > -----Original Message-----
>> >> > From: David P. Reed <dpreed@deepplum.com>
>> >> > Sent: Saturday, September 13, 2025 22:33
>> >> > To: Tom Herbert
>> >> > Cc: Frantisek Borsik; Cake List <cake@lists.bufferbloat.net>; codel@lists.bufferbloat.net; bloat <bloat@lists.bufferbloat.net>; Jeremy Austin via Rpm <rpm@lists.bufferbloat.net>
>> >> > Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
>> >> >
>> >> >
>> >> > Tom -
>> >> >
>> >> > An architecture-independent network framework, independent of the OS kernel's peculiarities, seems within reach (though a fair bit of work), and I think it would be a GOOD THING indeed. IMHO the Linux networking stack in the kernel is a horrific mess, and it doesn't have to be.
>> >> >
>> >> > The reason it doesn't have to be is that there should be no reason it cannot run in ring3/userland, just like DPDK. And it should be built using "real-time" userland programming techniques (avoiding the generic Linux scheduler). The ONLY reason for involving the scheduler would be because there aren't enough cores. Linux was designed to be a uniprocessor Unix, and that just is no longer true at all. With hyperthreading, too, one need never abandon a processor's context in userspace to run some "userland" application.
>> >> >
>> >> > This would rip a huge amount of kernel code out of the kernel (at least 50%, and probably more). The security issues of all those 3rd-party network drivers would go away.
>> >> >
>> >> > And the performance would be much higher for networking. (Running in ring 3, especially if you don't do system calls, carries no performance penalty, and interprocessor communication using shared memory is much lower latency than Linux IPC or mutexes.)
>> >> >
>> >> > I like the idea of a compilation-based network stack, at a slightly higher level than C. eBPF is NOT what I have in mind - it's an interpreter with high overhead. The language should support high-performance co-routining - shared memory, ideally. I don't think GC is a good thing. Rust might be a good starting point because its memory management is safe.
>> >> > To me, some of what's in the base of DPDK is good stuff. However, it isn't architecturally neutral.
>> >> >
>> >> > To me, the network stack should not be entangled with interrupt handling at all. "Polling" is far more performant under load. The only use for interrupts is when the network stack is completely idle. That would be, in userland, a "wait for interrupt" call (not a poll). Ideally, on recent Intel machines, a userspace version of MONITOR/MWAIT.
>> >> >
>> >> > Now I know that Linus and his crew are really NOT gonna like this. Linus is still thinking like MINIX, a uniprocessor time-sharing system with rich OS functions in the kernel and doing "file" reads and writes to communicate with the kernel state. But it is a much more modern way to think of real-time IO in a modern operating system. (Windows and macOS are also Unix-like, uniprocessor monolithic kernel designs.)
>> >> >
>> >> > So, if XDP2 got away from the Linux kernel, it could be great.
>> >> > BTW, io_uring, etc. are half-measures. They address getting away from interrupts toward polling, but they still make the mistake of keeping huge drivers in the kernel.
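
To make the "polling, not interrupts" point concrete, the inner loop I have in mind is nothing more than this (a DPDK-flavored sketch; port, queue and mempool setup omitted):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

/* Busy-poll receive loop pinned to one core: no interrupts while loaded,
 * just keep asking the NIC queue for whatever has arrived. */
static void rx_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);

        for (uint16_t i = 0; i < n; i++) {
            /* ... parse, rewrite headers in place, enqueue replies ... */
            rte_pktmbuf_free(pkts[i]);
        }
        /* When n stays 0 for a long stretch, this is where a "wait for
         * interrupt" or a userspace MONITOR/MWAIT belongs, instead of
         * burning the core. */
    }
}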
>> >>
>> >> DPDK already supports use of XDP as a way to do userspace networking. It is a good generic way to get packets in/out, but the dedicated userspace drivers allow for more access to hardware. The XDP abstraction gets in the way of little things like programming VLANs, etc.
>> >>
>> >> The tradeoff is that userspace networking works great for infrastructure - routers, switches, firewalls, etc. - but userspace networking for network stacks to applications is hard to do, and loses the isolation that the kernel provides.
>> >>
>> >> > > I think it is interesting as a concept. A project I am advising has been using DPDK very effectively to get rid of the huge path and locking delays in the current Linux network stack. XDP2 could be supported in a ring3 (user) address space, achieving a similar result.
>> >> > Hi David,
>> >> > The idea is you could write the code in XDP2 and it would be compiled to DPDK or eBPF and the compiler would handle the optimizations.
>> >> > >
>> >> > > But I don't think XDP2 is going that direction - so it may be stuck into the mess of kernel-space networking. Adding eBPF has only made this more of a mess, by the way (and adding a new "compiler" that needs to be verified as safe for the kernel).
>> >> > Think of XDP2 as the generalization of XDP to go beyond just the kernel. The idea is that the user writes their datapath code once and they compile it to run in whatever targets they have -- DPDK, P4, other programmable hardware, and yes XDP/eBPF. It's really not limited to kernel networking.
>> >> > As for the name XDP2: when we created XDP, eXpress DataPath, my vision was that it would be implementation agnostic. eBPF was the first instantiation for practicality, but now, ten years later, I think we can realize the initial vision.
>> >> > Tom
>> >>
>> >>
>> >> At this point, different network architectures get focused on different use cases.
>> >> The days of the one-size-fits-all networking of BSD Unix are dead.
>>
>