Date: Mon, 15 Sep 2025 19:16:24 -0400 (EDT)
From: "David P. Reed" <dpreed@deepplum.com>
To: "Frantisek Borsik" <frantisek.borsik@gmail.com>
Cc: "Tom Herbert", stephen@networkplumber.org, "BeckW--- via Bloat", beckw@telekom.de, cake@lists.bufferbloat.net, codel@lists.bufferbloat.net, rpm@lists.bufferbloat.net
Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)

On Monday, September 15, 2025 18:26, "Frantisek Borsik" <frantisek.borsik@gmail.com> said:

> Fresh from Tom's oven:
> https://medium.com/@tom_84912/programming-a-parser-in-xdp2-is-as-easy-as-pie-8f26c8b3e704

Nice enough. I've always wondered why no one did a table-driven packet parser. Surely Wireshark did this?
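
To be concrete about what I mean by "table-driven": something like the sketch below, where each protocol layer is a table entry rather than another branch of hand-written parsing code. Purely illustrative - the types and names are invented here, not XDP2's (or Wireshark's) actual API.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

struct proto_node;

struct proto_edge {                    /* "if next-protocol == key, go to node" */
    uint16_t key;
    const struct proto_node *node;
};

struct proto_node {
    uint16_t hdr_len;                  /* bytes occupied by this header */
    uint16_t key_off, key_len;         /* where the next-protocol field lives */
    const struct proto_edge *edges;    /* terminated by .node == NULL; NULL = leaf */
};

/* Walk the tables instead of hand-writing nested if/else per protocol.
 * Returns total header bytes parsed, or -1 on truncation / unknown protocol. */
static int parse(const uint8_t *pkt, size_t len, const struct proto_node *n)
{
    size_t off = 0;

    while (n) {
        if (off + n->hdr_len > len)
            return -1;                              /* truncated packet */
        if (!n->edges)
            return (int)(off + n->hdr_len);         /* leaf header: done */

        uint16_t key;
        if (n->key_len == 2) {
            memcpy(&key, pkt + off + n->key_off, 2);
            key = ntohs(key);
        } else {
            key = pkt[off + n->key_off];
        }
        off += n->hdr_len;

        const struct proto_node *next = NULL;
        for (const struct proto_edge *e = n->edges; e->node; e++) {
            if (e->key == key) {
                next = e->node;
                break;
            }
        }
        if (!next)
            return -1;                              /* no table entry for this protocol */
        n = next;
    }
    return (int)off;
}

An Ethernet node, for instance, would be { .hdr_len = 14, .key_off = 12, .key_len = 2 } with edges keyed by EtherType. Adding a protocol means adding a table entry, not touching the parse loop - which is presumably the point of doing it declaratively the way XDP2 does.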
However, the spaghetti mess in practice partially arises from skbuff (Linux) or mbuf (BSD) structuring. I can testify to this because I've had to add a low-level protocol on top of Ethernet in a version of the FreeBSD kernel, and that protocol had to (for performance reasons) modify headers "in place" to generate response packets without copying or reallocating (a toy sketch of that pattern is below).

This is an additional reason why Linux's network stack is a mess - skbuff management is generally not compatible with packet formats in NICs with acceleration features that parse or verify parts of the packets.

Admittedly, I've focused on minimizing end-to-end latency in protocols where the minimum speed is 10 Gb/sec Ethernet, and where the hardware functionality (including pipelining the packet processing in the NIC controller, possibly with FPGAs) is essential.

While header parsing is a chunk of the performance issue, specialized memory management is also.

Hoping to see what replaces skbuff in XDP2.
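
For anyone who hasn't had to do it, the "modify in place" pattern is roughly the toy below - not the actual FreeBSD code, just its shape: the reply is built by patching the frame that arrived, so nothing is copied or reallocated on the fast path.

#include <stdint.h>
#include <string.h>

struct ether_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;                 /* already in network byte order */
};

/* Turn a received request frame into a reply, in the same buffer. */
static void make_reply_in_place(uint8_t *frame, const uint8_t my_mac[6])
{
    struct ether_hdr *eth = (struct ether_hdr *)frame;

    memcpy(eth->dst, eth->src, 6);      /* reply goes back to the sender */
    memcpy(eth->src, my_mac, 6);
    /* ethertype and payload stay exactly where they are; the protocol's
     * own fields get patched the same way. No buffer is copied, split,
     * or reallocated - it goes straight back out the transmit path. */
}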
>
>
> All the best,
>
> Frank
>
> Frantisek (Frank) Borsik
>
> In loving memory of Dave Täht: 1965-2025
>
> https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
>
> https://www.linkedin.com/in/frantisekborsik
>
> Signal, Telegram, WhatsApp: +421919416714
>
> iMessage, mobile: +420775230885
>
> Skype: casioa5302ca
>
> frantisek.borsik@gmail.com
>
>
> On Mon, Sep 15, 2025 at 8:35 PM Tom Herbert wrote:
>
>> On Mon, Sep 15, 2025 at 11:07 AM Frantisek Borsik wrote:
>> >
>> > "There were a few NIC's that offloaded eBPF but they never really went mainstream."
>> >
>> > And even then, they were doing only 40 Gbps, like https://netronome.com and didn't even support full eBPF...
>> >
>> > "They only support a pretty small subset of eBPF (in particular they don't support the LPM map type, which was our biggest performance pain point), and have a pretty cool user-replaceable firmware system. They also don't have the higher speeds - above 40 Gbps - where the offloading would be most useful."
>>
>> Yeah, the attempts at offloading eBPF were doomed to fail. It's a restricted model, lacks parallelism, doesn't support inline accelerators, and requires the eBPF VM, which makes it a non-starter. DPDK would fail as well. The kernel/host environment and hardware environments are quite different. If we try to force the hardware to look like the host to make eBPF or DPDK portable then we'll lose the performance advantages of running in the hardware. We need a model that allows the software to adapt to HW, not the other way around (of course, in a perfect world we'd do software/hardware codesign from the get-go).
>>
>> >
>> > Btw, Tom will be at FLOSS Weekly tomorrow (Tuesday), 12:20 EDT / 11:20 CDT / 10:20 MDT / 9:20 PDT
>>
>> Can't wait!
>>
>> >
>> > https://www.youtube.com/live/OBW5twvmHOI
>> >
>> > All the best,
>> >
>> > Frank
>> >
>> > Frantisek (Frank) Borsik
>> >
>> > In loving memory of Dave Täht: 1965-2025
>> >
>> > https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
>> >
>> > https://www.linkedin.com/in/frantisekborsik
>> >
>> > Signal, Telegram, WhatsApp: +421919416714
>> >
>> > iMessage, mobile: +420775230885
>> >
>> > Skype: casioa5302ca
>> >
>> > frantisek.borsik@gmail.com
>> >
>> >
>> > On Mon, Sep 15, 2025 at 5:16 PM Stephen Hemminger <stephen@networkplumber.org> wrote:
>> >>
>> >> On Mon, 15 Sep 2025 08:39:48 +0000 BeckW--- via Bloat <bloat@lists.bufferbloat.net> wrote:
>> >>
>> >> > Programming networking hardware is a bit like programming 8-bit computers in the 1980s: the hardware is often too limited and varied to support useful abstractions. This is also true for CPU-based networking once you get into the >10 Gbps realm, when caching and pipelining architectures become relevant. Writing a network protocol compiler that produces efficient code for different NICs and different CPUs is a daunting task. And unlike with 8-bit computers, there are no simple metrics ('you need at least 32 kB RAM to run this code' vs 'this NIC supports 4k queues with PIE, Codel', 'this CPU has 20 Mbyte of Intel SmartCache').
>> >>
>> >> The Linux kernel still lacks an easy way to set up many features in SmartNICs. DPDK has rte_flow, which allows direct access to hardware flow processing. But DPDK lacks any reasonable form of shaper control.
>> >>
>> >> > Ebpf is very close to what was described in this 1995 exokernel paper (https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf). The idea of the exokernel was to have easily loadable, verified code in the kernel -- e.g. the security-critical task of assigning a packet to a session of a user -- and leave the rest of the protocol -- e.g. TCP retransmissions -- to user space. AFAIK few people use ebpf like this, but it should be possible.
>> >> >
>> >> > Ebpf manages the abstraction part well, but sacrifices a lot of performance -- e.g. lack of aggressive batching like vpp / fd.io does. With DPDK, you often find out that your NIC's hardware or driver doesn't support the function that you hoped to use, and you end up optimizing for a particular hardware. Even if driver and hardware support a functionality, it may very well be that hardware resources are too limited for your particular use case. The abstraction is there, but your code is still hardware specific.
>> >>
>> >> There were a few NIC's that offloaded eBPF but they never really went mainstream.
>> >>
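
A side note on the exokernel comparison above: the split BeckW describes - verified in-kernel code does nothing but assign a packet to a session, and userspace does the rest of the protocol - is roughly expressible today with XDP plus AF_XDP. A sketch; the port number and map size are made up, and VLAN/IPv6/IP-option handling is omitted:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* One AF_XDP socket per RX queue; userspace owns the protocol. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks SEC(".maps");

SEC("xdp")
int classify(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *end  = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)(ip + 1);       /* assumes no IP options */
    if ((void *)(udp + 1) > end)
        return XDP_PASS;

    /* "Our" session: hand the frame to the userspace stack for this queue. */
    if (udp->dest == bpf_htons(7777))
        return bpf_redirect_map(&xsks, ctx->rx_queue_index, XDP_PASS);

    return XDP_PASS;            /* everything else: normal kernel stack */
}

char _license[] SEC("license") = "GPL";

Whether that counts as "using ebpf like an exokernel" is debatable, but it does keep the in-kernel piece small enough to verify.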
>> >> > -----Original Message-----
>> >> > From: David P. Reed <dpreed@deepplum.com>
>> >> > Sent: Saturday, September 13, 2025 22:33
>> >> > To: Tom Herbert
>> >> > Cc: Frantisek Borsik; Cake List <cake@lists.bufferbloat.net>; codel@lists.bufferbloat.net; bloat <bloat@lists.bufferbloat.net>; Jeremy Austin via Rpm <rpm@lists.bufferbloat.net>
>> >> > Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
>> >> >
>> >> >
>> >> > Tom -
>> >> >
>> >> > An architecture-independent network framework, independent of the OS kernel's peculiarities, seems within reach (though a fair bit of work), and I think it would be a GOOD THING indeed. IMHO the Linux networking stack in the kernel is a horrific mess, and it doesn't have to be.
>> >> >
>> >> > The reason it doesn't have to be is that there should be no reason it cannot run in ring3/userland, just like DPDK. And it should be built using "real-time" userland programming techniques (avoiding the generic Linux scheduler). The ONLY reason for involving the scheduler would be because there aren't enough cores. Linux was designed to be a uniprocessor Unix, and that just is no longer true at all. With hyperthreading, too, one need never abandon a processor's context in userspace to run some "userland" application.
>> >> >
>> >> > This would rip a huge amount of kernel code out of the kernel (at least 50%, and probably more). The security issues of all those 3rd-party network drivers would go away.
>> >> >
>> >> > And the performance would be much higher for networking. (Running in ring 3, especially if you don't do system calls, carries no performance penalty, and interprocessor communication using shared memory is much lower latency than Linux IPC or mutexes.)
>> >> >
>> >> > I like the idea of a compilation-based network stack, at a slightly higher level than C. eBPF is NOT what I have in mind - it's an interpreter with high overhead. The language should support high-performance co-routining - shared memory, ideally. I don't think GC is a good thing. Rust might be a good starting point because its memory management is safe.
>> >> > To me, some of what's in the base of DPDK is good stuff. However, it isn't architecturally neutral.
>> >> >
>> >> > To me, the network stack should not be entangled with interrupt handling at all. "Polling" is far more performant under load. The only use for interrupts is when the network stack is completely idle. That would be, in userland, a "wait for interrupt" call (not a poll). Ideally, on recent Intel machines, a userspace version of MONITOR/MWAIT.
>> >> >
>> >> > Now I know that Linus and his crew are really NOT gonna like this. Linus is still thinking like MINIX, a uniprocessor time-sharing system with rich OS functions in the kernel and doing "file" reads and writes to communicate with the kernel state. But it is a much more modern way to think of real-time IO in a modern operating system. (Windows and macOS are also Unix-like, uniprocessor monolithic kernel designs.)
>> >> >
>> >> > So, if XDP2 got away from the Linux kernel, it could be great.
>> >> > BTW, io_uring, etc. are half-measures. They address getting away from interrupts toward polling, but they still make the mistake of keeping huge drivers in the kernel.
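
To make the "polling, not interrupts" point concrete, the inner loop I have in mind is nothing more than this (a DPDK-flavored sketch; port, queue and mempool setup omitted):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

/* Busy-poll receive loop pinned to one core: no interrupts while loaded,
 * just keep asking the NIC queue for whatever has arrived. */
static void rx_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);

        for (uint16_t i = 0; i < n; i++) {
            /* ... parse, rewrite headers in place, enqueue replies ... */
            rte_pktmbuf_free(pkts[i]);
        }
        /* When n stays 0 for a long stretch, this is where a "wait for
         * interrupt" or a userspace MONITOR/MWAIT belongs, instead of
         * burning the core. */
    }
}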
>> >>
>> >> DPDK already supports use of XDP as a way to do userspace networking. It is a good generic way to get packets in/out, but the dedicated userspace drivers allow for more access to hardware. The XDP abstraction gets in the way of little things like programming VLANs, etc.
>> >>
>> >> The tradeoff is that userspace networking works great for infrastructure - routers, switches, firewalls, etc. - but userspace networking for network stacks to applications is hard to do, and loses the isolation that the kernel provides.
>> >>
>> >> > > I think it is interesting as a concept. A project I am advising has been using DPDK very effectively to get rid of the huge path and locking delays in the current Linux network stack. XDP2 could be supported in a ring3 (user) address space, achieving a similar result.
>> >> > Hi David,
>> >> > The idea is you could write the code in XDP2 and it would be compiled to DPDK or eBPF and the compiler would handle the optimizations.
>> >> > >
>> >> > > But I don't think XDP2 is going that direction - so it may be stuck into the mess of kernel-space networking. Adding eBPF has only made this more of a mess, by the way (and adding a new "compiler" that needs to be verified as safe for the kernel).
>> >> > Think of XDP2 as the generalization of XDP to go beyond just the kernel. The idea is that the user writes their datapath code once and they compile it to run in whatever targets they have -- DPDK, P4, other programmable hardware, and yes XDP/eBPF. It's really not limited to kernel networking.
>> >> > As for the name XDP2: when we created XDP, eXpress DataPath, my vision was that it would be implementation agnostic. eBPF was the first instantiation for practicality, but now, ten years later, I think we can realize the initial vision.
>> >> > Tom
>> >>
>> >>
>> >> At this point, different network architectures get focused on different use cases.
>> >> The days of the one-size-fits-all networking of BSD Unix are dead.
>>
>