From mboxrd@z Thu Jan 1 00:00:00 1970
From: Tom Herbert
X-MailFrom: tom@herbertland.com
Date: Mon, 15 Sep 2025 11:35:23 -0700
To: Frantisek Borsik
Cc: stephen@networkplumber.org, BeckW--- via Bloat, BeckW@telekom.de, dpreed@deepplum.com, cake@lists.bufferbloat.net, codel@lists.bufferbloat.net, rpm@lists.bufferbloat.net
Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
References: <1757449551.421420786@apps.rackspace.com> <1757795591.523513612@apps.rackspace.com> <20250915081637.2cd0d07c@hermes.local>
List-Id: General list for discussing Bufferbloat

On Mon, Sep 15, 2025 at 11:07 AM Frantisek Borsik wrote:
>
> "There were a few NICs that offloaded eBPF but they never really went mainstream."
>
> And even then, they were doing only 40 Gbps, like https://netronome.com, and didn't even support full eBPF...
>
> "They only support a pretty small subset of eBPF (in particular they don't support the LPM map type, which was our biggest performance pain point), and have a pretty cool user-replaceable firmware system. They also don't have the higher speeds - above 40 Gbps - where the offloading would be most useful."

Yeah, the attempts at offloading eBPF were doomed to fail. It's a restricted model, lacks parallelism, doesn't support inline accelerators, and requires the eBPF VM, all of which make it a non-starter. DPDK would fail as well. The kernel/host environment and hardware environments are quite different. If we try to force the hardware to look like the host to make eBPF or DPDK portable, then we'll lose the performance advantages of running in the hardware. We need a model that allows the software to adapt to the hardware, not the other way around (of course, in a perfect world we'd do software/hardware codesign from the get-go).

>
> Btw, Tom will be at FLOSS Weekly tomorrow (Tuesday), 12:20 EDT / 11:20 CDT / 10:20 MDT / 9:20 PDT

Can't wait!

>
> https://www.youtube.com/live/OBW5twvmHOI
>
> All the best,
>
> Frank
>
> Frantisek (Frank) Borsik
>
> In loving memory of Dave Täht: 1965-2025
> https://libreqos.io/2025/04/01/in-loving-memory-of-dave/
>
> https://www.linkedin.com/in/frantisekborsik
> Signal, Telegram, WhatsApp: +421919416714
> iMessage, mobile: +420775230885
> Skype: casioa5302ca
> frantisek.borsik@gmail.com
>
> On Mon, Sep 15, 2025 at 5:16 PM Stephen Hemminger wrote:
>>
>> On Mon, 15 Sep 2025 08:39:48 +0000
>> BeckW--- via Bloat wrote:
>>
>> > Programming networking hardware is a bit like programming 8-bit computers in the 1980s: the hardware is often too limited and varied to support useful abstractions. This is also true for CPU-based networking once you get into the >10 Gbps realm, where caching and pipelining architectures become relevant. Writing a network protocol compiler that produces efficient code for different NICs and different CPUs is a daunting task. And unlike with 8-bit computers, there are no simple metrics ('you need at least 32 KB of RAM to run this code' vs 'this NIC supports 4k queues with PIE, Codel', 'this CPU has 20 MB of Intel SmartCache').
>>
>> The Linux kernel still lacks an easy way to set up many features in SmartNICs. DPDK has rte_flow, which allows direct
>> access to hardware flow processing. But DPDK lacks any reasonable form of shaper control.
>>
>> > eBPF is very close to what was described in this 1995 exokernel paper (https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf). The idea of the exokernel was to have easily loadable, verified code in the kernel -- e.g. the security-critical task of assigning a packet to a session of a user -- and leave the rest of the protocol -- e.g. TCP retransmissions -- to user space. AFAIK few people use eBPF like this, but it should be possible.
>> >
>> > eBPF manages the abstraction part well, but sacrifices a lot of performance -- e.g. it lacks the aggressive batching that VPP / fd.io does. With DPDK, you often find out that your NIC's hardware or driver doesn't support the function that you hoped to use, and you end up optimizing for a particular piece of hardware. Even if driver and hardware support a function, it may very well be that hardware resources are too limited for your particular use case. The abstraction is there, but your code is still hardware specific.
>>
>> There were a few NICs that offloaded eBPF but they never really went mainstream.
>>
>> > -----Original Message-----
>> > From: David P. Reed
>> > Sent: Saturday, 13 September 2025 22:33
>> > To: Tom Herbert
>> > Cc: Frantisek Borsik; Cake List; codel@lists.bufferbloat.net; bloat; Jeremy Austin via Rpm
>> > Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
>> >
>> >
>> > Tom -
>> >
>> > An architecture-independent network framework independent of the OS kernel's peculiarities seems within reach (though a fair bit of work), and I think it would be a GOOD THING indeed. IMHO the Linux networking stack in the kernel is a horrific mess, and it doesn't have to be.
>> >
>> > The reason it doesn't have to be is that there should be no reason it cannot run in ring3/userland, just like DPDK. And it should be built using "real-time" userland programming techniques (avoiding the generic Linux scheduler). The ONLY reason for involving the scheduler would be because there aren't enough cores. Linux was designed to be a uniprocessor Unix, and that just is no longer true at all. With hyperthreading, too, one need never abandon a processor's context in userspace to run some "userland" application.
>> >
>> > This would rip a huge amount of kernel code out of the kernel (at least 50%, and probably more).
>> > The security issues of all those 3rd-party network drivers would go away.
>> >
>> > And the performance would be much higher for networking (running in ring 3, especially if you don't do system calls, is no performance penalty, and interprocessor communication using shared memory is much lower latency than Linux IPC or mutexes).
>> >
>> > I like the idea of a compilation-based network stack, at a slightly higher level than C. eBPF is NOT what I have in mind - it's an interpreter with high overhead. The language should support high-performance co-routining - shared memory, ideally. I don't think GC is a good thing. Rust might be a good starting point because its memory management is safe.
>> > To me, some of what the base of DPDK is like is good stuff. However, it isn't architecturally neutral.
>> >
>> > To me, the network stack should not be entangled with interrupt handling at all. "Polling" is far more performant under load. The only use for interrupts is when the network stack is completely idle. That would be, in userland, a "wait for interrupt" call (not a poll). (Ideally, on recent Intel machines, a userspace version of MONITOR/MWAIT.)
>> >
>> > Now I know that Linus and his crew are really NOT gonna like this. Linus is still thinking like MINIX, a uniprocessor time-sharing system with rich OS functions in the kernel and doing "file" reads and writes to communicate with the kernel state. But it is a much more modern way to think of real-time IO in a modern operating system. (Windows and macOS are also Unix-like, uniprocessor monolithic kernel designs.)
>> >
>> > So, if XDP2 got away from the Linux kernel, it could be great.
>> > BTW, io_uring, etc. are half-measures. They address getting away from interrupts toward polling, but they still make the mistake of keeping huge drivers in the kernel.
>>
>> DPDK already supports use of XDP as a way to do userspace networking.
>> It is a good generic way to get packets in/out, but the dedicated userspace drivers allow
>> for more access to hardware. The XDP abstraction gets in the way of little things like programming
>> VLANs, etc.
>>
>> The tradeoff is that userspace networking works great for infrastructure (routers, switches, firewalls, etc.),
>> but userspace networking for network stacks serving applications is hard to do, and loses the isolation
>> that the kernel provides.
>>
>> > > I think it is interesting as a concept. A project I am advising has been using DPDK very effectively to get rid of the huge path and locking delays in the current Linux network stack. XDP2 could be supported in a ring3 (user) address space, achieving a similar result.
>> > Hi David,
>> > The idea is you could write the code in XDP2 and it would be compiled to DPDK or eBPF and the compiler would handle the optimizations.
>> > >
>> > > But I don't think XDP2 is going in that direction - so it may be stuck in the mess of kernel space networking. Adding eBPF has only made this more of a mess, by the way (and adding a new "compiler" that needs to be verified as safe for the kernel).
>> > Think of XDP2 as the generalization of XDP to go beyond just the kernel. The idea is that the user writes their datapath code once and compiles it to run on whatever targets they have -- DPDK, P4, other programmable hardware, and yes, XDP/eBPF. It's really not limited to kernel networking.
>> > As for the name XDP2: when we created XDP, eXpress DataPath, my vision was that it would be implementation agnostic. eBPF was the first instantiation for practicality, but now, ten years later, I think we can realize the initial vision.
>> > Tom
>>
>> At this point, different network architectures are focused on different use cases.
>> The days of the one-size-fits-all networking of BSD Unix are over.