Date: Mon, 15 Sep 2025 08:16:37 -0700
From: Stephen Hemminger
To: BeckW--- via Bloat
Cc: BeckW@telekom.de
Message-ID: <20250915081637.2cd0d07c@hermes.local>
References: <1757449551.421420786@apps.rackspace.com> <1757795591.523513612@apps.rackspace.com>
Subject: [Codel] Re: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
List-Id: CoDel AQM discussions

On Mon, 15 Sep 2025 08:39:48 +0000
BeckW--- via Bloat wrote:

> Programming networking hardware is a bit like programming 8-bit
> computers in the 1980s: the hardware is often too limited and varied to
> support useful abstractions. This is also true for CPU-based networking
> once you get into the >10 Gbps realm, when caching and pipelining
> architectures become relevant. Writing a network protocol compiler that
> produces efficient code for different NICs and different CPUs is a
> daunting task.
> And unlike with 8-bit computers, there are no simple metrics ('you need
> at least 32 KB of RAM to run this code' vs 'this NIC supports 4k queues
> with PIE, CoDel', 'this CPU has 20 MB of Intel SmartCache').

The Linux kernel still lacks an easy way to set up many features in
SmartNICs. DPDK has rte_flow, which allows direct access to hardware flow
processing, but DPDK lacks any reasonable form of shaper control.

> eBPF is very close to what was described in this 1995 exokernel paper
> (https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf).
> The idea of the exokernel was to have easily loadable, verified code in
> the kernel -- e.g. the security-critical task of assigning a packet to a
> user's session -- and leave the rest of the protocol -- e.g. TCP
> retransmissions -- to user space. AFAIK few people use eBPF like this,
> but it should be possible.
>
> eBPF manages the abstraction part well, but sacrifices a lot of
> performance -- e.g. it lacks the aggressive batching that VPP / fd.io
> does. With DPDK, you often find out that your NIC's hardware or driver
> doesn't support the function you hoped to use, and you end up optimizing
> for particular hardware. Even if the driver and hardware support a
> function, it may very well be that hardware resources are too limited
> for your particular use case. The abstraction is there, but your code is
> still hardware-specific.

There were a few NICs that offloaded eBPF, but they never really went
mainstream.

> -----Original Message-----
> From: David P. Reed
> Sent: Saturday, 13 September 2025 22:33
> To: Tom Herbert
> Cc: Frantisek Borsik; Cake List; codel@lists.bufferbloat.net; bloat; Jeremy Austin via Rpm
> Subject: [Bloat] Re: [Cake] Re: XDP2 is here - from one and only Tom Herbert (almost to the date, 10 years after XDP was released)
>
> Tom -
>
> An architecture-independent network framework, independent of the OS
> kernel's peculiarities, seems within reach (though a fair bit of work),
> and I think it would be a GOOD THING indeed. IMHO the Linux networking
> stack in the kernel is a horrific mess, and it doesn't have to be.
>
> The reason it doesn't have to be is that there should be no reason it
> cannot run in ring3/userland, just like DPDK. And it should be built
> using "real-time" userland programming techniques (avoiding the generic
> Linux scheduler). The ONLY reason for involving the scheduler would be
> because there aren't enough cores. Linux was designed to be a
> uniprocessor Unix, and that just is no longer true at all. With
> hyperthreading, too, one need never abandon a processor's context in
> userspace to run some "userland" application.
>
> This would rip a huge amount of kernel code out of the kernel (at least
> 50%, and probably more). The security issues of all those 3rd-party
> network drivers would go away.
>
> And the performance would be much higher for networking. (Running in
> ring 3, especially if you don't do system calls, is no performance
> penalty, and interprocessor communication using shared memory is much
> lower latency than Linux IPC or mutexes.)
>
> I like the idea of a compilation-based network stack, at a slightly
> higher level than C. eBPF is NOT what I have in mind - it's an
> interpreter with high overhead. The language should support
> high-performance co-routining - shared memory, ideally. I don't think GC
> is a good thing. Rust might be a good starting point because its memory
> management is safe.
>
> To me, some of what the base of DPDK is like is good stuff.
> However, it isn't architecturally neutral.
>
> To me, the network stack should not be entangled with interrupt handling
> at all. "Polling" is far more performant under load. The only use for
> interrupts is when the network stack is completely idle. That would be,
> in userland, a "wait for interrupt" call (not a poll). (Ideally, on
> recent Intel machines, a userspace version of MONITOR/MWAIT.)
>
> Now I know that Linus and his crew are really NOT gonna like this. Linus
> is still thinking like MINIX, a uniprocessor time-sharing system with
> rich OS functions in the kernel and doing "file" reads and writes to
> communicate with the kernel state. But it is a much more modern way to
> think of real-time IO in a modern operating system. (Windows and macOS
> are also Unix-like, uniprocessor monolithic kernel designs.)
>
> So, if XDP2 got away from the Linux kernel, it could be great.
>
> BTW, io_uring, etc. are half-measures. They address getting away from
> interrupts toward polling, but they still make the mistake of keeping
> huge drivers in the kernel.

DPDK already supports use of XDP as a way to do userspace networking.
It is a good generic way to get packets in and out, but the dedicated
userspace drivers allow more access to hardware. The XDP abstraction gets
in the way of little things like programming VLANs, etc.

The tradeoff is that userspace networking works great for infrastructure
(routers, switches, firewalls, etc.), but userspace networking for network
stacks that serve applications is hard to do and loses the isolation that
the kernel provides.

> > I think it is interesting as a concept. A project I am advising has
> > been using DPDK very effectively to get rid of the huge path and
> > locking delays in the current Linux network stack. XDP2 could be
> > supported in a ring3 (user) address space, achieving a similar result.
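
As an illustration of the poll-under-load, wait-only-when-idle loop
described in the quoted text above, here is a minimal Python sketch. It is
not DPDK or kernel code: queue.Queue stands in for a NIC receive ring, the
blocking get() stands in for a "wait for interrupt" call, and the names
poll_then_wait and idle_spins are made up for this example.

```python
import queue


def poll_then_wait(rx, handle, idle_spins=64, stop=None):
    """Drain rx by polling; block only after idle_spins empty polls in a row.

    rx         -- queue.Queue standing in for a receive ring (assumption)
    handle     -- callback invoked for every packet
    idle_spins -- empty polls tolerated before falling back to a blocking wait
    stop       -- sentinel object that ends the loop
    Returns how many times the loop had to block.
    """
    blocked = 0
    spins = 0
    while True:
        try:
            pkt = rx.get_nowait()  # poll: returns immediately, never sleeps
        except queue.Empty:
            spins += 1
            if spins < idle_spins:
                continue  # still "under load": keep polling
            blocked += 1
            pkt = rx.get()  # idle: stand-in for "wait for interrupt"
        spins = 0
        if pkt is stop:
            return blocked
        handle(pkt)
```

With packets already queued the loop never blocks; it only falls back to
the blocking wait after idle_spins consecutive empty polls.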
>
> Hi David,
>
> The idea is you could write the code in XDP2 and it would be compiled to
> DPDK or eBPF, and the compiler would handle the optimizations.
>
> > But I don't think XDP2 is going that direction - so it may be stuck in
> > the mess of kernel space networking. Adding eBPF only has made this
> > more of a mess, by the way (and adding a new "compiler" that needs to
> > be verified as safe for the kernel).
>
> Think of XDP2 as the generalization of XDP to go beyond just the kernel.
> The idea is that the user writes their datapath code once and they
> compile it to run in whatever targets they have -- DPDK, P4, other
> programmable hardware, and yes XDP/eBPF. It's really not limited to
> kernel networking.
>
> As for the name XDP2, when we created XDP, eXpress DataPath, my vision
> was that it would be implementation agnostic. eBPF was the first
> instantiation for practicality, but now ten years later I think we can
> realize the initial vision.
>
> Tom

At this point, different network architectures are focused on different
use cases. The days of the one-size-fits-all networking of BSD Unix are
dead.
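
The "write the datapath once, compile it for several targets" idea in the
quoted text can be illustrated with a toy. This is only a sketch under
heavy assumptions: it is not XDP2's description format, API, or compiler.
The table layout PARSE_GRAPH and the functions interpret, compile_to_source
and build_parser are invented here, and the two "targets" are simply a
run-time interpreter and generated Python source.

```python
import struct

# Hypothetical protocol description, invented for this sketch:
# node -> (offset of the "next protocol" field, struct format of that field,
#          header length in bytes, {field value: next node}); None is a leaf.
# The graph is assumed to be acyclic.
PARSE_GRAPH = {
    "ether": (12, "!H", 14, {0x0800: "ipv4"}),
    "ipv4": (9, "!B", 20, {6: "tcp", 17: "udp"}),  # assumes no IPv4 options
    "tcp": None,
    "udp": None,
}


def interpret(graph, pkt, start="ether"):
    """Target 1: walk the description at run time (no bounds checking)."""
    path, node, off = [start], start, 0
    while graph[node] is not None:
        field_off, fmt, hdr_len, nxt = graph[node]
        (value,) = struct.unpack_from(fmt, pkt, off + field_off)
        if value not in nxt:
            break  # unknown next protocol: stop parsing here
        node = nxt[value]
        off += hdr_len
        path.append(node)
    return path


def compile_to_source(graph, start="ether"):
    """Target 2: emit Python source specialised for this one graph."""
    lines = ["def parse(pkt):", f"    path = [{start!r}]"]

    def emit(node, off, indent):
        if graph[node] is None:
            return
        field_off, fmt, hdr_len, nxt = graph[node]
        pad = " " * indent
        lines.append(
            f"{pad}(v,) = struct.unpack_from({fmt!r}, pkt, {off + field_off})"
        )
        for i, (value, child) in enumerate(nxt.items()):
            lines.append(f"{pad}{'if' if i == 0 else 'elif'} v == {value}:")
            lines.append(f"{pad}    path.append({child!r})")
            emit(child, off + hdr_len, indent + 4)  # offsets become constants

    emit(start, 0, 4)
    lines.append("    return path")
    return "\n".join(lines)


def build_parser(graph):
    """Turn the generated source into a callable."""
    namespace = {"struct": struct}
    exec(compile_to_source(graph), namespace)
    return namespace["parse"]
```

Both backends consume the same description; the generated one has every
offset folded into a constant, which is the kind of target-specific
optimization a real compiler would be expected to take much further.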