After extensive testing and resolving a few issues under heavy load, cpumap-pping has tagged release 1.0.0 RC1. It should be ready for the v1.3 release of LibreQoS.

What does it do? cpumap-pping merges two projects:

- xdp-cpumap-tc provides some of the heavy lifting behind LibreQoS. It maps IP addresses to Linux tc classifiers/qdiscs. I recently added IPv6 support and IPv4 subnet matching (e.g. match on 192.168.0.0/24), both of which are included in cpumap-pping. By mapping directly to filters, the cpumap can shift traffic-shaping work onto individual CPUs, bypassing the performance limits of the default Linux traffic shaper. Because the BPF programs run in kernel space (in a sandbox), they sustain very high performance.
- xdp-pping is an in-kernel BPF version of the excellent Pollere pping by Kathie Nichols. Previous versions of LibreQoS ran the original pping to gather TCP round-trip time data, providing an accurate Quality of Experience (QoE) metric to help optimize your ISP and monitor the benefits of the Cake shaper. pping is a great tool, but it tended to consume too much CPU time (on a single core) under heavy load. xdp-pping can sustain very high loads and still provide accurate RTT information.

Running the two side by side was troublesome and duplicated a lot of work: both programs individually parsed Ethernet headers (cpumap also parses VLAN headers; pping did not), TCP headers, extracted addresses, and so on. For LibreQoS, it just made sense to combine them.

cpumap-pping is a drop-in, fully compatible replacement for xdp-cpumap-tc in LibreQoS. Once it is in place, instead of running pping and reading its results, you periodically run xdp_pping and retrieve the current snapshot of performance data, already classified to match the queues that LibreQoS is setting up. The results are handed out in a convenient JSON format (a small sketch of consuming this output appears at the end of this post):

    [
        {"tc":"3:54", "avg": 35.39, "min": 15.95, "max": 60.03, "median": 33.67, "samples": 58},
    ]

Only a subset of TCP data is sampled. Rather than process every packet, the first 58 "ack" sequences are timed for each tc handle. This has two advantages:

- It greatly lowers system resource usage (once the sample buffer is full, the sampler doesn't do *any* additional work).
- It ensures fairness in sampling; otherwise there's a tendency to over-sample busy queues and possibly miss relatively quiet ones.

Performance:

- Testing in Hyper-V (4 VMs: one shaper, one iperf server, two iperf clients) shows that individual flow throughput maxes out around 4.2 gbit/s per core with full Cake shaping. Without the pping component, performance is around 4.21 gbit/s per core. In other words, the RTT sampling adds negligible overhead.
- Real-world testing on an under-powered 10-CPU VM (in Proxmox, on an older server) never exceeds 40% CPU usage on any single core while shaping approximately 1.8 gbit/s of traffic, with approximately 1,200 IP address entries in the cpu map and 600 queues.
- Profiling shows that the BPF programs (one on XDP ingress, one on TC egress) consume a maximum of 4,000 nanoseconds in the current release. That adds a ceiling of 0.004 ms to customer ping times.

Robert has been working hard on connecting this to the graphing system. You can see some great examples of progress in this (now closed) issue.
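If you want to experiment with the JSON output yourself, here is a minimal sketch of a poller that runs xdp_pping periodically and prints per-queue RTT summaries. The binary name on PATH, the polling interval, and the assumption that the RTT fields are in milliseconds are mine, not taken from the cpumap-pping source; treat this as illustrative rather than as LibreQoS's actual integration code.

```python
#!/usr/bin/env python3
"""Minimal sketch: poll xdp_pping and summarize per-queue RTT.

Assumptions (not from the cpumap-pping source): the tool is on PATH as
"xdp_pping", it prints the JSON array shown above to stdout, and the RTT
fields are milliseconds. Adjust to match your installation.
"""
import json
import re
import subprocess
import time

POLL_SECONDS = 10  # hypothetical polling interval


def read_snapshot() -> list[dict]:
    """Run xdp_pping once and return the parsed list of per-handle stats."""
    raw = subprocess.run(
        ["xdp_pping"], capture_output=True, text=True, check=True
    ).stdout
    # The sample output above ends with a trailing comma, which strict JSON
    # parsers reject; strip any trailing commas defensively before parsing.
    cleaned = re.sub(r",\s*([\]}])", r"\1", raw)
    return json.loads(cleaned)


def main() -> None:
    while True:
        for entry in read_snapshot():
            # Each entry maps a tc handle (e.g. "3:54") to RTT statistics
            # gathered from the sampled TCP acks on that queue.
            print(
                f"queue {entry['tc']}: median {entry['median']} ms "
                f"(min {entry['min']}, max {entry['max']}, "
                f"samples {entry['samples']})"
            )
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```

The same loop is the natural place to push the per-handle numbers into whatever graphing or alerting system you already run against your LibreQoS queues.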