After extensive testing, and resolving a few issues under heavy load, cpumap-pping has tagged release 1.0.0 RC1. It should be ready for the v1.3 release of LibreQoS.
cpumap-pping merges two projects:
xdp-cpumap-tc provides some of the heavy lifting behind LibreQoS. It maps IP addresses to Linux tc classifiers/qdiscs. I recently added IPv6 and IPv4 subnet matching (e.g. matching on 192.168.0.0/24), which is included in cpumap-pping. By mapping directly to filters, the cpumap can shift traffic shaping processing to individual CPUs, bypassing the performance limits of the default Linux traffic shaper. Because BPF programs run in kernel space (in a sandbox), it can sustain very high performance.
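To make the mapping concrete, here is a minimal sketch (not the actual cpumap-pping source; the map name, struct layout, and handle encoding are illustrative) of how an LPM-trie BPF map can map an IPv4 prefix to a tc handle and a CPU:

```c
// Minimal sketch: an LPM-trie BPF map keyed by IP prefix, mapping each matched
// subnet to a tc major:minor handle and the CPU that should shape that traffic.
// Names and fields are illustrative, not the actual cpumap-pping layout.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct ip_lpm_key {
    __u32 prefixlen;        // number of significant bits, e.g. 24 for a /24
    __u32 addr;             // IPv4 address in network byte order
};

struct tc_target {
    __u32 tc_handle;        // encoded tc major:minor, e.g. 0x00030054 for 3:54
    __u32 cpu;              // CPU this traffic is redirected to
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 65536);
    __type(key, struct ip_lpm_key);
    __type(value, struct tc_target);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} ip_to_tc SEC(".maps");

// Called after the Ethernet/VLAN/IP headers have been parsed.
static __always_inline struct tc_target *lookup_target(__u32 daddr)
{
    struct ip_lpm_key key = {
        .prefixlen = 32,    // longest-prefix match; /24 entries still match
        .addr = daddr,
    };
    return bpf_map_lookup_elem(&ip_to_tc, &key);
}
```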
xdp-pping is an in-kernel BPF version of the excellent Pollere pping by Kathie Nichols. Previous versions of LibreQoS ran the original pping to gather TCP round-trip time data, providing an accurate Quality of Experience (QoE) metric to help optimize your ISP and monitor the benefits of the Cake shaper. pping is a great tool, but it tended to consume too much CPU time (on a single core) under heavy load. xdp-pping can sustain very high loads and still provide accurate RTT information.
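For reference, here is a minimal sketch of the underlying pping idea (illustrative only, not the actual xdp-pping source): record the time at which a TCP timestamp value (TSval) is first seen in one direction, then compute the RTT when the reverse direction echoes it back as TSecr.

```c
// Minimal sketch of TSval/TSecr matching. Map and struct names are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct ts_key {
    __u32 saddr;            // flow identity for one direction
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u32 tsval;            // TCP timestamp value carried in that direction
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 16384);
    __type(key, struct ts_key);
    __type(value, __u64);   // bpf_ktime_get_ns() when the TSval was first seen
} ts_start SEC(".maps");

// Outbound packet: remember when this TSval left us.
static __always_inline void record_tsval(struct ts_key *key)
{
    __u64 now = bpf_ktime_get_ns();
    bpf_map_update_elem(&ts_start, key, &now, BPF_NOEXIST);
}

// Inbound packet: the caller builds the key from the reversed flow tuple and
// the packet's TSecr; if it matches a recorded TSval, the difference is the RTT.
static __always_inline __u64 rtt_from_tsecr(struct ts_key *echo_key)
{
    __u64 *start = bpf_map_lookup_elem(&ts_start, echo_key);
    if (!start)
        return 0;
    __u64 rtt_ns = bpf_ktime_get_ns() - *start;
    bpf_map_delete_elem(&ts_start, echo_key);
    return rtt_ns;
}
```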
Running the two programs separately was troublesome and duplicated a lot of work: both programs individually parsed Ethernet headers (cpumap also parses VLAN headers, pping did not), TCP headers, extracted addresses, etc. For LibreQoS, it just made sense to combine them.
cpumap-pping is a drop-in replacement (fully compatible) for xdp-cpumap-tc in LibreQoS. Once in place, instead of running pping and reading the results, you periodically run xdp_pping and retrieve the current snapshot of performance data, already classified to match the queues that LibreQoS is setting up. The results are handed out in a convenient JSON format:
```json
[
  {"tc":"3:54", "avg": 35.39, "min": 15.95, "max": 60.03, "median": 33.67, "samples": 58},
]
```
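As a usage sketch (LibreQoS does this for you; the binary path and polling interval here are assumptions), a collector only needs to invoke the snapshot tool on a timer and consume the JSON it prints:

```c
// Hypothetical polling wrapper around the xdp_pping snapshot tool.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        FILE *p = popen("./xdp_pping", "r");   // emits the JSON array shown above
        if (!p) {
            perror("popen");
            return 1;
        }
        char line[4096];
        while (fgets(line, sizeof line, p))
            fputs(line, stdout);               // forward the snapshot to a collector
        pclose(p);
        sleep(10);                             // assumed reporting interval
    }
}
```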
Only a subset of TCP data is sampled. Rather than process every packet, the first 58 "ack" sequences are timed for each tc handle. This has two advantages: even with the pping component, performance is around 4.21 gbit/s per-core (in other words, it's very fast), and the sampling adds only 0.004 ms to customer ping times.
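A minimal sketch of what that bounded per-handle sampling could look like on the kernel side (the map layout, sample cap handling, and reporting strategy here are assumptions, not the actual implementation):

```c
// Minimal sketch of bounded per-handle sampling: each tc handle keeps at most
// MAX_SAMPLES RTT readings per reporting period, so heavy flows cannot
// monopolize CPU time in the measurement path. Names are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_SAMPLES 58

struct rtt_bucket {
    __u32 count;                     // samples recorded so far this period
    __u32 rtt_us[MAX_SAMPLES];       // raw samples; userspace computes the
                                     // min/max/avg/median shown in the JSON
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, __u32);              // encoded tc handle (major:minor)
    __type(value, struct rtt_bucket);
} rtt_by_handle SEC(".maps");

static __always_inline void record_rtt(__u32 tc_handle, __u32 rtt_us)
{
    struct rtt_bucket *b = bpf_map_lookup_elem(&rtt_by_handle, &tc_handle);
    if (!b) {
        struct rtt_bucket init = { .count = 1, .rtt_us = { rtt_us } };
        bpf_map_update_elem(&rtt_by_handle, &tc_handle, &init, BPF_NOEXIST);
        return;
    }
    __u32 idx = b->count;
    if (idx >= MAX_SAMPLES)          // sampling cap: stop timing once full
        return;
    b->rtt_us[idx] = rtt_us;
    b->count = idx + 1;
}
```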