After extensive testing and resolving a few issues under heavy load, cpumap-pping has tagged release 1.0.0 RC1. It should be ready for the v1.3 release of LibreQoS.

What does it do? cpumap-pping merges two projects:

- xdp-cpumap-tc provides some of the heavy lifting behind LibreQoS. It maps IP addresses to Linux tc classifiers/qdiscs. I recently added IPv6 support and IPv4 subnet matching (e.g. match on 192.168.0.0/24), both of which are included in cpumap-pping. By mapping directly to filters, the cpumap can shift traffic-shaping work onto individual CPUs, bypassing the performance limits of the default Linux traffic shaper. Because the BPF programs run in kernel space (in a sandbox), they sustain very high performance.
- xdp-pping is an in-kernel BPF version of the excellent Pollere pping by Kathie Nichols. Previous versions of LibreQoS ran the original pping to gather TCP round-trip time data, providing an accurate Quality of Experience (QoE) metric to help optimize your ISP and monitor the benefits of the Cake shaper. pping is a great tool, but it tended to consume too much CPU time (on a single core) under heavy load. xdp-pping can sustain very high loads and still provide accurate RTT information.

Running the two side by side was troublesome and duplicated a lot of work: both programs individually parsed Ethernet headers (cpumap also parses VLAN headers; pping did not), TCP headers, extracted addresses, and so on. For LibreQoS, it just made sense to combine them.

cpumap-pping is a drop-in, fully compatible replacement for xdp-cpumap-tc in LibreQoS. Once it is in place, instead of running pping and reading its results, you periodically run xdp_pping and retrieve the current snapshot of performance data, already classified to match the queues that LibreQoS is setting up. The results are handed out in a convenient JSON format (a small sketch of consuming this output appears at the end of this post):

    [
        {"tc":"3:54", "avg": 35.39, "min": 15.95, "max": 60.03, "median": 33.67, "samples": 58},
    ]

Only a subset of TCP data is sampled. Rather than process every packet, the first 58 "ack" sequences are timed for each tc handle. This has two advantages:

- It greatly lowers system resource usage (once the sample buffer is full, the sampler doesn't do *any* additional work).
- It ensures fairness in sampling; otherwise there's a tendency to over-sample busy queues and possibly miss relatively quiet ones.

Performance:

- Testing in Hyper-V (4 VMs: one shaper, one iperf server, two iperf clients) shows that individual flow throughput maxes out around 4.2 gbit/s per core with full Cake shaping. Without the pping component, performance is around 4.21 gbit/s per core. In other words, the RTT sampling adds negligible overhead.
- Real-world testing on an under-powered 10-CPU VM (in Proxmox, on an older server) never exceeds 40% CPU usage on any single core while shaping approximately 1.8 gbit/s of traffic, with approximately 1,200 IP address entries in the cpu map and 600 queues.
- Profiling shows that the BPF programs (one on XDP ingress, one on TC egress) consume a maximum of 4,000 nanoseconds in the current release. That adds a ceiling of 0.004 ms to customer ping times.

Robert has been working hard on connecting this to the graphing system. You can see some great examples of progress in this (now closed) issue.
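If you want to experiment with the JSON output yourself, here is a minimal sketch of a poller that runs xdp_pping periodically and prints per-queue RTT summaries. The binary name on PATH, the polling interval, and the assumption that the RTT fields are in milliseconds are mine, not taken from the cpumap-pping source; treat this as illustrative rather than as LibreQoS's actual integration code.

```python
#!/usr/bin/env python3
"""Minimal sketch: poll xdp_pping and summarize per-queue RTT.

Assumptions (not from the cpumap-pping source): the tool is on PATH as
"xdp_pping", it prints the JSON array shown above to stdout, and the RTT
fields are milliseconds. Adjust to match your installation.
"""
import json
import re
import subprocess
import time

POLL_SECONDS = 10  # hypothetical polling interval


def read_snapshot() -> list[dict]:
    """Run xdp_pping once and return the parsed list of per-handle stats."""
    raw = subprocess.run(
        ["xdp_pping"], capture_output=True, text=True, check=True
    ).stdout
    # The sample output above ends with a trailing comma, which strict JSON
    # parsers reject; strip any trailing commas defensively before parsing.
    cleaned = re.sub(r",\s*([\]}])", r"\1", raw)
    return json.loads(cleaned)


def main() -> None:
    while True:
        for entry in read_snapshot():
            # Each entry maps a tc handle (e.g. "3:54") to RTT statistics
            # gathered from the sampled TCP acks on that queue.
            print(
                f"queue {entry['tc']}: median {entry['median']} ms "
                f"(min {entry['min']}, max {entry['max']}, "
                f"samples {entry['samples']})"
            )
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```

The same loop is the natural place to push the per-handle numbers into whatever graphing or alerting system you already run against your LibreQoS queues.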