After extensive testing, and resolving a few issues under heavy load, cpumap-pping has tagged release 1.0.0 RC1. It should be ready for the v1.3 release of LibreQoS.
cpumap-pping merges two projects:
xdp-cpumap-tc provides some of the heavy lifting behind LibreQoS. It maps IP addresses to Linux tc classifiers/qdiscs. I recently added IPv6 and IPv4 subnet matching (e.g. matching on 192.168.0.0/24), which is included in cpumap-pping. By mapping directly to filters, the cpumap can shift traffic shaping processing to individual CPUs, bypassing the performance limits of the default Linux traffic shaper. Because BPF programs run in kernel space (in a sandbox), it can sustain very high performance.
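To make the mapping concrete, here is a minimal sketch (not the actual cpumap-pping source; the map name, struct layout, and handle encoding are illustrative) of how an LPM-trie BPF map can map an IPv4 prefix to a tc handle and a CPU:

```c
// Minimal sketch: an LPM-trie BPF map keyed by IP prefix, mapping each matched
// subnet to a tc major:minor handle and the CPU that should shape that traffic.
// Names and fields are illustrative, not the actual cpumap-pping layout.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct ip_lpm_key {
    __u32 prefixlen;        // number of significant bits, e.g. 24 for a /24
    __u32 addr;             // IPv4 address in network byte order
};

struct tc_target {
    __u32 tc_handle;        // encoded tc major:minor, e.g. 0x00030054 for 3:54
    __u32 cpu;              // CPU this traffic is redirected to
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 65536);
    __type(key, struct ip_lpm_key);
    __type(value, struct tc_target);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} ip_to_tc SEC(".maps");

// Called after the Ethernet/VLAN/IP headers have been parsed.
static __always_inline struct tc_target *lookup_target(__u32 daddr)
{
    struct ip_lpm_key key = {
        .prefixlen = 32,    // longest-prefix match; /24 entries still match
        .addr = daddr,
    };
    return bpf_map_lookup_elem(&ip_to_tc, &key);
}
```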
xdp-pping is an in-kernel BPF version of the excellent Pollere pping by Kathie Nichols. Previous versions of LibreQoS ran the original pping to gather TCP round-trip time data, providing an accurate Quality of Experience (QoE) metric to help optimize your ISP and monitor the benefits of the Cake shaper. pping is a great tool, but it tended to consume too much CPU time (on a single core) under heavy load. xdp-pping can sustain very high loads and still provide accurate RTT information.
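For reference, here is a minimal sketch of the underlying pping idea (illustrative only, not the actual xdp-pping source): record the time at which a TCP timestamp value (TSval) is first seen in one direction, then compute the RTT when the reverse direction echoes it back as TSecr.

```c
// Minimal sketch of TSval/TSecr matching. Map and struct names are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct ts_key {
    __u32 saddr;            // flow identity for one direction
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u32 tsval;            // TCP timestamp value carried in that direction
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 16384);
    __type(key, struct ts_key);
    __type(value, __u64);   // bpf_ktime_get_ns() when the TSval was first seen
} ts_start SEC(".maps");

// Outbound packet: remember when this TSval left us.
static __always_inline void record_tsval(struct ts_key *key)
{
    __u64 now = bpf_ktime_get_ns();
    bpf_map_update_elem(&ts_start, key, &now, BPF_NOEXIST);
}

// Inbound packet: the caller builds the key from the reversed flow tuple and
// the packet's TSecr; if it matches a recorded TSval, the difference is the RTT.
static __always_inline __u64 rtt_from_tsecr(struct ts_key *echo_key)
{
    __u64 *start = bpf_map_lookup_elem(&ts_start, echo_key);
    if (!start)
        return 0;
    __u64 rtt_ns = bpf_ktime_get_ns() - *start;
    bpf_map_delete_elem(&ts_start, echo_key);
    return rtt_ns;
}
```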
Running the two programs separately was troublesome and duplicated a lot of work: both programs individually parsed Ethernet headers (cpumap also parses VLAN headers, pping did not), TCP headers, extracted addresses, etc. For LibreQoS, it just made sense to combine them.
cpumap-pping is a drop-in replacement (fully compatible) for xdp-cpumap-tc in LibreQoS. Once in place, instead of running pping and reading the results, you periodically run xdp_pping and retrieve the current snapshot of performance data, already classified to match the queues that LibreQoS is setting up. The results are handed out in a convenient JSON format:
```json
[
  {"tc":"3:54", "avg": 35.39, "min": 15.95, "max": 60.03, "median": 33.67, "samples": 58},
]
```
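As a usage sketch (LibreQoS does this for you; the binary path and polling interval here are assumptions), a collector only needs to invoke the snapshot tool on a timer and consume the JSON it prints:

```c
// Hypothetical polling wrapper around the xdp_pping snapshot tool.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        FILE *p = popen("./xdp_pping", "r");   // emits the JSON array shown above
        if (!p) {
            perror("popen");
            return 1;
        }
        char line[4096];
        while (fgets(line, sizeof line, p))
            fputs(line, stdout);               // forward the snapshot to a collector
        pclose(p);
        sleep(10);                             // assumed reporting interval
    }
}
```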
Only a subset of TCP data is sampled. Rather than process every packet, the first 58 "ack" sequences are timed for each tc handle. This has two advantages: even with the pping component, performance is around 4.21 gbit/s per-core (in other words, it's very fast), and the sampling adds only 0.004 ms to customer ping times.
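A minimal sketch of what that bounded per-handle sampling could look like on the kernel side (the map layout, sample cap handling, and reporting strategy here are assumptions, not the actual implementation):

```c
// Minimal sketch of bounded per-handle sampling: each tc handle keeps at most
// MAX_SAMPLES RTT readings per reporting period, so heavy flows cannot
// monopolize CPU time in the measurement path. Names are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_SAMPLES 58

struct rtt_bucket {
    __u32 count;                     // samples recorded so far this period
    __u32 rtt_us[MAX_SAMPLES];       // raw samples; userspace computes the
                                     // min/max/avg/median shown in the JSON
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, __u32);              // encoded tc handle (major:minor)
    __type(value, struct rtt_bucket);
} rtt_by_handle SEC(".maps");

static __always_inline void record_rtt(__u32 tc_handle, __u32 rtt_us)
{
    struct rtt_bucket *b = bpf_map_lookup_elem(&rtt_by_handle, &tc_handle);
    if (!b) {
        struct rtt_bucket init = { .count = 1, .rtt_us = { rtt_us } };
        bpf_map_update_elem(&rtt_by_handle, &tc_handle, &init, BPF_NOEXIST);
        return;
    }
    __u32 idx = b->count;
    if (idx >= MAX_SAMPLES)          // sampling cap: stop timing once full
        return;
    b->rtt_us[idx] = rtt_us;
    b->count = idx + 1;
}
```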