I'd probably go with an HTB queue per target IP group, and not attach a
queueing discipline to it - just a ceiling set at the top. That'll do truly minimal
shaping, and you can still use cpumap-pping to get the data you want.
(The current branch I'm testing/working on also reports the local IP
address, which I'm finding pretty helpful). Otherwise, you're going to
make building both tools part of the setup process* and still have
to parse IP pairs for results. Hopefully, there's a decent Python
LPM Trie out there (to handle subnets and IPv6) to make that
easier.
I'm (obviously!) going to respectfully disagree with Toke on this one.
I didn't dive into cpumap-pping for fun; I tried *really hard* to work
with the original epping/xdp-pping. It's a great tool, really fantastic work.
It's also not really designed for the same purpose.
The original Pollere pping is wonderful, but it isn't going to scale - the
way it ingests packets doesn't spread across multiple CPUs, and
having a core pegged at 100% on a busy shaper box was
degrading overall performance. epping solves the scalability
issue wonderfully, and (rightly) remains focused on giving you
a complete report of all of the traffic it saw while it was
running. If you want to run a monitoring session and see what's
going on, it's a *fantastic* way to do it - serious props there. I
benchmarked it at about 15 Gbit/s on single-stream testing,
which is *really* impressive (no other BPF programs active,
no shaping).
The first issue I ran into is that stacking XDP programs isn't
all that well defined a process. You can make it work, but
it gets messy when both programs have setup/teardown
routines. I kinda, sorta managed to get the two running at
once, and it mostly worked. There *really* needs to be an
easier way - one that doesn't run headlong into Ubuntu's lovely
"you updated the kernel and tools, but we didn't think you'd
need bpftool, so we didn't include it" issue, and that doesn't
mean adjusting scripts until neither program says "oops, there's
already an XDP program here! Bye!". I know that this is a pretty new thing, but the
tooling hasn't really caught up yet to make this a comfortable
process. I'm pretty sure I spent more time trying to run both
at once than it took to make a combined version that sort-of
ran. (I had a working version in an afternoon)
With the two literally concatenated (but compiled together),
it worked - but there was a noticeable performance cost. That's
where orthogonal design choices hit - epping/xdp-pping samples
everything (it can even go looking for DNS and ICMP!).
A QoE box *really* needs to go out of its way to avoid adding
any latency, otherwise you're self-defeating. A representative
sample is really all you need there - while for epping's purpose,
a really detailed capture is what you want. When faced with
differing design goals like that, my response is always to
make a tool that very efficiently does what I need.
Combining the packet parsing** was the obvious low-hanging
fruit. It's faster, but not by very much - still, I really hate
it when code repeats itself. It seriously set off my OCD
watching both programs find the IP header offset, determine the
protocol (IPv4 vs IPv6), etc. Small performance win.
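For a sense of what I mean by a shared parse step, here's a minimal sketch - the names are illustrative and it glosses over details (VLAN tags, IPv6 extension headers), so don't read it as the actual cpumap-pping code:

```c
/* Illustrative sketch only - not the actual cpumap-pping code.
 * The idea: walk Ethernet -> (IPv4|IPv6) once and hand back the L4
 * protocol and header offset, so nothing re-parses the same headers. */
#include <linux/types.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/in.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

struct parse_result {
    __u8  l4_proto;   /* IPPROTO_TCP, IPPROTO_UDP, ... */
    __u32 l4_offset;  /* byte offset of the L4 header  */
};

static __always_inline int parse_headers(void *data, void *data_end,
                                         struct parse_result *res)
{
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
        return -1;

    if (eth->h_proto == bpf_htons(ETH_P_IP)) {
        struct iphdr *iph = (void *)(eth + 1);

        if ((void *)(iph + 1) > data_end)
            return -1;
        res->l4_proto  = iph->protocol;
        res->l4_offset = sizeof(*eth) + iph->ihl * 4;
    } else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
        struct ipv6hdr *ip6h = (void *)(eth + 1);

        if ((void *)(ip6h + 1) > data_end)
            return -1;
        res->l4_proto  = ip6h->nexthdr;  /* extension headers ignored in this sketch */
        res->l4_offset = sizeof(*eth) + sizeof(*ip6h);
    } else {
        return -1;  /* not IP (and no VLAN handling in this sketch) */
    }
    return 0;
}
```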
Bailing out as soon as we determine that we aren't looking
at a TCP packet was a big performance win. You can achieve
the same by carefully setting up the "config" for epping,
but there's not a lot of point in keeping the DNS/ICMP code
when it's not needed. Still a performance win, and not
needing to maintain a configuration (that will be the same
each time) makes setup easier.
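The bail-out then sits right at the top of the hook. Again, this is just a sketch reusing the hypothetical parse_headers() above (section and program names made up), not the real program:

```c
/* Sketch of the early bail-out at the top of the TC hook.  Assumes the
 * includes and parse_headers() helper from the sketch above. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int pping_egress_sketch(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct parse_result res;

    if (parse_headers(data, data_end, &res) < 0 ||
        res.l4_proto != IPPROTO_TCP)
        return TC_ACT_OK;  /* not TCP: nothing to time, hand it straight back */

    /* ...TCP timestamp / RTT bookkeeping would go here... */
    return TC_ACT_OK;
}
```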
Running by default on TC (egress) rather than XDP
is a big win, too - but only after xdp-cpumap-tc has shunted
processing to the appropriate CPU. Now processing is
divided between CPUs, and cache locality is much more likely -
the packet we're reading is in the local
core's cache when cpumap-pping reads it, and there's
a decent chance it'll still be there (at least in L2) by the time
it gets to the actual queue discipline.
Changing the reporting mechanism was a really big win,
both for performance and for aligning the tool with what's
needed:
* Since xdp-cpumap has already done the work to determine
that a flow belongs in TC handle X:Y - and mapping RTT
performance to customer/circuit is *exactly* what we're
trying to do - it just makes sense to take that value and
use it as a key for the results.
* Since we don't care about every packet - rather, we want
a periodic, representative sample - we can use an efficient
per-TC-handle circular buffer to store results.
* In turn, I realized that we could just *sample* rather than
continually churning the circular buffer. So each flow's
buffer has a capacity, and the monitor bails out once a flow's
buffer is full of RTT results. Really big performance win -
"return" is a really fast call. :-) (The buffers are reset when
read; there's a sketch of the idea just after this list.)
* Perf maps are great, but I didn't want to require a separate
daemon to sit there consuming the perf buffer and re-emitting
results in a LibreQoS-friendly format, when a much simpler
mechanism gets the same result - without another program
handling the mmap'd performance data all the time.
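Roughly, the sampling idea looks like this - the map layout, names, and sizes are invented for illustration and differ from the real code:

```c
/* Sketch of the per-TC-handle sample buffer (names and sizes invented).
 * Once a buffer is full we stop recording for that handle until userspace
 * reads and resets it, so the hot path usually costs one lookup and a
 * cheap early return. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_SAMPLES 16

struct rtt_buffer {
    __u32 count;                  /* samples recorded since the last read */
    __u32 rtt_usec[MAX_SAMPLES];  /* RTT samples in microseconds          */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);           /* TC handle (major:minor packed)       */
    __type(value, struct rtt_buffer);
} rtt_samples SEC(".maps");

static __always_inline void record_rtt(__u32 tc_handle, __u32 rtt_usec)
{
    struct rtt_buffer *buf = bpf_map_lookup_elem(&rtt_samples, &tc_handle);
    __u32 idx;

    if (!buf)
        return;  /* entry creation omitted in this sketch */

    idx = buf->count;
    if (idx >= MAX_SAMPLES)
        return;  /* buffer full: bail out until userspace resets it */

    buf->rtt_usec[idx] = rtt_usec;
    buf->count = idx + 1;
    /* Userspace reads the map periodically, consumes the samples, and
     * zeroes count.  Because xdp-cpumap-tc pins a flow to one CPU, there's
     * no worry about concurrent writers in this sketch. */
}
```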
So the result is really fast and does exactly what I need.
It's not meant to be "better" than the original; for the original's
purpose, it's not great. For rapidly building QoE metrics on
a live shaper box, with absolutely minimal overhead and a
focus on sipping the firehose rather than trying to drink it
all - it's about right.
Philosophically, I've always favored tools that do exactly
what I need.
Likewise, if someone would like to come up with a really
good recipe that runs both rather than a combined
program - that'd be awesome. If it can match the
performance of cpumap-pping, I'll happily switch
BracketQoS to use it.
You're obviously welcome to any of the code; if it can help
the original projects, that's wonderful. Right now, I don't
have the time to come up with a better way of layering
XDP/TC programs!
* - I keep wondering if I shouldn't roll some .deb packages
and a configurator to make setup easier!
** - there *really* should be a standard flow dissector. The
Linux traffic shaper's dissector can handle VLAN tags and
an MPLS header. xdp-cpumap-tc handles VLANs with
aplomb and doesn't touch MPLS. epping calls out to the
xdp-project's dissector, which appears to handle
VLANs and also doesn't touch MPLS.
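To make "handles VLANs" concrete: it boils down to peeling any 802.1Q/802.1ad tags before trusting the EtherType. A sketch along these lines (illustrative only, not taken from any of the dissectors above; MPLS would need a similar walk over 4-byte labels until the bottom-of-stack bit is set):

```c
/* Illustrative only: skip up to two stacked VLAN tags and return the real
 * L3 EtherType, advancing *offset past the tags. */
#include <linux/types.h>
#include <linux/if_ether.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

struct vlan_hdr_sketch {
    __be16 h_vlan_TCI;
    __be16 h_vlan_encapsulated_proto;
};

static __always_inline __u16 skip_vlan_tags(void *data, void *data_end,
                                            __u16 eth_proto, __u32 *offset)
{
    for (int i = 0; i < 2; i++) {  /* allow up to two stacked tags */
        struct vlan_hdr_sketch *vh;

        if (eth_proto != bpf_htons(ETH_P_8021Q) &&
            eth_proto != bpf_htons(ETH_P_8021AD))
            break;
        vh = data + *offset;
        if ((void *)(vh + 1) > data_end)
            return 0;  /* truncated tag: treat as unparseable */
        eth_proto = vh->h_vlan_encapsulated_proto;
        *offset  += sizeof(*vh);
    }
    return eth_proto;  /* the EtherType of the actual L3 payload */
}
```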
Thanks,
Herbert