I'd probably go with an HTB queue per target IP group, and not attach a queue discipline to it - with only a ceiling set at the top. That'll do truly minimal shaping, and you can still use cpumap-pping to get the data you want. (The current branch I'm testing/working on also reports the local IP address, which I'm finding pretty helpful.) Otherwise, you're going to make building both tools part of the setup process* and still have to parse IP pairs to get results. Hopefully there's a decent Python LPM trie out there (to handle subnets and IPv6) to make that easier.
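For what it's worth, pytricia looks like a reasonable candidate for the LPM part. A rough sketch of mapping one address from an epping IP pair back to a subnet/circuit (the subnets and names here are made up, and I'm going from pytricia's documented API, so treat this as a starting point rather than tested code):

    import pytricia

    # One trie per address family; keys are subnets, values are whatever
    # you want to report against (circuit/AP name, TC handle, ...).
    v4 = pytricia.PyTricia(32)
    v6 = pytricia.PyTricia(128)
    v4["100.64.1.0/24"] = "AP-Site-1"     # example subnet -> circuit
    v6["2001:db8:1::/48"] = "AP-Site-1"

    def circuit_for(ip: str):
        """Longest-prefix match one address from an epping IP pair."""
        trie = v6 if ":" in ip else v4
        return trie.get(ip)               # None if no covering subnet

    print(circuit_for("100.64.1.23"))     # -> "AP-Site-1"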
I'm (obviously!) going to respectfully disagree with Toke on this one. I didn't dive into cpumap-pping for fun; I tried *really hard* to work with the original epping/xdp-pping. It's a great tool, really fantastic work. It's also not really designed for the same purpose.

The original Pollere pping is wonderful, but it isn't going to scale: the way it ingests packets doesn't spread across multiple CPUs, and having a core pegged at 100% on a busy shaper box was degrading overall performance. epping solves the scalability issue wonderfully, and (rightly) remains focused on giving you a complete report of all of the data it accessed while it was running. If you want to run a monitoring session and see what's going on, it's a *fantastic* way to do it - serious props there. I benchmarked it at about 15 Gbit/s in single-stream testing, which is *really* impressive (no other BPF programs active, no shaping).

The first issue I ran into is that stacking XDP programs isn't a well-defined process. You can make it work, but it gets messy when both programs have setup/teardown routines. I kinda, sorta managed to get the two running at once, and it mostly worked. There *really* needs to be an easier way - one that doesn't run headlong into Ubuntu's lovely "you updated the kernel and tools, we didn't think you'd need bpftool so we didn't include it" issue, or into adjusting scripts until neither says "oops, there's already an XDP program here! Bye!". I know this is all pretty new, but the tooling hasn't caught up yet to make it a comfortable process. I'm pretty sure I spent more time trying to run both at once than it took to make a combined version that sort-of ran (I had a working version in an afternoon).

With the two literally concatenated (but compiled together), it worked - but there was a noticeable performance cost. That's where the orthogonal design choices hit: epping/xdp-pping samples everything (it can even go looking for DNS and ICMP!). A QoE box *really* needs to go out of its way to avoid adding any latency, otherwise you're self-defeating. A representative sample is all you need there - while for epping's target use, a really detailed sample is what you need. When faced with differing design goals like that, my response is always to make a tool that very efficiently does exactly what I need.

Combining the packet parsing** was the obvious low-hanging fruit. It is faster, but not by very much; mostly I really hate it when code repeats itself, and it seriously set off my OCD watching both programs find the IP header offset, determine the protocol (IPv4 vs IPv6), etc. Small performance win.

Bailing out as soon as we determine that we aren't looking at a TCP packet was a big performance win. You can achieve much the same by carefully setting up epping's "config", but there's not a lot of point in keeping the DNS/ICMP code when it's not needed. Still a performance win, and not needing to maintain a configuration (that would be the same every time) makes setup easier.

Running by default on TC (egress) rather than XDP is a big win, too - but only after cpumap-tc has shunted processing to the appropriate CPU. Processing is then divided between CPUs, and cache locality is more likely: the packet is already in the local core's cache when cpumap-pping reads it, and there's a decent chance it'll still be there (at least in L2) by the time it reaches the actual queue discipline.

Changing the reporting mechanism was a really big win, both for performance and for aligning the tool with what's needed:

* Since xdp-cpumap has already done the work to determine that a flow belongs in TC handle X:Y - and mapping RTT performance to customer/circuit is *exactly* what we're trying to do - it just makes sense to take that value and use it as the key for the results.
* Since we don't care about every packet - we want a periodic, representative sample - we can use an efficient per-TC-handle circular buffer to store results.
* In turn, I realized we could just *sample* rather than continually churn the circular buffer. Each flow's buffer has a capacity, and the monitor bails out once a flow's buffer is full of RTT results. Really big performance win: "return" is a really fast call. :-) (The buffers are reset when read.) There's a quick sketch of the idea after this list.
* Perfmaps are great, but I didn't want to require a daemon running just to consume the perfmap output and re-emit it in a LibreQoS-friendly format, when a much simpler mechanism gets the same result - without another program sitting there handling the mmap'd performance data all the time.
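To make the sampling point concrete, here's a tiny Python model of the behaviour - just an illustration of the logic, not the actual BPF code (which keeps these buffers in a map keyed by TC handle); the capacity and handle values below are made up:

    from collections import defaultdict

    CAPACITY = 4  # per-flow sample capacity (illustrative)

    # tc_handle -> bounded list of RTT samples; we stop adding once full
    samples = defaultdict(list)

    def record_rtt(tc_handle: str, rtt_ms: float) -> None:
        buf = samples[tc_handle]
        if len(buf) >= CAPACITY:
            return                   # buffer full: bail out early
        buf.append(rtt_ms)

    def read_and_reset(tc_handle: str):
        """The reader drains the buffer; sampling then starts over."""
        return samples.pop(tc_handle, [])

    record_rtt("1:3", 22.5)
    record_rtt("1:3", 24.1)
    print(read_and_reset("1:3"))     # -> [22.5, 24.1]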
So the result was really fast and does exactly what I need. It's not meant to be "better" than the original; for the original's purpose, it's not great. For rapidly building QoE metrics on a live shaper box, with absolutely minimal overhead and a focus on sipping from the firehose rather than trying to drink it all - it's about right. Philosophically, I've always favored tools that do exactly what I need.

Likewise, if someone would like to come up with a really good recipe that runs both programs separately rather than a combined one - that'd be awesome. If it can match the performance of cpumap-pping, I'll happily switch BracketQoS to use it. You're obviously welcome to any of the code; if it can help the original projects, that's wonderful. Right now, I don't have the time to come up with a better way of layering XDP/TC programs!

* - I keep wondering if I shouldn't roll some .deb packages and a configurator to make setup easier!

** - There *really* should be a standard flow dissector. The Linux traffic shaper's dissector can handle VLAN tags and an MPLS header. xdp-cpumap-tc handles VLANs with aplomb but doesn't touch MPLS. epping calls out to the xdp-project's dissector, which appears to handle VLANs and also doesn't touch MPLS.

Thanks,
Herbert

On Tue, Nov 8, 2022 at 8:23 AM Toke Høiland-Jørgensen via LibreQoS <libreqos@lists.bufferbloat.net> wrote:

> Robert Chacón via LibreQoS writes:
>
> > I was hoping to add a monitoring mode which could be used before
> > "turning on" LibreQoS, ideally before v1.3 release. This way operators
> > can really see what impact it's having on end-user and network latency.
> >
> > The simplest solution I can think of is to implement Monitoring Mode
> > using cpumap-pping as we already do - with plain HTB and leaf classes
> > with no CAKE qdisc applied, and with HTB and leaf class rates set to
> > impossibly high amounts (no plan enforcement). This would allow for
> > before/after comparisons of Nodes (Access Points). My only concern
> > with this approach is that HTB, even with rates set impossibly high,
> > may not be truly transparent. It would be pretty easy to implement
> > though.
> >
> > Alternatively we could use ePPing, but I worry about throughput and
> > the possibility of latency tracking being slightly different from
> > cpumap-pping, which could limit the utility of a comparison. We'd have
> > to match IPs in a way that's a bit more involved here.
> >
> > Thoughts?
>
> Well, this kind of thing is exactly why I think concatenating the two
> programs (cpumap and pping) into a single BPF program was a mistake:
> those are two distinct pieces of functionality, and you want to be able
> to run them separately, as your "monitor mode" use case shows. The
> overhead of parsing the packet twice is trivial compared to everything
> else those apps are doing, so I don't think the gain is worth losing
> that flexibility.
>
> So I definitely think using the regular epping is the right thing to do
> here. Simon is looking into improving its reporting so it can be
> per-subnet using a user-supplied configuration file for the actual
> subnets, which should hopefully make this feasible. I'm sure he'll
> chime in here once he has something to test and/or with any questions
> that pop up in the process.
>
> Longer term, I'm hoping all of Herbert's other improvements to epping
> reporting/formatting can make it into upstream epping, so LibreQoS can
> just use that for everything :)
>
> -Toke
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos