I'd probably go with an HTB queue per target IP group, and not attach a
queueing discipline to it - just a ceiling set at the top. That'll do truly minimal
shaping, and you can still use cpumap-pping to get the data you want.
(The current branch I'm testing/working on also reports the local IP
address, which I'm finding pretty helpful). Otherwise, you're going to
make building both tools part of the setup process* and still have
to parse IP pairs for results. Hopefully, there's a decent Python
LPM Trie out there (to handle subnets and IPv6) to make that
easier.
I'm (obviously!) going to respectfully disagree with Toke on this one.
I didn't dive into cpumap-pping for fun; I tried *really hard* to work
with the original epping/xdp-pping. It's a great tool, really fantastic work.
It's also not really designed for the same purpose.
The original Pollere pping is wonderful, but it isn't going to scale - the
way it ingests packets doesn't spread across multiple CPUs, and
having a core pegged at 100% on a busy shaper box was
degrading overall performance. epping solves the scalability
issue wonderfully, and (rightly) remains focused on giving you
a complete report of all of the traffic it saw while it was
running. If you want to run a monitoring session and see what's
going on, it's a *fantastic* way to do it - serious props there. I
benchmarked it at about 15 Gbit/s on single-stream testing,
which is *really* impressive (no other BPF programs active,
no shaping).
The first issue I ran into is that stacking XDP programs isn't
all that well defined a process. You can make it work, but
it gets messy when both programs have setup/teardown
routines. I kinda, sorta managed to get the two running at
once, and it mostly worked. There *really* needs to be an
easier way - one that doesn't run headlong into Ubuntu's lovely
"you updated the kernel and tools, but we didn't think you'd
need bpftool, so we didn't include it" issue, and that doesn't
mean adjusting scripts until neither program says "oops, there's
already an XDP program here! Bye!". I know that this is a pretty new thing, but the
tooling hasn't really caught up yet to make this a comfortable
process. I'm pretty sure I spent more time trying to run both
at once than it took to make a combined version that sort-of
ran. (I had a working version in an afternoon)
With the two literally concatenated (but compiled together),
it worked - but there was a noticeable performance cost. That's
where orthogonal design choices hit - epping/xdp-pping samples
everything (it can even go looking for DNS and ICMP!).
A QoE box *really* needs to go out of its way to avoid adding
any latency, otherwise you're self-defeating. A representative
sample is really all you need there - while for epping's purpose,
a really detailed capture is what you want. When faced with
differing design goals like that, my response is always to
make a tool that very efficiently does what I need.
Combining the packet parsing** was the obvious low-hanging
fruit. It's faster, but not by very much - still, I really hate
it when code repeats itself. It seriously set off my OCD
watching both programs find the IP header offset, determine the
protocol (IPv4 vs IPv6), etc. Small performance win.
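For a sense of what I mean by a shared parse step, here's a minimal sketch - the names are illustrative and it glosses over details (VLAN tags, IPv6 extension headers), so don't read it as the actual cpumap-pping code:

```c
/* Illustrative sketch only - not the actual cpumap-pping code.
 * The idea: walk Ethernet -> (IPv4|IPv6) once and hand back the L4
 * protocol and header offset, so nothing re-parses the same headers. */
#include <linux/types.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/in.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

struct parse_result {
    __u8  l4_proto;   /* IPPROTO_TCP, IPPROTO_UDP, ... */
    __u32 l4_offset;  /* byte offset of the L4 header  */
};

static __always_inline int parse_headers(void *data, void *data_end,
                                         struct parse_result *res)
{
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
        return -1;

    if (eth->h_proto == bpf_htons(ETH_P_IP)) {
        struct iphdr *iph = (void *)(eth + 1);

        if ((void *)(iph + 1) > data_end)
            return -1;
        res->l4_proto  = iph->protocol;
        res->l4_offset = sizeof(*eth) + iph->ihl * 4;
    } else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
        struct ipv6hdr *ip6h = (void *)(eth + 1);

        if ((void *)(ip6h + 1) > data_end)
            return -1;
        res->l4_proto  = ip6h->nexthdr;  /* extension headers ignored in this sketch */
        res->l4_offset = sizeof(*eth) + sizeof(*ip6h);
    } else {
        return -1;  /* not IP (and no VLAN handling in this sketch) */
    }
    return 0;
}
```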
Bailing out as soon as we determine that we aren't looking
at a TCP packet was a big performance win. You can achieve
the same by carefully setting up the "config" for epping,
but there's not a lot of point in keeping the DNS/ICMP code
when it's not needed. Still a performance win, and not
needing to maintain a configuration (that will be the same
each time) makes setup easier.
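The bail-out then sits right at the top of the hook. Again, this is just a sketch reusing the hypothetical parse_headers() above (section and program names made up), not the real program:

```c
/* Sketch of the early bail-out at the top of the TC hook.  Assumes the
 * includes and parse_headers() helper from the sketch above. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int pping_egress_sketch(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct parse_result res;

    if (parse_headers(data, data_end, &res) < 0 ||
        res.l4_proto != IPPROTO_TCP)
        return TC_ACT_OK;  /* not TCP: nothing to time, hand it straight back */

    /* ...TCP timestamp / RTT bookkeeping would go here... */
    return TC_ACT_OK;
}
```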
Running by default on TC (egress) rather than XDP
is a big win, too - but only after xdp-cpumap-tc has shunted
processing to the appropriate CPU. Now processing is
divided between CPUs, and cache locality is much more likely -
the packet we're reading is in the local
core's cache when cpumap-pping reads it, and there's
a decent chance it'll still be there (at least in L2) by the time
it gets to the actual queue discipline.
Changing the reporting mechanism was a really big win,
both for performance and for aligning the tool with what's
needed:
* Since xdp-cpumap has already done the work to determine
that a flow belongs in TC handle X:Y - and mapping RTT
performance to customer/circuit is *exactly* what we're
trying to do - it just makes sense to take that value and
use it as a key for the results.
* Since we don't care about every packet - rather, we want
a periodic, representative sample - we can use an efficient
per-TC-handle circular buffer to store results.
* In turn, I realized that we could just *sample* rather than
continually churning the circular buffer. So each flow's
buffer has a capacity, and the monitor bails out once a flow's
buffer is full of RTT results. Really big performance win -
"return" is a really fast call. :-) (The buffers are reset when
read; there's a sketch of the idea just after this list.)
* Perf maps are great, but I didn't want to require a separate
daemon to sit there consuming the perf buffer and re-emitting
results in a LibreQoS-friendly format, when a much simpler
mechanism gets the same result - without another program
handling the mmap'd performance data all the time.
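Roughly, the sampling idea looks like this - the map layout, names, and sizes are invented for illustration and differ from the real code:

```c
/* Sketch of the per-TC-handle sample buffer (names and sizes invented).
 * Once a buffer is full we stop recording for that handle until userspace
 * reads and resets it, so the hot path usually costs one lookup and a
 * cheap early return. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_SAMPLES 16

struct rtt_buffer {
    __u32 count;                  /* samples recorded since the last read */
    __u32 rtt_usec[MAX_SAMPLES];  /* RTT samples in microseconds          */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);           /* TC handle (major:minor packed)       */
    __type(value, struct rtt_buffer);
} rtt_samples SEC(".maps");

static __always_inline void record_rtt(__u32 tc_handle, __u32 rtt_usec)
{
    struct rtt_buffer *buf = bpf_map_lookup_elem(&rtt_samples, &tc_handle);
    __u32 idx;

    if (!buf)
        return;  /* entry creation omitted in this sketch */

    idx = buf->count;
    if (idx >= MAX_SAMPLES)
        return;  /* buffer full: bail out until userspace resets it */

    buf->rtt_usec[idx] = rtt_usec;
    buf->count = idx + 1;
    /* Userspace reads the map periodically, consumes the samples, and
     * zeroes count.  Because xdp-cpumap-tc pins a flow to one CPU, there's
     * no worry about concurrent writers in this sketch. */
}
```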
So the result is really fast and does exactly what I need.
It's not meant to be "better" than the original; for the original's
purpose, it's not great. For rapidly building QoE metrics on
a live shaper box, with absolutely minimal overhead and a
focus on sipping the firehose rather than trying to drink it
all - it's about right.
Philosophically, I've always favored tools that do exactly
what I need.
Likewise, if someone would like to come up with a really
good recipe that runs both rather than a combined
program - that'd be awesome. If it can match the
performance of cpumap-pping, I'll happily switch
BracketQoS to use it.
You're obviously welcome to any of the code; if it can help
the original projects, that's wonderful. Right now, I don't
have the time to come up with a better way of layering
XDP/TC programs!
* - I keep wondering if I shouldn't roll some .deb packages
and a configurator to make setup easier!
** - there *really* should be a standard flow dissector. The
Linux traffic shaper's dissector can handle VLAN tags and
an MPLS header. xdp-cpumap-tc handles VLANs with
aplomb and doesn't touch MPLS. epping calls out to the
xdp-project's dissector, which appears to handle
VLANs and also doesn't touch MPLS.
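To make "handles VLANs" concrete: it boils down to peeling any 802.1Q/802.1ad tags before trusting the EtherType. A sketch along these lines (illustrative only, not taken from any of the dissectors above; MPLS would need a similar walk over 4-byte labels until the bottom-of-stack bit is set):

```c
/* Illustrative only: skip up to two stacked VLAN tags and return the real
 * L3 EtherType, advancing *offset past the tags. */
#include <linux/types.h>
#include <linux/if_ether.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

struct vlan_hdr_sketch {
    __be16 h_vlan_TCI;
    __be16 h_vlan_encapsulated_proto;
};

static __always_inline __u16 skip_vlan_tags(void *data, void *data_end,
                                            __u16 eth_proto, __u32 *offset)
{
    for (int i = 0; i < 2; i++) {  /* allow up to two stacked tags */
        struct vlan_hdr_sketch *vh;

        if (eth_proto != bpf_htons(ETH_P_8021Q) &&
            eth_proto != bpf_htons(ETH_P_8021AD))
            break;
        vh = data + *offset;
        if ((void *)(vh + 1) > data_end)
            return 0;  /* truncated tag: treat as unparseable */
        eth_proto = vh->h_vlan_encapsulated_proto;
        *offset  += sizeof(*vh);
    }
    return eth_proto;  /* the EtherType of the actual L3 payload */
}
```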
Thanks,
Herbert