[LibreQoS] Before/After Performance Comparison (Monitoring Mode)

Tue Nov 8 13:53:27 EST 2022

Hi,
Will just chime in with my own perspective on this ePPing (which is what
I've internally named my eBPF-based pping) vs xdp-cpumap-pping debate and
address some of the points mentioned.

First of all I want to say that I'm very impressed with all the work
Herbert has made with both xdp-cpumap-tc and xdp-cpumap-pping. There
seems to be very rapid progress and some very nice performance numbers
being presented. I'm also very happy that some of my work with ePPing
can benefit LibreQoS, even if I, like Toke, would hope that we could
perhaps benefit a bit more from each other's work.

That said I can see some of the benefits from keeping cpumap-pping its
own thing and understand if that's the route you want to head down.
Regardless I hope we can at least exchange some ideas and learn from
each other.

On Tue, 2022-11-08 at 10:02 -0600, Herbert Wolverson via LibreQoS
wrote:
> I'm (obviously!) going to respectfully disagree with Toke on this one.
> I didn't dive into cpumap-pping for fun; I tried *really hard* to work
> with the original epping/xdp-pping. It's a great tool, really fantastic work.
> It's also not really designed for the same purpose.

ePPing is very heavily inspired by Kathie's pping, and perhaps a bit
too much so at times. Allowing an ISP to monitor the latency its
customers experience is definitely a use case we would want to support
with ePPing, and we are working on some improvements to make it work
better for that (as Toke mentioned we're currently looking at adding
some aggregation support instead of reporting individual RTT samples).
So I will definitely have a look at some of the changes Herbert has
done with cpumap-pping to see if it makes sense to implement some of
them as alternatives for ePPing as well.
>
> The original Pollere pping is wonderful, but isn't going to scale - the
> way it ingests packets isn't going to scale across multiple CPUs,
> and having a core pegging 100% on a busy shaper box was
> degrading overall performance. epping solves the scalability
> issue wonderfully, and (rightly) remains focused on giving you
> a complete report of all of the data it accessed while it was
> running. If you want to run a monitoring session and see what's
> going on, it's a *fantastic* way to do it - serious props there. I
> benchmarked it at about 15 gbit/s on single-stream testing,
> which is *really* impressive (no other BPF programs active,
> no shaping).
>
> The first issue I ran into is that stacking XDP programs isn't
> all that well defined a process. You can make it work, but
> it gets messy when both programs have setup/teardown
> routines. I kinda, sorta managed to get the two running at
> once, and it mostly worked. There *really* needs to be an
> easier way that doesn't run headlong into Ubuntu's lovely
> "you updated the kernel and tools, we didn't think you'd
> need bpftool so we didn't include it" issues, adjusting scripts
> until neither says "oops, there's already an XDP program
> here! Bye!". I know that this is a pretty new thing, but the
> tooling hasn't really caught up yet to make this a comfortable
> process. I'm pretty sure I spent more time trying to run both
> at once than it took to make a combined version that sort-of
> ran. (I had a working version in an afternoon)

I will admit that I don't have much experience with chaining XDP
programs, but libxdp has been designed to solve that. ePPing has used
libxdp to load its XDP program for a while now. But for that to
work together with xdp-cpumap-tc I guess xdp-cpumap-tc would need to be
modified to also use libxdp. I remember reading that there was some
other issue regarding how ePPing handled VLAN tags, but don't recall
the details, although that seems like it should be possible to solve.
>
> With the two literally concatenated (but compiled together),
> it worked - but there was a noticeable performance cost. That's
> where orthogonal design choices hit - epping/xdp-pping is
> sampling everything (it can even go looking for DNS and ICMP!).
> A QoE box *really* needs to go out of its way to avoid adding
> any latency, otherwise you're self-defeating. A representative
> sample is really all you need - while for epping's target,
> a really detailed sample is what you need. When faced with
> differing design goals like that, my response is always to
> make a tool that very efficiently does what I need.
>
> Combining the packet parsing** was the obvious low-hanging
> fruit. It is faster, but not by very much. But I really hate
> it when code repeats itself. It seriously set off my OCD
> watching both find the IP header offset, determine protocol
> (IPv4 vs IPv6), etc. Small performance win.
>
> Bailing out as soon as we determine that we aren't looking
> at a TCP packet was a big performance win. You can achieve
> the same by carefully setting up the "config" for epping,
> but there's not a lot of point in keeping the DNS/ICMP code
> when it's not needed. Still a performance win, and not
> needing to maintain a configuration (that will be the same
> each time) makes setup easier.

Just want to clarify that ePPing does not support DNS (yet), even if we
may add it at some point. So for now it's just TCP and ICMP. ePPing can
easily ignore non-TCP traffic (it does so by default these days, you
have to explicitly enable tracking of ICMP traffic), and the runtime
overhead from the additional ICMP code should be minimal if ePPing is
not set up to track ICMP (the JIT-compilation ought to optimize it all
away with dead code elimination as those branches will never be hit
then).

That said, the additional code for different protocols of course adds to
the overall code complexity. Furthermore it may make it a bit more
challenging to optimize ePPing for specific protocols as I also try to
keep a somewhat common core which can work for all protocols we add (so
we don't end up with a completely separate branch of code for each
protocol).
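
To make the "bail out early on non-TCP" idea concrete, here is a minimal
userspace sketch in Python. It is not taken from ePPing or
xdp-cpumap-pping (both are written in C for eBPF); the function name
tcp_header_offset is hypothetical, and the sketch assumes plain Ethernet
framing with optional stacked VLAN tags and does not follow IPv6
extension headers.

```python
import struct

ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD
VLAN_TYPES = (0x8100, 0x88A8)  # 802.1Q and 802.1ad tags
IPPROTO_TCP = 6


def tcp_header_offset(frame: bytes):
    """Return the byte offset of the TCP header, or None for anything else."""
    if len(frame) < 14:
        return None
    offset = 12  # EtherType follows the two 6-byte MAC addresses
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    offset += 2
    # Skip stacked VLAN tags: 2 bytes of TCI, then the inner EtherType.
    while ethertype in VLAN_TYPES:
        if len(frame) < offset + 4:
            return None
        (ethertype,) = struct.unpack_from("!H", frame, offset + 2)
        offset += 4
    if ethertype == ETH_P_IP:
        if len(frame) < offset + 20:
            return None
        ihl = (frame[offset] & 0x0F) * 4  # IPv4 header length in bytes
        if ihl < 20 or frame[offset + 9] != IPPROTO_TCP:
            return None  # early bail-out: not TCP
        return offset + ihl
    if ethertype == ETH_P_IPV6:
        if len(frame) < offset + 40:
            return None
        # Simplification: only the first Next Header field is checked.
        if frame[offset + 6] != IPPROTO_TCP:
            return None  # early bail-out: not TCP
        return offset + 40
    return None  # neither IPv4 nor IPv6
```

Everything that is not TCP returns immediately after the header parse,
so no per-flow state is touched for that traffic.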
>
> Running by default on the TC (egress) rather than XDP
> is a big win, too - but only after cpumap-tc has shunted
> processing to the appropriate CPU. Now processing is
> divided between CPUs, and cache locality is more likely
> to happen - the packet we are reading is in the local
> core's cache when cpumap-pping reads it, and there's
> a decent chance it'll still be there (at least L2) by the time
> it gets to the actual queue discipline.
>
> Changing the reporting mechanism was a really big win,
> in terms of performance and the tool aligning with what's
> needed:
> * Since xdp-cpumap has already done the work to determine
>   that a flow belongs in TC handle X:Y - and mapping RTT
>   performance to customer/circuit is *exactly* what we're
>   trying to do - it just makes sense to take that value and
>   use it as a key for the results.
> * Since we don't care about every packet - rather, we want
>   a periodic representative sample - we can use an efficient
>   per TC handle circular buffer in which to store results.
> * In turn, I realized that we could just *sample* rather than
>   continually churning the circular buffer. So each flow's
>   buffer has a capacity, and the monitor bails out once a flow
>   buffer is full of RTT results. Really big performance win.
>   "return" is a really fast call. :-) (The buffers are reset when
>   read)
> * Perfmaps are great, but I didn't want to require a daemon
>   run (mapping the perfmap results) and in turn output
>   results in a LibreQoS-friendly format when a much simpler
>   mechanism gets the same result - without another program
>   sitting handling the mmap'd performance flows all the time.
>
> So the result was really fast and does exactly what I need.
> It's not meant to be "better" than the original; for the original's
> purpose, it's not great. For rapidly building QoE metrics on
> a live shaper box, with absolutely minimal overhead and a
> focus on sipping the firehose rather than trying to drink it
> all - it's about right.
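
As a rough illustration of the sampling scheme Herbert describes above
(a fixed-capacity buffer per TC handle that stops accepting samples
once full and is reset when read), here is a small Python sketch. It is
not the cpumap-pping implementation, which lives in BPF maps; the class
name SampledRttStore, the default capacity, and the (major, minor)
tuple key are assumptions made for the example.

```python
from collections import defaultdict


class SampledRttStore:
    """Keep at most `capacity` RTT samples per TC handle between reads."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._samples = defaultdict(list)

    def record(self, tc_handle, rtt_ms):
        """Store a sample; return False if the handle's buffer is full."""
        bucket = self._samples[tc_handle]
        if len(bucket) >= self.capacity:
            return False  # buffer full: bail out without doing any work
        bucket.append(rtt_ms)
        return True

    def read_and_reset(self):
        """Return all collected samples and start with empty buffers."""
        snapshot = dict(self._samples)
        self._samples = defaultdict(list)
        return snapshot
```

A reader polling read_and_reset() at a fixed interval bounds the work
per interval to capacity samples per handle, regardless of how many
packets arrive.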

As already mentioned, we are working on aggregating RTT reports for
ePPing. Spitting out individual RTT samples as ePPing does now may be
useful in some cases, but can get rather overwhelming (both in terms of
overhead for ePPing itself, but also just for analyzing all those RTT
samples somehow). In some of our own tests we've had ePPing report over
125,000 RTT samples per second, which is of course a bit overkill.

I plan to take a bit closer look at all the optimizations Herbert has
done to see which can also be added to ePPing (at least as an option).
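
To give an idea of what aggregation could look like, here is a minimal
Python sketch that folds individual RTT samples into per-subnet
summaries. It does not reflect how ePPing does or will implement this;
the function name aggregate_rtts, the (address, rtt_ms) sample format,
and the choice of count/min/mean/max as the summary are all assumptions
for illustration.

```python
import ipaddress
from collections import defaultdict


def aggregate_rtts(samples, subnets):
    """Summarize (address, rtt_ms) samples per configured subnet.

    The first matching subnet wins; samples matching none are dropped.
    """
    nets = [ipaddress.ip_network(s) for s in subnets]
    buckets = defaultdict(list)
    for addr, rtt_ms in samples:
        ip = ipaddress.ip_address(addr)
        for net in nets:
            if ip.version == net.version and ip in net:
                buckets[str(net)].append(rtt_ms)
                break
    return {
        net: {
            "count": len(rtts),
            "min": min(rtts),
            "mean": sum(rtts) / len(rtts),
            "max": max(rtts),
        }
        for net, rtts in buckets.items()
    }
```

With a summary like this, the reporting cost depends on the number of
subnets rather than on the number of RTT samples per second.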
>
> Philosophically, I've always favored tools that do exactly
> what I need.

While I like the simplicity of this philosophy, you will end up with a
lot of very similar but slightly different tools if everyone uses sets
of tools that are tailored to their exact use case. In the long run
that seems a bit cumbersome to maintain, but of course maintaining a
more general tool has its own complexities.
>
> Likewise, if someone would like to come up with a really
> good recipe that runs both rather than a combined
> program - that'd be awesome. If it can match the
> performance of cpumap-pping, I'll happily switch
> BracketQoS to use it.

Long term I think this would be nice, but getting there might take
some time. ePPing + xdp-cpumap-tc would likely always have a bit more
overhead compared to xdp-cpumap-pping (due to for example parsing the
packets multiple times), but I don't think it should be impossible to
make that overhead relatively small compared to the overall work
xdp-cpumap-tc and ePPing are doing.
>
> You're obviously welcome to any of the code; if it can help
> the original projects, that's wonderful. Right now, I don't
> have the time to come up with a better way of layering
> XDP/TC programs!

Thanks for keeping this open source, I will definitely have a look at
the code and see if I can use some of it for ePPing.

With best regards, Simon Sundberg.

> * - I keep wondering if I shouldn't roll some .deb packages
> and a configurator to make setup easier!
>
> ** - there *really* should be a standard flow dissector. The
> Linux traffic shaper's dissector can handle VLAN tags and
> an MPLS header. xdp-cpumap-tc handles VLANs with
> aplomb and doesn't touch MPLS. epping calls out to the
> xdp-project's dissector which appears to handle
> VLANs and also doesn't touch MPLS.
>
> Thanks,
> Herbert
>
> On Tue, Nov 8, 2022 at 8:23 AM Toke Høiland-Jørgensen via LibreQoS
> <libreqos at lists.bufferbloat.net> wrote:
> > Robert Chacón via LibreQoS <libreqos at lists.bufferbloat.net> writes:
> >
> > > I was hoping to add a monitoring mode which could be used before
> > > "turning
> > > on" LibreQoS, ideally before v1.3 release. This way operators can
> > > really
> > > see what impact it's having on end-user and network latency.
> > >
> > > The simplest solution I can think of is to implement Monitoring
> > > Mode using
> > > cpumap-pping as we already do - with plain HTB and leaf classes
> > > with no
> > > CAKE qdisc applied, and with HTB and leaf class rates set to
> > > impossibly
> > > high amounts (no plan enforcement). This would allow for
> > > before/after
> > > comparisons of Nodes (Access Points). My only concern with this
> > > approach is
> > > that HTB, even with rates set impossibly high, may not be truly
> > > transparent. It would be pretty easy to implement though.
> > >
> > > Alternatively we could use ePPing
> > > <https://github.com/xdp-project/bpf-examples/tree/master/pping>
> > > but I worry
> > > about throughput and the possibility of latency tracking being
> > > slightly
> > > different from cpumap-pping, which could limit the utility of a
> > > comparison.
> > > We'd have to match IPs in a way that's a bit more involved here.
> > >
> > > Thoughts?
> >
> > Well, this kind of thing is exactly why I think concatenating the
> > two
> > programs (cpumap and pping) into a single BPF program was a
> > mistake:
> > those are two distinct pieces of functionality, and you want to be
> > able
> > to run them separately, as your "monitor mode" use case shows. The
> > overhead of parsing the packet twice is trivial compared to
> > everything
> > else those apps are doing, so I don't think the gain is worth
> > losing
> > that flexibility.
> >
> > So I definitely think using the regular epping is the right thing
> > to do
> > here. Simon is looking into improving its reporting so it can be
> > per-subnet using a user-supplied configuration file for the actual
> > subnets, which should hopefully make this feasible. I'm sure he'll
> > chime
> > in here once he has something to test and/or with any questions
> > that pop
> > up in the process.
> >
> > Longer term, I'm hoping all of Herbert's other improvements to
> > epping
> > reporting/formatting can make it into upstream epping, so LibreQoS
> > can
> > just use that for everything :)
> >
> > -Toke
> > _______________________________________________
> > LibreQoS mailing list
> > LibreQoS at lists.bufferbloat.net
> > https://lists.bufferbloat.net/listinfo/libreqos
