Hey,

My current (unfinished) progress on this is now available here:
https://github.com/thebracket/cpumap-pping-hackjob

I mean it about the warnings: this isn't at all stable or debugged, and I
can't promise that it won't unleash the nasal demons (to use a popular
C++ phrase). The name is descriptive! ;-)

With that said, I'm pretty happy so far:

* It runs only on the classifier - which xdp-cpumap-tc has nicely shunted
  onto a dedicated CPU. It has to run on both the inbound and outbound
  classifiers, since otherwise it would only see half the conversation.
* It does assume that your ingress and egress CPUs are mapped to the same
  interface; I do that anyway in BracketQoS. Not doing that opens up a
  potential world of pain, since writes to the shared maps would require
  a locking scheme. Too much locking, and you lose all of the benefit of
  using multiple CPUs to begin with.
* It is pretty wasteful of RAM, but most of the shaper systems I've worked
  with have lots of it.
* I've been gradually removing features that I don't want for BracketQoS.
  A hypothetical future "useful to everyone" version wouldn't do that.
* Rate limiting is working, but I removed the requirement for a shared
  configuration provided from userland - so right now it's always set to
  report at 1 second intervals per stream.
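
A rough sketch of what "report at 1 second intervals per stream" amounts
to, written in Python for readability rather than as the actual eBPF code.
The names FlowRateLimiter and should_report, and the flow key used in the
example, are made up for illustration; only the fixed 1 second interval
comes from the description above.

```python
# Illustrative sketch only: the real logic lives in the eBPF classifier.
# FlowRateLimiter and should_report are hypothetical names.

NSEC_PER_SEC = 1_000_000_000


class FlowRateLimiter:
    """Allow at most one report per flow per interval."""

    def __init__(self, interval_ns=NSEC_PER_SEC):
        self.interval_ns = interval_ns
        self.last_report_ns = {}  # flow key -> timestamp of last report

    def should_report(self, flow, now_ns):
        last = self.last_report_ns.get(flow)
        if last is not None and now_ns - last < self.interval_ns:
            return False  # too soon since the last report for this flow
        self.last_report_ns[flow] = now_ns
        return True


limiter = FlowRateLimiter()
flow = ("client", "world")  # placeholder flow key, not the real key layout
times_ns = (0, 400_000_000, 1_000_000_000, 1_500_000_000)
decisions = [limiter.should_report(flow, t) for t in times_ns]
print(decisions)
```

Each flow keeps its own timestamp, so a busy flow cannot suppress reports
for a quiet one.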

My testbed is currently 3 Hyper-V VMs: a simple "client" and "world", and
a "shaper" VM in between running a slightly hacked-up LibreQoS.
iperf from "client" to "world" (with Libre set to allow 10 Gbit/s max, via a
cake/HTB queue setup) is around 5 Gbit/s at present on my test PC (the host
is a 12th-gen Core i7 with 12 cores, 64 GB RAM and fast SSDs).

Output currently consists of debug messages reading:
  cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
  cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
  cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
  cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
  cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
  cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
  cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
  cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
  cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
  cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
  cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
  cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
  cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
  cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
  cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
  cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event

The times haven't been tweaked yet. The (5,1) is tc handle major/minor,
allocated by the xdp-cpumap parent.
I get pretty low latency between VMs; I'll set up a test with some
real-world data very soon.
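
For anyone who wants to pull numbers out of the debug messages in the
meantime, here is a small Python sketch that parses the "Send performance
event" lines. It assumes each event arrives on a single line (as the kernel
trace pipe emits it) and that the handle parts are printed in decimal; the
latter is a guess, not something confirmed above. The trailing number is
just called "value", since its unit isn't stated here. parse_event and
tc_handle are illustrative names, not part of the repository.

```python
import re

# Matches the "Send performance event" debug lines quoted above.
# Assumption: handle major/minor are printed in decimal.
EVENT_RE = re.compile(
    r"(?P<ts>\d+\.\d+): bpf_trace_printk: \(tc\) "
    r"Send performance event \((?P<major>\d+),(?P<minor>\d+)\), (?P<value>\d+)"
)


def parse_event(line):
    """Return (timestamp, major, minor, value), or None for other lines."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return float(m["ts"]), int(m["major"]), int(m["minor"]), int(m["value"])


def tc_handle(major, minor):
    """Pack major/minor into a 32-bit tc handle (major in the upper 16 bits)."""
    return (major << 16) | minor


sample = (
    "  cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) "
    "Send performance event (5,1), 374696"
)
ts, major, minor, value = parse_event(sample)
# tc itself displays handles as hex "major:minor".
print(ts, f"{major:x}:{minor:x}", hex(tc_handle(major, minor)), value)
```

Lines such as "Flow open event" simply return None, so the same function
can be run over the whole debug stream.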

I plan to keep hacking away, but feel free to take a peek.

Thanks,
Herbert

On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se>
wrote:

> Hi, thanks for adding me to the conversation. Just a couple of quick
> notes.
>
> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> > [ Adding Simon to Cc ]
> >
> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
> >
> > > Hey,
> > >
> > > I've had some pretty good success with merging xdp-pping (
> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
> > >
> > > I ported over most of the xdp-pping code, and then changed the entry
> > > point and packet parsing code to make use of the work already done in
> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need
> > > to do it twice). Then I switched the maps to per-cpu maps, and had to
> > > pin them - otherwise the two tc instances don't properly share data.
> > >
>
> I guess xdp-cpumap-tc ensures that the same flow is processed on
> the same CPU core at both ingress and egress. Otherwise, if a flow may
> be processed by different cores on ingress and egress, the per-CPU maps
> will not really work reliably, as each core will have a different view
> of the state of the flow (whether there's been a previous packet with a
> certain TSval from that flow, etc.).
>
> Furthermore, if a flow is always processed on the same core (on both
> ingress and egress) I think per-CPU maps may be a bit wasteful on
> memory. From my understanding the keys for per-CPU maps are still
> shared across all CPUs, it's just that each CPU gets its own value. So
> all CPUs will then have their own data for each flow, but it's only the
> CPU processing the flow that will have any relevant data for the flow
> while the remaining CPUs will just have an empty state for that flow.
> Under the same assumption that packets within the same flow are always
> processed on the same core there should generally not be any
> concurrency issues with having a global (non-per-CPU) map either, as
> packets from the same flow cannot be processed concurrently then (and
> thus no concurrent access to the same value in the map). I am however
> still very unclear on whether there's any considerable performance
> difference between global and per-CPU map versions if the same key is
> not accessed concurrently.
>
> > > Right now, output
> > > is just stubbed - I've still got to port the perfmap output code.
> > > Instead, I'm dumping a bunch of extra data to the kernel debug pipe,
> > > so I can see roughly what the output would look like.
> > >
> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec
> > > on single-stream iperf between two VMs (with a shaper VM in the
> > > middle). :-)
> >
> > Just FYI, that "just logging" is probably the biggest source of
> > overhead, then. What Simon found was that sending the data from kernel
> > to userspace is one of the most expensive bits of epping, at least when
> > the number of data points goes up (which it does as additional flows are
> > added).
>
> Yeah, reporting individual RTTs when there's lots of them (you may get
> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
> direct overhead from the tool itself, but also becomes demanding for
> whatever you use all those RTT samples for (i.e. need to log, parse,
> analyze etc. a very large amount of RTTs). One way to deal with that is
> of course to just apply some sort of sampling (the -r/--rate-limit and
> -R/--rtt-rate
> >
> > > So my question: how would you prefer to receive this data? I'll have to
> > > write a daemon that provides userspace control (periodic cleanup as
> > > well as reading the performance stream), so the world's kinda our
> > > oyster. I can stick to Kathie's original format (and dump it to a named
> > > pipe, perhaps?), a condensed format that only shows what you want to
> > > use, an efficient binary format if you feel like parsing that...
> >
> > It would be great if we could combine efforts a bit here so we don't
> > fork the codebase more than we have to. I.e., if "upstream" epping and
> > whatever daemon you end up writing can agree on data format etc that
> > would be fantastic! Added Simon to Cc to facilitate this :)
> >
> > Briefly what I've discussed before with Simon was to have the ability to
> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
> > utility periodically pull them out. What we discussed was doing this
> > using an LPM map (which is not in that PR yet). The idea would be that
> > userspace would populate the LPM map with the keys (prefixes) they
> > wanted statistics for (in LibreQoS context that could be one key per
> > customer, for instance). Epping would then do a map lookup into the LPM,
> > and if it gets a match it would update the statistics in that map entry
> > (keeping a histogram of latency values seen, basically). Simon's PR
> > below uses this technique where userspace will "reset" the histogram
> > every time it loads it by swapping out two different map entries when it
> > does a read; this allows you to control the sampling rate from
> > userspace, and you'll just get the data since the last time you polled.
>
> Thanks Toke for summarizing both the current state and the plan going
> forward. I will just note that this PR (and all my other work with
> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
> on hold for a couple of weeks right now as I'm trying to finish up a
> paper.
>
> > I was thinking that if we all can agree on the map format, then your
> > polling daemon could be one userspace "client" for that, and the epping
> > binary itself could be another; but we could keep compatibility between
> > the two, so we don't duplicate effort.
> >
> > Similarly, refactoring of the epping code itself so it can be plugged
> > into the cpumap-tc code would be a good goal...
>
> Should probably do that...at some point. In general I think it's a bit
> of an interesting problem to think about how to chain multiple XDP/tc
> programs together in an efficient way. Most XDP and tc programs will do
> some amount of packet parsing and when you have many chained programs
> parsing the same packets this obviously becomes a bit wasteful. At the
> same time it would be nice if one didn't need to manually merge
> multiple programs together into a single one like this to get rid of
> this duplicated parsing, or at least make that process of merging those
> programs as simple as possible.
>
>
> > -Toke
> >
> > [0] https://github.com/xdp-project/bpf-examples/pull/59