It's probably not entirely thread-safe right now (I ran into some issues
reading per-CPU maps back from userspace; hopefully I'll get that figured
out), but the commits I just pushed have it basically working in
single-stream testing. :-) Set up cpumap as usual, and periodically run
xdp-pping. This gives you per-connection RTT information in JSON:

[{"tc":"1:5", "avg":5, "min":5, "max":5, "samples":1}, {}]

(With the extra {} because I'm not tracking the tail and haven't done
comma removal.) The tool also empties the various maps used to gather
data, acting as a "reset" point. There's a maximum of 60 samples per
queue, kept in a ring buffer (so the newest samples overwrite the
oldest). I'll start trying to test on a larger scale now.

On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón
<robert.chacon@jackrabbitwireless.com> wrote:

> Hey Herbert,
>
> Fantastic work! Super exciting to see this coming together, especially
> so quickly. I'll test it soon.
> I understand and agree with your decision to omit certain features
> (ICMP tracking, DNS tracking, etc.) to optimize performance for our use
> case. Like you said, merging the projects is sort of the only way right
> now to get the combined functionality without a performance hit;
> otherwise there would be a lot of redundancy and lost throughput for an
> ISP's use. Though hopefully, long term, there will be a way to keep all
> the projects working independently but interoperably, with a plugin
> system of some kind.
>
> By the way, I'm making some headway on LibreQoS v1.3, focusing on
> optimizations for high subscriber counts (8000+ subs) as well as
> stateful changes to the queue structure.
> I'm working to set up a physical lab to test high-throughput and
> high-client-count scenarios.
> When testing beyond ~32,000 filters we get "no space left on device"
> from xdp-cpumap-tc, which I think relates to the BPF map size
> limitation you mentioned. Maybe in the coming months we can take a look
> at that.
>
> Anyway, great work on the cpumap-pping program! Excited to see more on
> this.
>
> Thanks,
> Robert
>
> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS
> <libreqos@lists.bufferbloat.net> wrote:
>
>> Hey,
>>
>> My current (unfinished) progress on this is now available here:
>> https://github.com/thebracket/cpumap-pping-hackjob
>>
>> I mean it about the warnings: this isn't at all stable or debugged,
>> and I can't promise that it won't unleash the nasal demons (to use a
>> popular C++ phrase). The name is descriptive! ;-)
>>
>> With that said, I'm pretty happy so far:
>>
>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
>> shunted onto a dedicated CPU. It has to run on both the inbound and
>> outbound classifiers, since otherwise it would only see half the
>> conversation.
>> * It does assume that your ingress and egress CPUs are mapped to the
>> same interface; I do that anyway in BracketQoS. Not doing that opens
>> up a potential world of pain, since writes to the shared maps would
>> require a locking scheme. Too much locking, and you lose all of the
>> benefit of using multiple CPUs to begin with.
>> * It is pretty wasteful of RAM, but most of the shaper systems I've
>> worked with have lots of it.
>> * I've been gradually removing features that I don't want for
>> BracketQoS. A hypothetical future "useful to everyone" version
>> wouldn't do that.
>> * Rate limiting is working, but I removed the requirement for a shared
>> configuration provided from userland - so right now it's always set to
>> report at 1-second intervals per stream.
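As an aside, the per-flow rate limiting described in that last bullet can
be done with a single timestamp map. The sketch below (in BPF C) is
purely illustrative - it is not Herbert's actual code, and the names
(flow_key, last_report, report_allowed) are made up:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define NS_PER_SEC 1000000000ULL

struct flow_key {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, __u64);                /* last report time, in ns */
    __uint(pinning, LIBBPF_PIN_BY_NAME); /* visible to both tc programs */
} last_report SEC(".maps");

/* Return 1 if this flow may emit a report now, 0 if it is rate-limited. */
static __always_inline int report_allowed(struct flow_key *key)
{
    __u64 now = bpf_ktime_get_ns();
    __u64 *last = bpf_map_lookup_elem(&last_report, key);

    if (!last || now - *last >= NS_PER_SEC) {
        bpf_map_update_elem(&last_report, key, &now, BPF_ANY);
        return 1;
    }
    return 0;
}

Making the interval configurable again would just mean reading the
threshold from a small userland-populated array map instead of
hard-coding NS_PER_SEC.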
>>
>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world",
>> and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>> iperf from "client" to "world" (with Libre set to allow 10 Gbit/s max,
>> via a cake/HTB queue setup) is around 5 Gbit/s at present on my test
>> PC (the host is a 12th-gen Core i7 with 12 cores, 64 GB RAM, and fast
>> SSDs).
>>
>> Output currently consists of debug messages reading:
>>
>> cpumap/0/map:4-1371 [000] D..2. 515.399222: bpf_trace_printk: (tc) Flow open event
>> cpumap/0/map:4-1371 [000] D..2. 515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>> cpumap/0/map:4-1371 [000] D..2. 515.399466: bpf_trace_printk: (tc) Flow open event
>> cpumap/0/map:4-1371 [000] D..2. 515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>> cpumap/0/map:4-1371 [000] D..2. 516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>> cpumap/0/map:4-1371 [000] D..2. 517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>> cpumap/0/map:4-1371 [000] D..2. 518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>> cpumap/0/map:4-1371 [000] D..2. 519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>> cpumap/0/map:4-1371 [000] D..2. 520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>> cpumap/0/map:4-1371 [000] D..2. 521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>> cpumap/0/map:4-1371 [000] D..2. 522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>> cpumap/0/map:4-1371 [000] D..2. 523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>> cpumap/0/map:4-1371 [000] D..2. 524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>> cpumap/0/map:4-1371 [000] D..2. 525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>> cpumap/0/map:4-1371 [000] D..2. 525.434080: bpf_trace_printk: (tc) Send flow event
>> cpumap/0/map:4-1371 [000] D..2. 525.482714: bpf_trace_printk: (tc) Send flow event
>>
>> The times haven't been tweaked yet. The (5,1) is the tc handle
>> major/minor, allocated by the xdp-cpumap parent.
>> I get pretty low latency between VMs; I'll set up a test with some
>> real-world data very soon.
>>
>> I plan to keep hacking away, but feel free to take a peek.
>>
>> Thanks,
>> Herbert
>>
>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg wrote:
>>
>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>>> notes.
>>>
>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>> > [ Adding Simon to Cc ]
>>> >
>>> > Herbert Wolverson via LibreQoS writes:
>>> >
>>> > > Hey,
>>> > >
>>> > > I've had some pretty good success with merging xdp-pping (
>>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>>> > >
>>> > > I ported over most of the xdp-pping code, and then changed the
>>> > > entry point and packet parsing code to make use of the work
>>> > > already done in xdp-cpumap-tc (it has already parsed a big chunk
>>> > > of the packet, so there's no need to do it twice). Then I switched
>>> > > the maps to per-CPU maps, and had to pin them - otherwise the two
>>> > > tc instances don't properly share data.
>>> > >
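To make the pinned per-CPU maps concrete, a declaration of that kind
looks roughly like the libbpf-style sketch below; flow_key, flow_state,
and the map size are hypothetical stand-ins, not the actual cpumap-pping
definitions:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
};

struct flow_state {
    __u64 last_tsval;   /* last TCP timestamp (TSval) seen for this flow */
    __u64 last_seen_ns; /* when it was seen */
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH); /* one value per key per CPU */
    __uint(max_entries, 16384);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
    __uint(pinning, LIBBPF_PIN_BY_NAME);    /* pinned under /sys/fs/bpf */
} flow_state_map SEC(".maps");

Pinning by name means the separately loaded ingress and egress tc
programs resolve flow_state_map to the same kernel object instead of
each getting a private copy - that is the "sharing" referred to above.
It only helps if both directions of a flow are handled on the same CPU,
which is exactly the caveat Simon raises next.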
>>> I guess xdp-cpumap-tc ensures that the same flow is processed on the
>>> same CPU core at both ingress and egress. Otherwise, if a flow may be
>>> processed by different cores on ingress and egress, the per-CPU maps
>>> will not really work reliably, as each core will have a different
>>> view of the state of the flow - whether there's been a previous
>>> packet with a certain TSval from that flow, etc.
>>>
>>> Furthermore, if a flow is always processed on the same core (on both
>>> ingress and egress), I think per-CPU maps may be a bit wasteful of
>>> memory. From my understanding, the keys for per-CPU maps are still
>>> shared across all CPUs; it's just that each CPU gets its own value.
>>> So all CPUs will then have their own data for each flow, but it's
>>> only the CPU processing the flow that will have any relevant data for
>>> it, while the remaining CPUs will just have an empty state for that
>>> flow. Under the same assumption that packets within the same flow are
>>> always processed on the same core, there should generally not be any
>>> concurrency issues with having a global (non-per-CPU) map either, as
>>> packets from the same flow cannot be processed concurrently then (and
>>> thus there is no concurrent access to the same value in the map). I
>>> am, however, still very unclear on whether there's any considerable
>>> performance difference between the global and per-CPU map versions
>>> when the same key is not accessed concurrently.
>>>
>>> > > Right now, output is just stubbed - I've still got to port the
>>> > > perfmap output code. Instead, I'm dumping a bunch of extra data
>>> > > to the kernel debug pipe, so I can see roughly what the output
>>> > > would look like.
>>> > >
>>> > > With debug enabled and just logging, I'm now getting about 4.9
>>> > > Gbits/sec on single-stream iperf between two VMs (with a shaper
>>> > > VM in the middle). :-)
>>> >
>>> > Just FYI, that "just logging" is probably the biggest source of
>>> > overhead, then. What Simon found was that sending the data from
>>> > kernel to userspace is one of the most expensive bits of epping, at
>>> > least when the number of data points goes up (which it does as
>>> > additional flows are added).
>>>
>>> Yeah, reporting individual RTTs when there are lots of them (you may
>>> get upwards of 1000 RTTs/s per flow) is not only problematic in terms
>>> of direct overhead from the tool itself, but also becomes demanding
>>> for whatever you use all those RTT samples for (i.e. you need to log,
>>> parse, analyze etc. a very large number of RTTs). One way to deal
>>> with that is of course to just apply some sort of sampling (the
>>> -r/--rate-limit and -R/--rtt-rate options).
>>>
>>> > > So my question: how would you prefer to receive this data? I'll
>>> > > have to write a daemon that provides userspace control (periodic
>>> > > cleanup as well as reading the performance stream), so the
>>> > > world's kinda our oyster. I can stick to Kathie's original format
>>> > > (and dump it to a named pipe, perhaps?), a condensed format that
>>> > > only shows what you want to use, an efficient binary format if
>>> > > you feel like parsing that...
>>> >
>>> > It would be great if we could combine efforts a bit here so we
>>> > don't fork the codebase more than we have to. I.e., if "upstream"
>>> > epping and whatever daemon you end up writing can agree on data
>>> > format etc., that would be fantastic! Added Simon to Cc to
>>> > facilitate this :)
>>> >
>>> > Briefly, what I've discussed before with Simon was having the
>>> > ability to aggregate the metrics in the kernel (WiP PR [0]) and
>>> > having a userspace utility periodically pull them out. What we
>>> > discussed was doing this using an LPM map (which is not in that PR
>>> > yet). The idea would be that userspace would populate the LPM map
>>> > with the keys (prefixes) it wanted statistics for (in the LibreQoS
>>> > context that could be one key per customer, for instance). Epping
>>> > would then do a map lookup into the LPM map, and if it gets a match
>>> > it would update the statistics in that map entry (keeping a
>>> > histogram of latency values seen, basically). Simon's PR below uses
>>> > this technique: userspace "resets" the histogram every time it
>>> > loads it, by swapping out two different map entries when it does a
>>> > read. This allows you to control the sampling rate from userspace,
>>> > and you'll just get the data since the last time you polled.
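To make the shape of that concrete, one possible layout for such an
LPM-keyed histogram map is sketched below in BPF C. This is purely
illustrative - it is not the format from Simon's PR, and all names
(lpm_v4_key, rtt_hist, record_rtt) are made up:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define RTT_BUCKETS 32

struct rtt_hist {
    __u64 count;
    __u64 buckets[RTT_BUCKETS]; /* log2-spaced RTT bins, in microseconds */
};

struct lpm_v4_key {
    __u32 prefixlen; /* LPM trie keys must begin with the prefix length */
    __u32 addr;
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 16384);
    __type(key, struct lpm_v4_key);
    __type(value, struct rtt_hist);
    __uint(map_flags, BPF_F_NO_PREALLOC); /* required for LPM tries */
} rtt_by_prefix SEC(".maps");

static __always_inline void record_rtt(__u32 addr, __u64 rtt_ns)
{
    struct lpm_v4_key key = { .prefixlen = 32, .addr = addr };
    struct rtt_hist *hist = bpf_map_lookup_elem(&rtt_by_prefix, &key);
    __u64 us = rtt_ns / 1000;
    __u32 bucket;

    if (!hist)
        return; /* nobody asked for stats on this prefix */

    /* Pick a log2 bucket; the loop is bounded so the verifier accepts it. */
    for (bucket = 0; bucket < RTT_BUCKETS - 1 && us > 1; bucket++)
        us >>= 1;

    __sync_fetch_and_add(&hist->count, 1);
    __sync_fetch_and_add(&hist->buckets[bucket], 1);
}

Userspace would insert one key per customer prefix, then periodically
read the rtt_hist values and reset them (for example with the
entry-swapping trick described above) to get per-interval histograms.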
I.e., if "upstream" epping and >>> > whatever daemon you end up writing can agree on data format etc that >>> > would be fantastic! Added Simon to Cc to facilitate this :) >>> > >>> > Briefly what I've discussed before with Simon was to have the ability >>> to >>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace >>> > utility periodically pull them out. What we discussed was doing this >>> > using an LPM map (which is not in that PR yet). The idea would be that >>> > userspace would populate the LPM map with the keys (prefixes) they >>> > wanted statistics for (in LibreQOS context that could be one key per >>> > customer, for instance). Epping would then do a map lookup into the >>> LPM, >>> > and if it gets a match it would update the statistics in that map entry >>> > (keeping a histogram of latency values seen, basically). Simon's PR >>> > below uses this technique where userspace will "reset" the histogram >>> > every time it loads it by swapping out two different map entries when >>> it >>> > does a read; this allows you to control the sampling rate from >>> > userspace, and you'll just get the data since the last time you polled. >>> >>> Thank's Toke for summarzing both the current state and the plan going >>> forward. I will just note that this PR (and all my other work with >>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less >>> on hold for a couple of weeks right now as I'm trying to finish up a >>> paper. >>> >>> > I was thinking that if we all can agree on the map format, then your >>> > polling daemon could be one userspace "client" for that, and the epping >>> > binary itself could be another; but we could keep compatibility between >>> > the two, so we don't duplicate effort. >>> > >>> > Similarly, refactoring of the epping code itself so it can be plugged >>> > into the cpumap-tc code would be a good goal... >>> >>> Should probably do that...at some point. In general I think it's a bit >>> of an interesting problem to think about how to chain multiple XDP/tc >>> programs together in an efficent way. Most XDP and tc programs will do >>> some amount of packet parsing and when you have many chained programs >>> parsing the same packets this obviously becomes a bit wasteful. In the >>> same time it would be nice if one didn't need to manually merge >>> multiple programs together into a single one like this to get rid of >>> this duplicated parsing, or at least make that process of merging those >>> programs as simple as possible. >>> >>> >>> > -Toke >>> > >>> > [0] https://github.com/xdp-project/bpf-examples/pull/59 >>> >>> När du skickar e-post till Karlstads universitet behandlar vi dina >>> personuppgifter. >>> When you send an e-mail to Karlstad University, we will process your >>> personal data. >>> >> _______________________________________________ >> LibreQoS mailing list >> LibreQoS@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/libreqos >> > > > -- > Robert Chacón > CEO | JackRabbit Wireless LLC >