<div dir="ltr"><div>Hey,</div><div><br></div><div>My current (unfinished) progress on this is now available here: <a href="https://github.com/thebracket/cpumap-pping-hackjob">https://github.com/thebracket/cpumap-pping-hackjob</a></div><div><br></div><div>I mean it about the warnings: this isn't at all stable or debugged, and I can't promise that it won't unleash the nasal demons</div><div>(to use a popular C++ phrase). The name is descriptive! ;-)</div><div><br></div><div>With that said, I'm pretty happy so far:</div><div><br></div><div>* It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both</div><div>  the inbound and outbound classifiers, since otherwise it would only see half the conversation.<br></div><div>* It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing</div><div>  that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.</div><div>* It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.</div><div>* I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.</div><div>* Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.<br></div><div><br></div><div>My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS. 
<br></div><div>iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my <br></div><div>test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and fast SSDs)</div><div><br></div><div>Output currently consists of debug messages reading:</div><div>  cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event<br>  cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696<br>  cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event<br>  cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069<br>  cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155<br>  cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394<br>  cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289<br>  cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268<br>  cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493<br>  cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993<br>  cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519<br>  cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335<br>  cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910<br>  cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582<br>  cpumap/0/map:4-1371    [000] D..2.   
525.434080: bpf_trace_printk: (tc) Send flow event<br>  cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event</div><div><br></div><div>The times haven't been tweaked yet. The (5,1) is tc handle major/minor, allocated by the xdp-cpumap parent. <br></div><div>I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.<br></div><div><br></div><div>I plan to keep hacking away, but feel free to take a peek.</div><div><br></div><div>Thanks,</div><div>Herbert<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <<a href="mailto:Simon.Sundberg@kau.se">Simon.Sundberg@kau.se</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi, thanks for adding me to the conversation. Just a couple of quick<br>

notes.<br>

<br>

On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:<br>

> [ Adding Simon to Cc ]<br>

><br>

> Herbert Wolverson via LibreQoS <<a href="mailto:libreqos@lists.bufferbloat.net" target="_blank">libreqos@lists.bufferbloat.net</a>> writes:<br>

><br>

> > Hey,<br>

> ><br>

> > I've had some pretty good success with merging xdp-pping (<br>

> > <a href="https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h" rel="noreferrer" target="_blank">https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h</a> )<br>

> > into xdp-cpumap-tc ( <a href="https://github.com/xdp-project/xdp-cpumap-tc" rel="noreferrer" target="_blank">https://github.com/xdp-project/xdp-cpumap-tc</a> ).<br>

> ><br>

> > I ported over most of the xdp-pping code, and then changed the entry point<br>

> > and packet parsing code to make use of the work already done in<br>

> > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do<br>

> > it twice). Then I switched the maps to per-cpu maps, and had to pin them -<br>

> > otherwise the two tc instances don't properly share data.<br>

> ><br>

<br>

I guess the xdp-cpumap-tc ensures that the same flow is processed on<br>

the same CPU core at both ingress and egress. Otherwise, if a flow may<br>

be processed by different cores on ingress and egress the per-CPU maps<br>

will not really work reliably as each core will have a different view<br>

on the state of the flow, if there's been a previous packet with a<br>

certain TSval from that flow etc.<br>

<br>

Furthermore, if a flow is always processed on the same core (on both<br>

ingress and egress) I think per-CPU maps may be a bit wasteful on<br>

memory. From my understanding the keys for per-CPU maps are still<br>

shared across all CPUs, it's just that each CPU gets its own value. So<br>

all CPUs will then have their own data for each flow, but it's only the<br>

CPU processing the flow that will have any relevant data for the flow<br>

while the remaining CPUs will just have an empty state for that flow.<br>

Under the same assumption that packets within the same flow are always<br>

processed on the same core there should generally not be any<br>

concurrency issues with having a global (non-per-CPU) map either, as packets<br>

from the same flow cannot be processed concurrently then (and thus no<br>

concurrent access to the same value in the map). I am however still<br>

unclear on whether there's any considerable performance difference between<br>

global and per-CPU map versions if the same key is not accessed<br>

concurrently.<br>
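<br>
To make the memory argument above concrete, here is a toy model in plain Python (not BPF). It assumes, as described above, that a per-CPU map shares its keys and reserves one value per CPU for each key. The class names, the 64-byte value size and the modulo-based flow pinning are all made up for illustration.<br>

```python
# Toy model of the trade-off described above. Plain Python, not BPF:
# the class names and the 64-byte value size are illustrative assumptions.

VALUE_SIZE = 64  # assumed size in bytes of one flow-state value


class PerCpuMap:
    """Keys are shared; every key reserves one value slot per CPU."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.slots = {}

    def update(self, cpu, key, value):
        # Creating a key allocates a slot for every CPU, used or not.
        per_cpu = self.slots.setdefault(key, [None] * self.num_cpus)
        per_cpu[cpu] = value

    def lookup(self, cpu, key):
        per_cpu = self.slots.get(key)
        return None if per_cpu is None else per_cpu[cpu]

    def value_bytes(self):
        return len(self.slots) * self.num_cpus * VALUE_SIZE


class GlobalMap:
    """One value per key, visible to every CPU."""

    def __init__(self):
        self.slots = {}

    def update(self, cpu, key, value):
        self.slots[key] = value

    def lookup(self, cpu, key):
        return self.slots.get(key)

    def value_bytes(self):
        return len(self.slots) * VALUE_SIZE


def cpu_for_flow(flow_id, num_cpus):
    # Hypothetical stand-in for the flow-to-CPU pinning assumed in the text.
    return flow_id % num_cpus


NUM_CPUS = 4
per_cpu_map = PerCpuMap(NUM_CPUS)
global_map = GlobalMap()
for flow_id in range(1000):
    cpu = cpu_for_flow(flow_id, NUM_CPUS)
    per_cpu_map.update(cpu, flow_id, {"last_tsval": flow_id})
    global_map.update(cpu, flow_id, {"last_tsval": flow_id})

print(per_cpu_map.value_bytes())  # 1000 keys * 4 CPUs * 64 bytes
print(global_map.value_bytes())   # 1000 keys * 64 bytes
print(per_cpu_map.lookup(1, 0))   # flow 0 lives on CPU 0, so CPU 1 sees no state
```

In this model the per-CPU variant spends NUM_CPUS times the value memory, and a core that does not own the flow only ever sees an empty slot. It says nothing about the relative lookup cost of the two map types, which is the open question above.<br>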

<br>

> > Right now, output<br>

> > is just stubbed - I've still got to port the perfmap output code. Instead,<br>

> > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see<br>

> > roughly what the output would look like.<br>

> ><br>

> > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on<br>

> > single-stream iperf between two VMs (with a shaper VM in the middle). :-)<br>

><br>

> Just FYI, that "just logging" is probably the biggest source of<br>

> overhead, then. What Simon found was that sending the data from kernel<br>

> to userspace is one of the most expensive bits of epping, at least when<br>

> the number of data points goes up (which it does as additional flows are<br>

> added).<br>

<br>

Yeah, reporting individual RTTs when there's lots of them (you may get<br>

upwards of 1000 RTTs/s per flow) is not only problematic in terms of<br>

direct overhead from the tool itself, but also becomes demanding for<br>

whatever you use all those RTT samples for (i.e. need to log, parse,<br>

analyze etc. a very large amount of RTTs). One way to deal with that is<br>

of course to just apply some sort of sampling (the -r/--rate-limit and<br>

-R/--rtt-rate<br>

><br>

> > So my question: how would you prefer to receive this data? I'll have to<br>

> > write a daemon that provides userspace control (periodic cleanup as well as<br>

> > reading the performance stream), so the world's kinda our oyster. I can<br>

> > stick to Kathie's original format (and dump it to a named pipe, perhaps?),<br>

> > a condensed format that only shows what you want to use, an efficient<br>

> > binary format if you feel like parsing that...<br>

><br>

> It would be great if we could combine efforts a bit here so we don't<br>

> fork the codebase more than we have to. I.e., if "upstream" epping and<br>

> whatever daemon you end up writing can agree on data format etc that<br>

> would be fantastic! Added Simon to Cc to facilitate this :)<br>

><br>

> Briefly what I've discussed before with Simon was to have the ability to<br>

> aggregate the metrics in the kernel (WiP PR [0]) and have a userspace<br>

> utility periodically pull them out. What we discussed was doing this<br>

> using an LPM map (which is not in that PR yet). The idea would be that<br>

> userspace would populate the LPM map with the keys (prefixes) they<br>

> wanted statistics for (in LibreQoS context that could be one key per<br>

> customer, for instance). Epping would then do a map lookup into the LPM,<br>

> and if it gets a match it would update the statistics in that map entry<br>

> (keeping a histogram of latency values seen, basically). Simon's PR<br>

> below uses this technique where userspace will "reset" the histogram<br>

> every time it loads it by swapping out two different map entries when it<br>

> does a read; this allows you to control the sampling rate from<br>

> userspace, and you'll just get the data since the last time you polled.<br>
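<br>
As a rough illustration of the scheme described above, here is a plain-Python model of prefix-keyed aggregation with a two-slot histogram that the reader swaps on every poll. It is not the code from the PR and does not use real BPF maps. The class names, the power-of-two bucket layout and the example prefixes are assumptions made for the sketch.<br>

```python
import ipaddress

# Assumed layout: bucket i counts RTTs from 2**i up to (not including)
# 2**(i+1) milliseconds; the last bucket is open-ended.
NUM_BUCKETS = 8


def bucket_for(rtt_ms):
    # bit_length() - 1 is floor(log2(n)) for n of at least 1.
    whole_ms = max(int(rtt_ms), 1)
    return min(whole_ms.bit_length() - 1, NUM_BUCKETS - 1)


class PrefixStats:
    """Two histograms per prefix; only the active one receives updates."""

    def __init__(self):
        self.histograms = [[0] * NUM_BUCKETS, [0] * NUM_BUCKETS]
        self.active = 0


class AggregationMap:
    """Stand-in for an LPM map populated by userspace with prefixes."""

    def __init__(self):
        self.entries = {}  # ip_network to PrefixStats

    def add_prefix(self, cidr):
        self.entries[ipaddress.ip_network(cidr)] = PrefixStats()

    def longest_match(self, addr):
        ip = ipaddress.ip_address(addr)
        matches = [net for net in self.entries if ip in net]
        if not matches:
            return None
        return self.entries[max(matches, key=lambda net: net.prefixlen)]

    def record_rtt(self, addr, rtt_ms):
        # "Kernel side": update the histogram only if a prefix matches.
        stats = self.longest_match(addr)
        if stats is None:
            return False
        stats.histograms[stats.active][bucket_for(rtt_ms)] += 1
        return True

    def read_and_reset(self, cidr):
        # "Userspace side": flip the active slot, then drain the old one.
        stats = self.entries[ipaddress.ip_network(cidr)]
        old = stats.active
        stats.active = 1 - old
        snapshot = list(stats.histograms[old])
        stats.histograms[old] = [0] * NUM_BUCKETS
        return snapshot


agg = AggregationMap()
agg.add_prefix("10.0.0.0/8")
agg.add_prefix("10.1.0.0/16")  # e.g. one prefix per customer
agg.record_rtt("10.1.2.3", 5)
agg.record_rtt("10.1.2.4", 300)
agg.record_rtt("10.9.9.9", 0.5)
unmatched = agg.record_rtt("192.168.1.1", 5)

first_poll = agg.read_and_reset("10.1.0.0/16")
second_poll = agg.read_and_reset("10.1.0.0/16")
print(first_poll)
print(second_poll)
```

Each poll returns only the samples recorded since the previous poll, so the polling interval sets the effective sampling rate. A real implementation would also have to deal with updates that race with the swap, which this single-threaded model ignores.<br>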

<br>

Thanks, Toke, for summarizing both the current state and the plan going<br>

forward. I will just note that this PR (and all my other work with<br>

ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less<br>

on hold for a couple of weeks right now as I'm trying to finish up a<br>

paper.<br>

<br>

> I was thinking that if we all can agree on the map format, then your<br>

> polling daemon could be one userspace "client" for that, and the epping<br>

> binary itself could be another; but we could keep compatibility between<br>

> the two, so we don't duplicate effort.<br>

><br>

> Similarly, refactoring of the epping code itself so it can be plugged<br>

> into the cpumap-tc code would be a good goal...<br>

<br>

Should probably do that...at some point. In general I think it's a bit<br>

of an interesting problem to think about how to chain multiple XDP/tc<br>

programs together in an efficient way. Most XDP and tc programs will do<br>

some amount of packet parsing and when you have many chained programs<br>

parsing the same packets this obviously becomes a bit wasteful. At the<br>

same time it would be nice if one didn't need to manually merge<br>

multiple programs together into a single one like this to get rid of<br>

this duplicated parsing, or at least make that process of merging those<br>

programs as simple as possible.<br>
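<br>
One way to picture the "parse once, share the result" idea is the plain-Python sketch below, where a single parsed-packet object is handed to every stage in a chain. This only illustrates the shape of the problem and is not how XDP or tc programs are chained in practice. The stage functions, the state dictionary and the hand-built test packet are all hypothetical.<br>

```python
import socket
import struct

IPV4_HEADER = "!BBHHHBBH4s4s"  # fixed 20-byte IPv4 header, no options


def build_packet(src, dst, sport, dport):
    # Minimal IPv4 header followed by the first 4 bytes of a TCP header.
    ip_header = struct.pack(IPV4_HEADER, 0x45, 0, 24, 0, 0, 64, 6, 0,
                            socket.inet_aton(src), socket.inet_aton(dst))
    return ip_header + struct.pack("!HH", sport, dport)


class ParsedPacket:
    """Result of the one and only parse, shared by all stages."""

    def __init__(self, raw):
        fields = struct.unpack_from(IPV4_HEADER, raw, 0)
        header_len = (fields[0] % 16) * 4  # IHL: low nibble, in 32-bit words
        self.proto = fields[6]
        self.src = socket.inet_ntoa(fields[8])
        self.dst = socket.inet_ntoa(fields[9])
        self.sport, self.dport = struct.unpack_from("!HH", raw, header_len)


def classify(pkt, state):
    # Hypothetical shaper stage: count packets per destination address.
    hits = state["class_hits"]
    hits[pkt.dst] = hits.get(pkt.dst, 0) + 1


def track_flow(pkt, state):
    # Hypothetical monitoring stage: remember which flows were seen.
    state["flows"].add((pkt.src, pkt.sport, pkt.dst, pkt.dport))


def run_chain(raw, stages, state):
    pkt = ParsedPacket(raw)  # parse once
    state["parses"] += 1
    for stage in stages:     # every stage reuses the same parse result
        stage(pkt, state)


state = {"parses": 0, "class_hits": {}, "flows": set()}
raw = build_packet("10.0.0.1", "10.0.0.2", 40000, 5201)
run_chain(raw, [classify, track_flow], state)
print(state["parses"], state["class_hits"], state["flows"])
```

Both stages run, but the header is unpacked only once. The hard part in the real setting is agreeing on what the shared parse result looks like across independently written programs.<br>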

<br>

<br>

> -Toke<br>

><br>

> [0] <a href="https://github.com/xdp-project/bpf-examples/pull/59" rel="noreferrer" target="_blank">https://github.com/xdp-project/bpf-examples/pull/59</a><br>

<br>


</blockquote></div>