<div dir="ltr"><div>I have no doubt that logging is the biggest slow-down, followed by some dumb things (e.g. I just significantly <br></div><div>increased performance by not accidentally copying addresses twice...). I'm honestly pleasantly surprised</div><div>by how performant the debug logging is!</div><div><br></div><div>In the short term, this is a fork. I'm not planning on keeping it that way, but I'm early enough into the</div><div>task that I need the freedom to really mess things up without upsetting upstream. ;-) At some point very <br></div><div>soon, I'll post a temporary GitHub repo with the hacked and messy version in, with a view to getting <br></div><div>more eyes on it before it transforms into something more generally useful, and to cleaning up the more</div><div>embarrassing "written in a hurry" code.<br></div><div><br></div><div>The per-stream RTT buffer looks great. I'll definitely try to use that. I was a little alarmed to discover</div><div>that running clean-up on the kernel side is practically impossible, making a management daemon a</div><div>necessity (since the XDP mapping is long-running, the packet timing is likely to be running whether or</div><div>not LibreQOS is actively reading from it). A ready-summarized buffer format makes a LOT of sense.</div><div>At least until I run out of memory. ;-)</div><div><br></div><div>Thanks,</div><div>Herbert<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 17, 2022 at 9:13 AM Toke Høiland-Jørgensen &lt;<a href="mailto:toke@toke.dk">toke@toke.dk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">[ Adding Simon to Cc ]<br>

<br>

Herbert Wolverson via LibreQoS &lt;<a href="mailto:libreqos@lists.bufferbloat.net" target="_blank">libreqos@lists.bufferbloat.net</a>&gt; writes:<br>

<br>

> Hey,<br>

><br>

> I've had some pretty good success with merging xdp-pping (<br>

> <a href="https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h" rel="noreferrer" target="_blank">https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h</a> )<br>

> into xdp-cpumap-tc ( <a href="https://github.com/xdp-project/xdp-cpumap-tc" rel="noreferrer" target="_blank">https://github.com/xdp-project/xdp-cpumap-tc</a> ).<br>

><br>

> I ported over most of the xdp-pping code, and then changed the entry point<br>

> and packet parsing code to make use of the work already done in<br>

> xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do<br>

> it twice). Then I switched the maps to per-cpu maps, and had to pin them -<br>

> otherwise the two tc instances don't properly share data. Right now, output<br>

> is just stubbed - I've still got to port the perfmap output code. Instead,<br>

> I'm dumping a bunch of extra data to the kernel debug pipe, so I can see<br>

> roughly what the output would look like.<br>

><br>

> With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on<br>

> single-stream iperf between two VMs (with a shaper VM in the middle). :-)<br>

<br>

Just FYI, that "just logging" is probably the biggest source of<br>

overhead, then. What Simon found was that sending the data from kernel<br>

to userspace is one of the most expensive bits of epping, at least when<br>

the number of data points goes up (which it does as additional flows are<br>

added).<br>

<br>

> So my question: how would you prefer to receive this data? I'll have to<br>

> write a daemon that provides userspace control (periodic cleanup as well as<br>

> reading the performance stream), so the world's kinda our oyster. I can<br>

> stick to Kathie's original format (and dump it to a named pipe, perhaps?),<br>

> a condensed format that only shows what you want to use, an efficient<br>

> binary format if you feel like parsing that...<br>

<br>

It would be great if we could combine efforts a bit here so we don't<br>

fork the codebase more than we have to. I.e., if "upstream" epping and<br>

whatever daemon you end up writing can agree on data format etc that<br>

would be fantastic! Added Simon to Cc to facilitate this :)<br>

<br>

Briefly what I've discussed before with Simon was to have the ability to<br>

aggregate the metrics in the kernel (WiP PR [0]) and have a userspace<br>

utility periodically pull them out. What we discussed was doing this<br>

using an LPM map (which is not in that PR yet). The idea would be that<br>

userspace would populate the LPM map with the keys (prefixes) they<br>

wanted statistics for (in LibreQOS context that could be one key per<br>

customer, for instance). Epping would then do a map lookup into the LPM,<br>

and if it gets a match it would update the statistics in that map entry<br>

(keeping a histogram of latency values seen, basically). Simon's PR<br>

below uses this technique where userspace will "reset" the histogram<br>

every time it loads it by swapping out two different map entries when it<br>

does a read; this allows you to control the sampling rate from<br>

userspace, and you'll just get the data since the last time you polled.<br>
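To make the scheme above concrete, here is a rough userspace simulation in Python. It is only a sketch under stated assumptions, not the code from the PR: a real implementation would use a BPF LPM trie map updated from the kernel side, and the class names, the bucket layout and the index-flip used for the "swap" are all hypothetical choices made for illustration.

```python
import ipaddress

# Hypothetical bucket layout: bucket 0 holds RTTs below 2 ms, bucket i holds
# [2**i, 2**(i+1)) ms, and the last bucket is open-ended.
NUM_BUCKETS = 8


class PrefixStats:
    """Two histograms per prefix; 'active' selects the one being written."""

    def __init__(self):
        self.hist = [[0] * NUM_BUCKETS, [0] * NUM_BUCKETS]
        self.active = 0


class LatencyTable:
    """Stand-in for the LPM map: userspace adds prefixes, the data path updates them."""

    def __init__(self):
        self.entries = {}  # ip_network -> PrefixStats

    def add_prefix(self, cidr):
        # Userspace populates the keys it wants statistics for.
        self.entries[ipaddress.ip_network(cidr)] = PrefixStats()

    def _lookup(self, addr):
        # Linear longest-prefix match; a real LPM trie does this far more efficiently.
        ip = ipaddress.ip_address(addr)
        best = None
        for net in self.entries:
            if ip.version == net.version and ip in net:
                if best is None or net.prefixlen > best.prefixlen:
                    best = net
        return best

    def record_rtt(self, addr, rtt_ms):
        # Data-path side: update the histogram only if some prefix matches.
        net = self._lookup(addr)
        if net is None:
            return False
        stats = self.entries[net]
        bucket = min(max(int(rtt_ms), 1).bit_length() - 1, NUM_BUCKETS - 1)
        stats.hist[stats.active][bucket] += 1
        return True

    def read_and_reset(self, cidr):
        # Reader side: flip the active histogram, then return and clear the old
        # one, so each read yields only the data since the previous read.
        stats = self.entries[ipaddress.ip_network(cidr)]
        old = stats.active
        stats.active = 1 - old
        snapshot = list(stats.hist[old])
        stats.hist[old] = [0] * NUM_BUCKETS
        return snapshot
```

A polling daemon would call read_and_reset() on each prefix at whatever interval it likes, which is what lets userspace set the sampling rate. This single-threaded sketch ignores the concurrency between kernel writers and the userspace reader that a real implementation has to handle.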

<br>

I was thinking that if we all can agree on the map format, then your<br>

polling daemon could be one userspace "client" for that, and the epping<br>

binary itself could be another; but we could keep compatibility between<br>

the two, so we don't duplicate effort.<br>

<br>

Similarly, refactoring of the epping code itself so it can be plugged<br>

into the cpumap-tc code would be a good goal...<br>

<br>

-Toke<br>

<br>

[0] <a href="https://github.com/xdp-project/bpf-examples/pull/59" rel="noreferrer" target="_blank">https://github.com/xdp-project/bpf-examples/pull/59</a><br>

</blockquote></div>