I have no doubt that logging is the biggest slow-down, followed by some dumb things (e.g. I just significantly increased performance by not accidentally copying addresses twice...). I'm honestly pleasantly surprised by how performant the debug logging is!

In the short term, this is a fork. I'm not planning on keeping it that way, but I'm early enough into the task that I need the freedom to really mess things up without upsetting upstream. ;-) At some point very soon, I'll post a temporary GitHub repo with the hacked and messy version in it, with a view to getting more eyes on it before it transforms into something more generally useful. I'm currently cleaning up the more embarrassing "written in a hurry" code.

The per-stream RTT buffer looks great. I'll definitely try to use that. I was a little alarmed to discover that running clean-up on the kernel side is practically impossible, making a management daemon a necessity (since the XDP mapping is long-running, the packet timing is likely to be running whether or not LibreQOS is actively reading from it). A ready-summarized buffer format makes a LOT of sense. At least until I run out of memory. ;-)

Thanks,
Herbert

On Mon, Oct 17, 2022 at 9:13 AM Toke Høiland-Jørgensen wrote:

> [ Adding Simon to Cc ]
>
> Herbert Wolverson via LibreQoS writes:
>
> > Hey,
> >
> > I've had some pretty good success with merging xdp-pping ( https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h ) into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
> >
> > I ported over most of the xdp-pping code, and then changed the entry point and packet parsing code to make use of the work already done in xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do it twice). Then I switched the maps to per-cpu maps, and had to pin them - otherwise the two tc instances don't properly share data. Right now, output is just stubbed - I've still got to port the perfmap output code. Instead, I'm dumping a bunch of extra data to the kernel debug pipe, so I can see roughly what the output would look like.
> >
> > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>
> Just FYI, that "just logging" is probably the biggest source of overhead, then. What Simon found was that sending the data from kernel to userspace is one of the most expensive bits of epping, at least when the number of data points goes up (which it does as additional flows are added).
>
> > So my question: how would you prefer to receive this data? I'll have to write a daemon that provides userspace control (periodic cleanup as well as reading the performance stream), so the world's kinda our oyster. I can stick to Kathie's original format (and dump it to a named pipe, perhaps?), a condensed format that only shows what you want to use, an efficient binary format if you feel like parsing that...
>
> It would be great if we could combine efforts a bit here so we don't fork the codebase more than we have to. I.e., if "upstream" epping and whatever daemon you end up writing can agree on data format etc that would be fantastic! Added Simon to Cc to facilitate this :)
>
> Briefly what I've discussed before with Simon was to have the ability to aggregate the metrics in the kernel (WiP PR [0]) and have a userspace utility periodically pull them out.
> What we discussed was doing this using an LPM map (which is not in that PR yet). The idea would be that userspace would populate the LPM map with the keys (prefixes) they wanted statistics for (in LibreQOS context that could be one key per customer, for instance). Epping would then do a map lookup into the LPM, and if it gets a match it would update the statistics in that map entry (keeping a histogram of latency values seen, basically). Simon's PR below uses this technique where userspace will "reset" the histogram every time it loads it by swapping out two different map entries when it does a read; this allows you to control the sampling rate from userspace, and you'll just get the data since the last time you polled.
>
> I was thinking that if we all can agree on the map format, then your polling daemon could be one userspace "client" for that, and the epping binary itself could be another; but we could keep compatibility between the two, so we don't duplicate effort.
>
> Similarly, refactoring of the epping code itself so it can be plugged into the cpumap-tc code would be a good goal...
>
> -Toke
>
> [0] https://github.com/xdp-project/bpf-examples/pull/59
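
P.S. For concreteness, the map-sharing part of what I described boils down to something like the sketch below (assuming a libbpf-style loader; the names, key type, and value layout are placeholders rather than the real pping structures):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Placeholder value struct - the real code keeps the pping
     * timestamp/flow state here. */
    struct flow_timing {
        __u64 last_seen_ns;
        __u64 rtt_ns;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
        __uint(max_entries, 65536);
        __type(key, __u32);                  /* placeholder flow key */
        __type(value, struct flow_timing);
        __uint(pinning, LIBBPF_PIN_BY_NAME); /* pinned under /sys/fs/bpf, so both
                                              * tc instances (and a daemon) open
                                              * the same map object */
    } flow_state SEC(".maps");

Without the pinning, each tc attachment ended up with its own private copy of the maps, which is why the two instances weren't sharing data.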
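
And here's my rough mental model of the LPM aggregation idea, mostly to check I've understood it - again just a sketch, with made-up names and an IPv4-only key for brevity; the real layout should come from Simon's PR:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define RTT_BUCKETS 32

    /* LPM keys must start with the prefix length, followed by the address. */
    struct lpm_key {
        __u32 prefixlen;
        __u32 addr;          /* IPv4 only here; the real thing needs v6 too */
    };

    struct rtt_hist {
        __u64 buckets[RTT_BUCKETS];   /* e.g. log2-spaced RTT buckets */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);  /* LPM tries require this */
        __uint(max_entries, 16384);
        __type(key, struct lpm_key);
        __type(value, struct rtt_hist);
    } lpm_stats SEC(".maps");

    /* Called from the RTT-matching path with a freshly computed RTT sample. */
    static __always_inline void record_rtt(__u32 saddr, __u64 rtt_ns)
    {
        struct lpm_key key = { .prefixlen = 32, .addr = saddr };
        struct rtt_hist *hist = bpf_map_lookup_elem(&lpm_stats, &key);
        __u64 us = rtt_ns / 1000;
        __u32 bucket = 0;

        if (!hist)
            return;  /* userspace didn't ask for stats on this prefix */

        while (us > 1 && bucket < RTT_BUCKETS - 1) {  /* crude log2 bucketing */
            us >>= 1;
            bucket++;
        }
        __sync_fetch_and_add(&hist->buckets[bucket], 1);
    }

Userspace (my daemon, or the epping binary) would populate lpm_stats with one entry per customer prefix and periodically read the histograms back out - or swap entries in and out, as in the PR, to get "since last poll" numbers.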