I have no doubt that logging is the biggest slow-down, followed by some dumb things (e.g. I just significantly increased performance by not accidentally copying addresses twice...). I'm honestly pleasantly surprised by how performant the debug logging is!

In the short term, this is a fork. I'm not planning on keeping it that way, but I'm early enough into the task that I need the freedom to really mess things up without upsetting upstream. ;-) At some point very soon, I'll post a temporary GitHub repo with the hacked and messy version in it, with a view to getting more eyes on it before it transforms into something more generally useful. I'm currently cleaning up the more embarrassing "written in a hurry" code.

The per-stream RTT buffer looks great. I'll definitely try to use that. I was a little alarmed to discover that running clean-up on the kernel side is practically impossible, making a management daemon a necessity (since the XDP mapping is long-running, the packet timing is likely to be running whether or not LibreQOS is actively reading from it). A ready-summarized buffer format makes a LOT of sense. At least until I run out of memory. ;-)

Thanks,
Herbert

On Mon, Oct 17, 2022 at 9:13 AM Toke Høiland-Jørgensen wrote:

> [ Adding Simon to Cc ]
>
> Herbert Wolverson via LibreQoS writes:
>
> > Hey,
> >
> > I've had some pretty good success with merging xdp-pping ( https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h ) into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
> >
> > I ported over most of the xdp-pping code, and then changed the entry point and packet parsing code to make use of the work already done in xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do it twice). Then I switched the maps to per-cpu maps, and had to pin them - otherwise the two tc instances don't properly share data. Right now, output is just stubbed - I've still got to port the perfmap output code. Instead, I'm dumping a bunch of extra data to the kernel debug pipe, so I can see roughly what the output would look like.
> >
> > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>
> Just FYI, that "just logging" is probably the biggest source of overhead, then. What Simon found was that sending the data from kernel to userspace is one of the most expensive bits of epping, at least when the number of data points goes up (which it does as additional flows are added).
>
> > So my question: how would you prefer to receive this data? I'll have to write a daemon that provides userspace control (periodic cleanup as well as reading the performance stream), so the world's kinda our oyster. I can stick to Kathie's original format (and dump it to a named pipe, perhaps?), a condensed format that only shows what you want to use, an efficient binary format if you feel like parsing that...
>
> It would be great if we could combine efforts a bit here so we don't fork the codebase more than we have to. I.e., if "upstream" epping and whatever daemon you end up writing can agree on data format etc that would be fantastic! Added Simon to Cc to facilitate this :)
>
> Briefly what I've discussed before with Simon was to have the ability to aggregate the metrics in the kernel (WiP PR [0]) and have a userspace utility periodically pull them out.
> What we discussed was doing this using an LPM map (which is not in that PR yet). The idea would be that userspace would populate the LPM map with the keys (prefixes) they wanted statistics for (in LibreQOS context that could be one key per customer, for instance). Epping would then do a map lookup into the LPM, and if it gets a match it would update the statistics in that map entry (keeping a histogram of latency values seen, basically). Simon's PR below uses this technique where userspace will "reset" the histogram every time it loads it by swapping out two different map entries when it does a read; this allows you to control the sampling rate from userspace, and you'll just get the data since the last time you polled.
>
> I was thinking that if we all can agree on the map format, then your polling daemon could be one userspace "client" for that, and the epping binary itself could be another; but we could keep compatibility between the two, so we don't duplicate effort.
>
> Similarly, refactoring of the epping code itself so it can be plugged into the cpumap-tc code would be a good goal...
>
> -Toke
>
> [0] https://github.com/xdp-project/bpf-examples/pull/59
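
P.S. For concreteness, the map-sharing part of what I described boils down to something like the sketch below (assuming a libbpf-style loader; the names, key type, and value layout are placeholders rather than the real pping structures):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Placeholder value struct - the real code keeps the pping
     * timestamp/flow state here. */
    struct flow_timing {
        __u64 last_seen_ns;
        __u64 rtt_ns;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
        __uint(max_entries, 65536);
        __type(key, __u32);                  /* placeholder flow key */
        __type(value, struct flow_timing);
        __uint(pinning, LIBBPF_PIN_BY_NAME); /* pinned under /sys/fs/bpf, so both
                                              * tc instances (and a daemon) open
                                              * the same map object */
    } flow_state SEC(".maps");

Without the pinning, each tc attachment ended up with its own private copy of the maps, which is why the two instances weren't sharing data.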
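
And here's my rough mental model of the LPM aggregation idea, mostly to check I've understood it - again just a sketch, with made-up names and an IPv4-only key for brevity; the real layout should come from Simon's PR:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define RTT_BUCKETS 32

    /* LPM keys must start with the prefix length, followed by the address. */
    struct lpm_key {
        __u32 prefixlen;
        __u32 addr;          /* IPv4 only here; the real thing needs v6 too */
    };

    struct rtt_hist {
        __u64 buckets[RTT_BUCKETS];   /* e.g. log2-spaced RTT buckets */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);  /* LPM tries require this */
        __uint(max_entries, 16384);
        __type(key, struct lpm_key);
        __type(value, struct rtt_hist);
    } lpm_stats SEC(".maps");

    /* Called from the RTT-matching path with a freshly computed RTT sample. */
    static __always_inline void record_rtt(__u32 saddr, __u64 rtt_ns)
    {
        struct lpm_key key = { .prefixlen = 32, .addr = saddr };
        struct rtt_hist *hist = bpf_map_lookup_elem(&lpm_stats, &key);
        __u64 us = rtt_ns / 1000;
        __u32 bucket = 0;

        if (!hist)
            return;  /* userspace didn't ask for stats on this prefix */

        while (us > 1 && bucket < RTT_BUCKETS - 1) {  /* crude log2 bucketing */
            us >>= 1;
            bucket++;
        }
        __sync_fetch_and_add(&hist->buckets[bucket], 1);
    }

Userspace (my daemon, or the epping binary) would populate lpm_stats with one entry per customer prefix and periodically read the histograms back out - or swap entries in and out, as in the PR, to get "since last poll" numbers.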