From: Herbert Wolverson
Date: Mon, 17 Oct 2022 09:59:36 -0500
Subject: Re: [LibreQoS] In BPF pping - so far
Cc: libreqos@lists.bufferbloat.net, Simon Sundberg

I have no doubt that logging is the biggest slow-down, followed by some
dumb things (e.g. I just significantly increased performance by not
accidentally copying addresses twice...). I'm honestly pleasantly
surprised by how performant the debug logging is!

In the short term, this is a fork. I'm not planning on keeping it that
way, but I'm early enough into the task that I need the freedom to
really mess things up without upsetting upstream. ;-) At some point very
soon, I'll post a temporary GitHub repo with the hacked-up, messy
version in it, with a view to getting more eyes on it before it
transforms into something more generally useful, and to cleaning up the
more embarrassing "written in a hurry" code.

The per-stream RTT buffer looks great. I'll definitely try to use that.
I was a little alarmed to discover that running clean-up on the kernel
side is practically impossible, making a management daemon a necessity
(since the XDP maps are long-lived, the packet timing is likely to keep
running whether or not LibreQoS is actively reading from it). A
ready-summarized buffer format makes a LOT of sense. At least until I
run out of memory. ;-)

Thanks,
Herbert

On Mon, Oct 17, 2022 at 9:13 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> [ Adding Simon to Cc ]
>
> Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>
> > Hey,
> >
> > I've had some pretty good success with merging xdp-pping
> > ( https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
> >
> > I ported over most of the xdp-pping code, and then changed the entry
> > point and packet parsing code to make use of the work already done in
> > xdp-cpumap-tc (it has already parsed a big chunk of the packet, so
> > there's no need to do it twice). Then I switched the maps to per-CPU
> > maps, and had to pin them - otherwise the two tc instances don't
> > properly share data. Right now, output is just stubbed - I've still
> > got to port the perfmap output code. Instead, I'm dumping a bunch of
> > extra data to the kernel debug pipe, so I can see roughly what the
> > output would look like.
> >
> > With debug enabled and just logging, I'm now getting about 4.9
> > Gbits/sec on single-stream iperf between two VMs (with a shaper VM
> > in the middle). :-)
>
> Just FYI, that "just logging" is probably the biggest source of
> overhead, then. What Simon found was that sending the data from kernel
> to userspace is one of the most expensive bits of epping, at least
> when the number of data points goes up (which it does as additional
> flows are added).
>
> > So my question: how would you prefer to receive this data? I'll have
> > to write a daemon that provides userspace control (periodic cleanup
> > as well as reading the performance stream), so the world's kinda our
> > oyster. I can stick to Kathie's original format (and dump it to a
> > named pipe, perhaps?), a condensed format that only shows what you
> > want to use, an efficient binary format if you feel like parsing
> > that...
>
> It would be great if we could combine efforts a bit here so we don't
> fork the codebase more than we have to. I.e., if "upstream" epping and
> whatever daemon you end up writing can agree on a data format etc.,
> that would be fantastic! Added Simon to Cc to facilitate this :)
>
> Briefly, what I've discussed before with Simon was to have the ability
> to aggregate the metrics in the kernel (WiP PR [0]) and have a
> userspace utility periodically pull them out. What we discussed was
> doing this using an LPM map (which is not in that PR yet). The idea
> would be that userspace would populate the LPM map with the keys
> (prefixes) it wants statistics for (in a LibreQoS context that could
> be one key per customer, for instance). Epping would then do a map
> lookup into the LPM map, and if it gets a match it would update the
> statistics in that map entry (keeping a histogram of latency values
> seen, basically). Simon's PR below uses this technique: userspace
> "resets" the histogram every time it reads it by swapping out two
> different map entries on each read; this lets you control the sampling
> rate from userspace, and you just get the data since the last time you
> polled.
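
(Replying inline here.) To make sure I'm reading the LPM-map idea right,
I'm picturing something along these lines on the BPF side. This is only
a sketch built from your description, not the code from Simon's PR,
which I haven't read yet; every name in it (rtt_agg, rtt_hist,
record_rtt, HIST_BUCKETS, the max_entries value) is invented:

/* Sketch only, not the code from Simon's PR. All names are invented. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define HIST_BUCKETS 32          /* e.g. log2-spaced latency buckets */

struct lpm_key {
    __u32 prefixlen;             /* LPM-trie keys must start with prefixlen */
    __u32 addr;                  /* IPv4 only here; real epping also does v6 */
};

struct rtt_hist {
    __u64 samples;
    __u64 buckets[HIST_BUCKETS];
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(map_flags, BPF_F_NO_PREALLOC);    /* required for LPM tries */
    __uint(max_entries, 16384);
    __type(key, struct lpm_key);
    __type(value, struct rtt_hist);
    __uint(pinning, LIBBPF_PIN_BY_NAME);     /* so a daemon can find it */
} rtt_agg SEC(".maps");

/* Called from the pping path once an RTT sample has been computed. */
static __always_inline void record_rtt(__u32 daddr, __u64 rtt_ns)
{
    struct lpm_key key = { .prefixlen = 32, .addr = daddr };
    struct rtt_hist *hist = bpf_map_lookup_elem(&rtt_agg, &key);
    __u64 usec = rtt_ns / 1000;
    __u32 bucket = 0;

    if (!hist)
        return;                  /* prefix not configured by userspace */

    /* Crude log2 bucketing; bounded loop, so it needs a 5.3+ kernel. */
    while (usec > 1 && bucket < HIST_BUCKETS - 1) {
        usec >>= 1;
        bucket++;
    }
    if (bucket >= HIST_BUCKETS)
        bucket = HIST_BUCKETS - 1;

    __sync_fetch_and_add(&hist->samples, 1);
    __sync_fetch_and_add(&hist->buckets[bucket], 1);
}

If I'm reading the map types right there's no per-CPU flavour of the LPM
trie, so I've guessed at atomic adds on the shared entry; that's the
part I'd most like a sanity check on. The two-entry swap you describe
for resetting isn't shown above.
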
> I was thinking that if we all can agree on the map format, then your
> polling daemon could be one userspace "client" for that, and the
> epping binary itself could be another; but we could keep compatibility
> between the two, so we don't duplicate effort.
>
> Similarly, refactoring of the epping code itself so it can be plugged
> into the cpumap-tc code would be a good goal...
>
> -Toke
>
> [0] https://github.com/xdp-project/bpf-examples/pull/59
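
And on the daemon side I'm picturing something roughly like the below,
reusing the same made-up struct and map names as the sketch above; the
pin path under /sys/fs/bpf is also just an assumption:

/* Equally rough sketch of the polling side; names and pin path assumed. */
#include <stdio.h>
#include <arpa/inet.h>
#include <linux/types.h>
#include <bpf/bpf.h>

#define HIST_BUCKETS 32

struct lpm_key  { __u32 prefixlen; __u32 addr; };
struct rtt_hist { __u64 samples; __u64 buckets[HIST_BUCKETS]; };

int main(void)
{
    int fd = bpf_obj_get("/sys/fs/bpf/rtt_agg");   /* assumed pin path */
    struct lpm_key cur, next;
    struct rtt_hist hist;
    void *prev = NULL;
    char ip[INET_ADDRSTRLEN];

    if (fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    /* Walk every configured prefix and dump its sample count. A real
     * daemon would do this periodically, and would also do the
     * two-entry swap described above to reset histograms between
     * polls. */
    while (bpf_map_get_next_key(fd, prev, &next) == 0) {
        if (bpf_map_lookup_elem(fd, &next, &hist) == 0) {
            inet_ntop(AF_INET, &next.addr, ip, sizeof(ip));
            printf("%s/%u: %llu samples\n", ip, next.prefixlen,
                   (unsigned long long)hist.samples);
        }
        cur = next;
        prev = &cur;
    }
    return 0;
}

Does that roughly match what you and Simon had in mind? If so, agreeing
on the key/value layout is really the only contract we need between the
daemon and upstream epping.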