* [LibreQoS] In BPF pping - so far
@ 2022-10-16  1:59 Herbert Wolverson
  2022-10-16  2:26 ` Robert Chacón
  2022-10-17 14:13 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-16  1:59 UTC (permalink / raw)
  To: libreqos

Hey,

I've had some pretty good success with merging xdp-pping (
https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).

I ported over most of the xdp-pping code, and then changed the entry point
and packet parsing code to make use of the work already done in
xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
it twice). Then I switched the maps to per-cpu maps, and had to pin them -
otherwise the two tc instances don't properly share data. Right now, output
is just stubbed - I've still got to port the perfmap output code. Instead,
I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
roughly what the output would look like.
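
For anyone following along, here is a minimal sketch of what a pinned,
BTF-defined per-CPU map looks like in libbpf-style BPF C - the names and
the key/value structs are placeholders, not the actual cpumap-pping
definitions:

/* Sketch only: placeholder key/value types, not the real cpumap-pping maps.
 * LIBBPF_PIN_BY_NAME pins the map under the BPF filesystem, so the ingress
 * and egress tc programs open the same map instead of each getting a
 * private copy. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
};

struct flow_state {
    __u64 last_tsval;
    __u64 last_seen_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
    __uint(pinning, LIBBPF_PIN_BY_NAME);  /* shared across both tc loads */
} flow_state_map SEC(".maps");

char _license[] SEC("license") = "GPL";

With libbpf, loading the second program finds the existing pin and reuses
it rather than creating a fresh map, which is what lets the two tc
instances see the same data.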

With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
single-stream iperf between two VMs (with a shaper VM in the middle). :-)

So my question: how would you prefer to receive this data? I'll have to
write a daemon that provides userspace control (periodic cleanup as well as
reading the performance stream), so the world's kinda our oyster. I can
stick to Kathie's original format (and dump it to a named pipe, perhaps?),
a condensed format that only shows what you want to use, an efficient
binary format if you feel like parsing that...

(I'll post some code soon, getting sleepy)

Thanks,
Herbert

* Re: [LibreQoS] In BPF pping - so far
  2022-10-16  1:59 [LibreQoS] In BPF pping - so far Herbert Wolverson
@ 2022-10-16  2:26 ` Robert Chacón
  2022-10-17 14:13 ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 20+ messages in thread
From: Robert Chacón @ 2022-10-16  2:26 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: libreqos

Hey Herbert,

Wow. Awesome work! How exciting. We may finally get highly scalable TCP
latency tracking in LibreQoS and BracketQoS.
Regarding how we receive the data, I suppose whatever is most efficient and
scalable for networks with high subscriber counts.
In v1.1 we were just parsing some data from the console output:

   - rtt1
   - IP address 1
   - IP address 2

I am a big fan of having some sort of JSON structure to pull info from.
What do you recommend here for optimal efficiency?

Thanks,
Robert

-- 
Robert Chacón
CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>

* Re: [LibreQoS] In BPF pping - so far
  2022-10-16  1:59 [LibreQoS] In BPF pping - so far Herbert Wolverson
  2022-10-16  2:26 ` Robert Chacón
@ 2022-10-17 14:13 ` Toke Høiland-Jørgensen
  2022-10-17 14:59   ` Herbert Wolverson
  2022-10-17 15:14   ` Simon Sundberg
  1 sibling, 2 replies; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-17 14:13 UTC (permalink / raw)
  To: Herbert Wolverson, libreqos; +Cc: Simon Sundberg

[ Adding Simon to Cc ]

Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:

> Hey,
>
> I've had some pretty good success with merging xdp-pping (
> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>
> I ported over most of the xdp-pping code, and then changed the entry point
> and packet parsing code to make use of the work already done in
> xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
> it twice). Then I switched the maps to per-cpu maps, and had to pin them -
> otherwise the two tc instances don't properly share data. Right now, output
> is just stubbed - I've still got to port the perfmap output code. Instead,
> I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
> roughly what the output would look like.
>
> With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
> single-stream iperf between two VMs (with a shaper VM in the middle). :-)

Just FYI, that "just logging" is probably the biggest source of
overhead, then. What Simon found was that sending the data from kernel
to userspace is one of the most expensive bits of epping, at least when
the number of data points goes up (which it does as additional flows are
added).

> So my question: how would you prefer to receive this data? I'll have to
> write a daemon that provides userspace control (periodic cleanup as well as
> reading the performance stream), so the world's kinda our oyster. I can
> stick to Kathie's original format (and dump it to a named pipe, perhaps?),
> a condensed format that only shows what you want to use, an efficient
> binary format if you feel like parsing that...

It would be great if we could combine efforts a bit here so we don't
fork the codebase more than we have to. I.e., if "upstream" epping and
whatever daemon you end up writing can agree on data format etc that
would be fantastic! Added Simon to Cc to facilitate this :)

Briefly what I've discussed before with Simon was to have the ability to
aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
utility periodically pull them out. What we discussed was doing this
using an LPM map (which is not in that PR yet). The idea would be that
userspace would populate the LPM map with the keys (prefixes) they
wanted statistics for (in LibreQOS context that could be one key per
customer, for instance). Epping would then do a map lookup into the LPM,
and if it gets a match it would update the statistics in that map entry
(keeping a histogram of latency values seen, basically). Simon's PR
below uses this technique where userspace will "reset" the histogram
every time it loads it by swapping out two different map entries when it
does a read; this allows you to control the sampling rate from
userspace, and you'll just get the data since the last time you polled.
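
To make that concrete, here is a rough sketch of the kind of LPM-keyed
histogram layout described above - purely illustrative, with made-up
names and a made-up bucket scheme, not the code from the PR:

/* Illustrative only: userspace inserts one LPM prefix per customer; the
 * BPF side looks up the packet's address and updates a latency histogram
 * in the matched entry. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define RTT_BUCKETS 32

struct lpm_v4_key {
    __u32 prefixlen;            /* required first field for LPM_TRIE keys */
    __u32 addr;                 /* IPv4 address, network byte order */
};

struct rtt_hist {
    __u64 bucket[RTT_BUCKETS];  /* roughly log2(rtt in us) buckets */
    __u64 count;
    __u64 sum_us;
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(map_flags, BPF_F_NO_PREALLOC); /* LPM tries require this */
    __uint(max_entries, 16384);
    __type(key, struct lpm_v4_key);
    __type(value, struct rtt_hist);
} rtt_by_prefix SEC(".maps");

static __always_inline void record_rtt(__u32 addr, __u64 rtt_us)
{
    struct lpm_v4_key key = { .prefixlen = 32, .addr = addr };
    struct rtt_hist *h = bpf_map_lookup_elem(&rtt_by_prefix, &key);
    __u32 b = 0;

    if (!h)
        return;                 /* no customer prefix covers this address */

    while (b < RTT_BUCKETS - 1 && (1ULL << b) < rtt_us)
        b++;
    __sync_fetch_and_add(&h->bucket[b], 1);
    __sync_fetch_and_add(&h->count, 1);
    __sync_fetch_and_add(&h->sum_us, rtt_us);
}

char _license[] SEC("license") = "GPL";

Userspace would then read one rtt_hist per customer prefix and, with the
two-entry swap described above, reset it between polls.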

I was thinking that if we all can agree on the map format, then your
polling daemon could be one userspace "client" for that, and the epping
binary itself could be another; but we could keep compatibility between
the two, so we don't duplicate effort.

Similarly, refactoring of the epping code itself so it can be plugged
into the cpumap-tc code would be a good goal...

-Toke

[0] https://github.com/xdp-project/bpf-examples/pull/59

* Re: [LibreQoS] In BPF pping - so far
  2022-10-17 14:13 ` Toke Høiland-Jørgensen
@ 2022-10-17 14:59   ` Herbert Wolverson
  2022-10-17 15:14   ` Simon Sundberg
  1 sibling, 0 replies; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-17 14:59 UTC (permalink / raw)
  Cc: libreqos, Simon Sundberg

I have no doubt that logging is the biggest slow-down, followed by some
dumb things (e.g. I just significantly
increased performance by not accidentally copying addresses twice...). I'm
honestly pleasantly surprised
by how performant the debug logging is!

In the short-term, this is a fork. I'm not planning on keeping it that way,
but I'm early enough into the
task that I need the freedom to really mess things up without upsetting
upstream. ;-) At some point very
soon, I'll post a temporary GitHub repo with the hacked and messy version
in, with a view to getting more eyes on it before it transforms into
something more generally useful - and to cleaning up the more embarrassing
"written in a hurry" code.

The per-stream RTT buffer looks great. I'll definitely try to use that. I
was a little alarmed to discover
that running clean-up on the kernel side is practically impossible, making
a management daemon a
necessity (since the XDP mapping is long-running, the packet timing is
likely to be running whether or
not LibreQOS is actively reading from it). A ready-summarized buffer format
makes a LOT of sense.
At least until I run out of memory. ;-)
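
For the daemon side, the cleanup pass itself is straightforward from
userspace with the standard libbpf calls. The sketch below assumes a plain
(non-per-CPU) hash map, and the key/value layouts and pin path are made
up; only the libbpf functions are real:

/* Sketch of a periodic cleanup pass: walk a pinned hash map and delete
 * entries that have not been touched recently. A per-CPU map would instead
 * return one value per possible CPU on lookup. */
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <bpf/bpf.h>

struct flow_key   { unsigned int saddr, daddr; unsigned short sport, dport; };
struct flow_state { unsigned long long last_tsval, last_seen_ns; };

static unsigned long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Deletes entries older than max_age_ns; returns how many were removed. */
int cleanup_stale(const char *pin_path, unsigned long long max_age_ns)
{
    struct flow_key key, next, *stale;
    int fd, have_key = 0, n = 0, i;

    fd = bpf_obj_get(pin_path);  /* some path under /sys/fs/bpf (assumption) */
    if (fd < 0)
        return -1;
    stale = calloc(65536, sizeof(*stale));
    if (!stale) {
        close(fd);
        return -1;
    }

    /* Collect stale keys first, then delete them, so the deletions do not
     * disturb the get_next_key walk. */
    while (n < 65536 &&
           bpf_map_get_next_key(fd, have_key ? &key : NULL, &next) == 0) {
        struct flow_state st;

        if (bpf_map_lookup_elem(fd, &next, &st) == 0 &&
            now_ns() - st.last_seen_ns > max_age_ns)
            stale[n++] = next;
        key = next;
        have_key = 1;
    }
    for (i = 0; i < n; i++)
        bpf_map_delete_elem(fd, &stale[i]);

    free(stale);
    close(fd);
    return n;
}

Run from a timer, that covers the "periodic cleanup" half of the daemon;
reading the performance stream would be a separate loop.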

Thanks,
Herbert

* Re: [LibreQoS] In BPF pping - so far
  2022-10-17 14:13 ` Toke Høiland-Jørgensen
  2022-10-17 14:59   ` Herbert Wolverson
@ 2022-10-17 15:14   ` Simon Sundberg
  2022-10-17 18:45     ` Herbert Wolverson
  1 sibling, 1 reply; 20+ messages in thread
From: Simon Sundberg @ 2022-10-17 15:14 UTC (permalink / raw)
  To: toke, herberticus, libreqos

Hi, thanks for adding me to the conversation. Just a couple of quick
notes.

On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> [ Adding Simon to Cc ]
>
> Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>
> > Hey,
> >
> > I've had some pretty good success with merging xdp-pping (
> > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
> >
> > I ported over most of the xdp-pping code, and then changed the entry point
> > and packet parsing code to make use of the work already done in
> > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
> > it twice). Then I switched the maps to per-cpu maps, and had to pin them -
> > otherwise the two tc instances don't properly share data.
> >

I guess xdp-cpumap-tc ensures that the same flow is processed on the
same CPU core on both ingress and egress. Otherwise, if a flow may be
processed by different cores on ingress and egress, the per-CPU maps
will not really work reliably, as each core will have a different view
of the flow's state (whether there has been a previous packet with a
certain TSval from that flow, etc.).

Furthermore, if a flow is always processed on the same core (on both
ingress and egress) I think per-CPU maps may be a bit wasteful on
memory. From my understanding the keys for per-CPU maps are still
shared across all CPUs, it's just that each CPU gets its own value. So
all CPUs will then have their own data for each flow, but it's only the
CPU processing the flow that will have any relevant data for the flow
while the remaining CPUs will just have an empty state for that flow.
Under the same assumption that packets within the same flow are always
processed on the same core, there should generally not be any
concurrency issues with having a global (non-per-CPU) map either, as
packets from the same flow cannot be processed concurrently (and thus
there is no concurrent access to the same value in the map). I am,
however, still unclear on whether there is any considerable performance
difference between the global and per-CPU map versions when the same
key is not accessed concurrently.
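
For reference, the difference being discussed is just the map type in the
definition; a sketch of the two variants, with placeholder key/value
structs:

/* Placeholder key/value structs; the point is only the map type.
 * A per-CPU hash allocates one value per possible CPU for every key, so
 * memory is roughly max_entries * value_size * nr_cpus, and each core only
 * ever sees its own copy. A global hash allocates one value per key and is
 * safe without extra locking only if a given flow always stays on one core. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key   { __u32 saddr, daddr; __u16 sport, dport; };
struct flow_state { __u64 last_tsval, last_seen_ns; };

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
} flow_state_percpu SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
} flow_state_global SEC(".maps");

char _license[] SEC("license") = "GPL";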

> > Right now, output
> > is just stubbed - I've still got to port the perfmap output code. Instead,
> > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
> > roughly what the output would look like.
> >
> > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
> > single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>
> Just FYI, that "just logging" is probably the biggest source of
> overhead, then. What Simon found was that sending the data from kernel
> to userspace is one of the most expensive bits of epping, at least when
> the number of data points goes up (which it does as additional flows are
> added).

Yeah, reporting individual RTTs when there's lots of them (you may get
upwards of 1000 RTTs/s per flow) is not only problematic in terms of
direct overhead from the tool itself, but also becomes demanding for
whatever you use all those RTT samples for (i.e. you need to log, parse,
analyze etc. a very large amount of RTTs). One way to deal with that is
of course to just apply some sort of sampling (the -r/--rate-limit and
-R/--rtt-rate options).
>
> > So my question: how would you prefer to receive this data? I'll have to
> > write a daemon that provides userspace control (periodic cleanup as well as
> > reading the performance stream), so the world's kinda our oyster. I can
> > stick to Kathie's original format (and dump it to a named pipe, perhaps?),
> > a condensed format that only shows what you want to use, an efficient
> > binary format if you feel like parsing that...
>
> It would be great if we could combine efforts a bit here so we don't
> fork the codebase more than we have to. I.e., if "upstream" epping and
> whatever daemon you end up writing can agree on data format etc that
> would be fantastic! Added Simon to Cc to facilitate this :)
>
> Briefly what I've discussed before with Simon was to have the ability to
> aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
> utility periodically pull them out. What we discussed was doing this
> using an LPM map (which is not in that PR yet). The idea would be that
> userspace would populate the LPM map with the keys (prefixes) they
> wanted statistics for (in LibreQOS context that could be one key per
> customer, for instance). Epping would then do a map lookup into the LPM,
> and if it gets a match it would update the statistics in that map entry
> (keeping a histogram of latency values seen, basically). Simon's PR
> below uses this technique where userspace will "reset" the histogram
> every time it loads it by swapping out two different map entries when it
> does a read; this allows you to control the sampling rate from
> userspace, and you'll just get the data since the last time you polled.

Thanks, Toke, for summarizing both the current state and the plan going
forward. I will just note that this PR (and all my other work with
ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
on hold for a couple of weeks right now as I'm trying to finish up a
paper.

> I was thinking that if we all can agree on the map format, then your
> polling daemon could be one userspace "client" for that, and the epping
> binary itself could be another; but we could keep compatibility between
> the two, so we don't duplicate effort.
>
> Similarly, refactoring of the epping code itself so it can be plugged
> into the cpumap-tc code would be a good goal...

Should probably do that... at some point. In general I think it's an
interesting problem to think about how to chain multiple XDP/tc
programs together in an efficient way. Most XDP and tc programs will do
some amount of packet parsing, and when you have many chained programs
parsing the same packets this obviously becomes a bit wasteful. At the
same time it would be nice if one didn't need to manually merge
multiple programs into a single one like this to get rid of the
duplicated parsing, or at least if that process of merging could be
made as simple as possible.
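
One existing mechanism that can help here (a sketch of the general
technique, not something epping or xdp-cpumap-tc do today): an XDP
program can stash its parse results in the packet's metadata area with
bpf_xdp_adjust_meta(), and a later program - including a tc classifier,
via data_meta - can read them back instead of re-parsing. The
parse_meta layout and the placeholder offsets below are made up:

/* Sketch: the first program parses once and records offsets in the
 * metadata area in front of the packet; a later program reads them back. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct parse_meta {
    __u16 l3_offset;   /* offset of the IP header */
    __u16 l4_offset;   /* offset of the L4 header */
    __u8  l4_proto;
    __u8  pad[3];
};

SEC("xdp")
int xdp_parse_once(struct xdp_md *ctx)
{
    struct parse_meta *meta;
    void *data;

    if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
        return XDP_PASS;                 /* no metadata support, just pass */

    data = (void *)(long)ctx->data;
    meta = (void *)(long)ctx->data_meta;
    if ((void *)(meta + 1) > data)       /* bounds check for the verifier */
        return XDP_PASS;

    /* A real program would parse the headers here. */
    meta->l3_offset = 14;                /* placeholder: Ethernet header */
    meta->l4_offset = 34;                /* placeholder: IPv4 + TCP      */
    meta->l4_proto  = 6;                 /* IPPROTO_TCP                  */
    return XDP_PASS;
}

SEC("tc")
int tc_use_parse(struct __sk_buff *skb)
{
    struct parse_meta *meta = (void *)(long)skb->data_meta;
    void *data = (void *)(long)skb->data;

    if ((void *)(meta + 1) > data)       /* no metadata attached */
        return TC_ACT_OK;

    /* meta->l4_offset etc. can now be used without re-parsing. */
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

Driver support for XDP metadata varies, though, so a chained setup would
still need a parse-it-yourself fallback.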


> -Toke
>
> [0] https://github.com/xdp-project/bpf-examples/pull/59


* Re: [LibreQoS] In BPF pping - so far
  2022-10-17 15:14   ` Simon Sundberg
@ 2022-10-17 18:45     ` Herbert Wolverson
  2022-10-17 20:33       ` Robert Chacón
  0 siblings, 1 reply; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-17 18:45 UTC (permalink / raw)
  To: Simon Sundberg; +Cc: libreqos

Hey,

My current (unfinished) progress on this is now available here:
https://github.com/thebracket/cpumap-pping-hackjob

I mean it about the warnings: this isn't at all stable or debugged, and I
can't promise that it won't unleash the nasal demons (to use a popular C++
phrase). The name is descriptive! ;-)

With that said, I'm pretty happy so far:

* It runs only on the classifier - which xdp-cpumap-tc has nicely shunted
onto a dedicated CPU. It has to run on both
  the inbound and outbound classifiers, since otherwise it would only see
half the conversation.
* It does assume that your ingress and egress CPUs are mapped to the same
interface; I do that anyway in BracketQoS. Not doing
  that opens up a potential world of pain, since writes to the shared maps
would require a locking scheme. Too much locking, and you lose all of the
benefit of using multiple CPUs to begin with.
* It is pretty wasteful of RAM, but most of the shaper systems I've worked
with have lots of it.
* I've been gradually removing features that I don't want for BracketQoS. A
hypothetical future "useful to everyone" version wouldn't do that.
* Rate limiting is working, but I removed the requirement for a shared
configuration provided from userland - so right now it's always set to
report at 1-second intervals per stream (see the sketch below).
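
A rough sketch of that kind of per-flow report gate - the map name, the
layout and the flow identifier are made up, not the cpumap-pping code;
only the BPF helpers are real:

/* Sketch of a once-per-second-per-flow report gate. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define REPORT_INTERVAL_NS 1000000000ULL  /* 1 second */

struct flow_rate {
    __u64 last_report_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u64);                   /* some per-flow identifier */
    __type(value, struct flow_rate);
} report_gate SEC(".maps");

/* Returns 1 if this flow may emit a report now, 0 otherwise. */
static __always_inline int may_report(__u64 flow_id)
{
    __u64 now = bpf_ktime_get_ns();
    struct flow_rate *fr = bpf_map_lookup_elem(&report_gate, &flow_id);

    if (!fr) {
        struct flow_rate init = { .last_report_ns = now };

        bpf_map_update_elem(&report_gate, &flow_id, &init, BPF_ANY);
        return 1;                         /* first sighting: report once */
    }
    if (now - fr->last_report_ns < REPORT_INTERVAL_NS)
        return 0;
    fr->last_report_ns = now;
    return 1;
}

char _license[] SEC("license") = "GPL";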

My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and
a "shaper" VM in between running a slightly hacked-up LibreQoS.
iperf from "client" to "world" (with Libre set to allow 10 Gbit/s max, via a
cake/HTB queue setup) is around 5 Gbit/s at present, on my
test PC (the host is a Core i7, 12th gen, 12 cores - 64 GB RAM and fast SSDs).

Output currently consists of debug messages reading:
  cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc)
Flow open event
  cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc)
Send performance event (5,1), 374696
  cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc)
Flow open event
  cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc)
Send performance event (5,1), 247069
  cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc)
Send performance event (5,1), 5217155
  cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc)
Send performance event (5,1), 4515394
  cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc)
Send performance event (5,1), 4481289
  cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc)
Send performance event (5,1), 4255268
  cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc)
Send performance event (5,1), 5249493
  cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc)
Send performance event (5,1), 3795993
  cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc)
Send performance event (5,1), 3949519
  cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc)
Send performance event (5,1), 4365335
  cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc)
Send performance event (5,1), 4154910
  cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc)
Send performance event (5,1), 4405582
  cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc)
Send flow event
  cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc)
Send flow event

The times haven't been tweaked yet. The (5,1) is tc handle major/minor,
allocated by the xdp-cpumap parent.
I get pretty low latency between VMs; I'll set up a test with some
real-world data very soon.

I plan to keep hacking away, but feel free to take a peek.

Thanks,
Herbert

* Re: [LibreQoS] In BPF pping - so far
  2022-10-17 18:45     ` Herbert Wolverson
@ 2022-10-17 20:33       ` Robert Chacón
  2022-10-18 18:01         ` Herbert Wolverson
  0 siblings, 1 reply; 20+ messages in thread
From: Robert Chacón @ 2022-10-17 20:33 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: Simon Sundberg, libreqos

Hey Herbert,

Fantastic work! Super exciting to see this coming together, especially so
quickly.
I'll test it soon.
I understand and agree with your decision to omit certain features (ICMP
tracking, DNS tracking, etc.) to optimize performance for our use case. Like
you said, in order to combine the functionality without a performance hit,
merging the programs is sort of the only way right now. Otherwise there
would be a lot of redundancy and lost throughput for an ISP's use. Hopefully,
long term, there will be a way to keep all the projects working independently
but interoperably with a plugin system of some kind.

By the way, I'm making some headway on LibreQoS v1.3. Focusing on
optimizations for high sub counts (8000+ subs) as well as stateful changes
to the queue structure.
I'm working to set up a physical lab to test high throughput and high
client count scenarios.
When testing beyond ~32,000 filters we get "no space left on device" from
xdp-cpumap-tc, which I think relates to the bpf map size limitation you
mentioned. Maybe in the coming months we can take a look at that.

Anyway great work on the cpumap-pping program! Excited to see more on this.

Thanks,
Robert

-- 
Robert Chacón
CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>

* Re: [LibreQoS] In BPF pping - so far
  2022-10-17 20:33       ` Robert Chacón
@ 2022-10-18 18:01         ` Herbert Wolverson
  2022-10-19 13:44           ` Herbert Wolverson
  0 siblings, 1 reply; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-18 18:01 UTC (permalink / raw)
  Cc: libreqos

It's probably not entirely thread-safe right now (ran into some issues
reading per_cpu maps back from userspace; hopefully, I'll get that figured
out) - but the commits I just pushed have it basically working on
single-stream testing. :-)

Set up cpumap as usual, and periodically run xdp-pping. This gives you
per-connection RTT information in JSON:

[
{"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
{}]

(With the extra {} because I'm not tracking the tail and haven't done comma
removal). The tool also empties the various maps used to gather data,
acting as a "reset" point. There's a max of 60 samples per queue, in a
ringbuffer setup (so newest will start to overwrite the oldest).
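
For anyone curious what that looks like under the hood, here is a sketch
of the kind of per-queue ring buffer and the summary pass the reader does
to produce the fields above. The names, sizes and units are placeholders;
the real structures in the repo may well differ:

/* Sketch only: a 60-sample RTT ring per tc handle and its summary. */
#include <stdio.h>

#define RTT_RING_SIZE 60

struct rtt_ring {
    unsigned int next;                    /* next slot to overwrite       */
    unsigned int count;                   /* how many samples are valid   */
    unsigned int rtt[RTT_RING_SIZE];      /* newest overwrites the oldest */
};

/* Summarize one queue's samples the way the JSON above reports them. */
static void summarize(unsigned int major, unsigned int minor,
                      const struct rtt_ring *r)
{
    unsigned long long sum = 0;
    unsigned int min, max, i;

    if (r->count == 0)
        return;
    min = max = r->rtt[0];
    for (i = 0; i < r->count && i < RTT_RING_SIZE; i++) {
        unsigned int v = r->rtt[i];
        sum += v;
        if (v < min) min = v;
        if (v > max) max = v;
    }
    printf("{\"tc\":\"%u:%u\", \"avg\" : %llu, \"min\" : %u, \"max\" : %u, "
           "\"samples\" : %u},\n",
           major, minor, sum / r->count, min, max, r->count);
}

int main(void)
{
    /* Tiny demo with made-up samples, mimicking the output format above. */
    struct rtt_ring demo = { .next = 3, .count = 3, .rtt = { 4, 3, 5 } };

    printf("[\n");
    summarize(1, 5, &demo);
    printf("{}]\n");
    return 0;
}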

I'll start trying to test on a larger scale now.

On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
robert.chacon@jackrabbitwireless.com> wrote:

> Hey Herbert,
>
> Fantastic work! Super exciting to see this coming together, especially so
> quickly.
> I'll test it soon.
> I understand and agree with your decision to omit certain features (ICMP
> tracking,DNS tracking, etc) to optimize performance for our use case. Like
> you said, in order to merge the functionality without a performance hit,
> merging them is sort of the only way right now. Otherwise there would be a
> lot of redundancy and lost throughput for an ISP's use. Though hopefully
> long term there will be a way to keep all projects working independently
> but interoperably with a plugin system of some kind.
>
> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
> optimizations for high sub counts (8000+ subs) as well as stateful changes
> to the queue structure.
> I'm working to set up a physical lab to test high throughput and high
> client count scenarios.
> When testing beyond ~32,000 filters we get "no space left on device" from
> xdp-cpumap-tc, which I think relates to the bpf map size limitation you
> mentioned. Maybe in the coming months we can take a look at that.
>
> Anyway great work on the cpumap-pping program! Excited to see more on this.
>
> Thanks,
> Robert
>
> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
> libreqos@lists.bufferbloat.net> wrote:
>
>> Hey,
>>
>> My current (unfinished) progress on this is now available here:
>> https://github.com/thebracket/cpumap-pping-hackjob
>>
>> I mean it about the warnings, this isn't at all stable, debugged - and
>> can't promise that it won't unleash the nasal demons
>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>
>> With that said, I'm pretty happy so far:
>>
>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted
>> onto a dedicated CPU. It has to run on both
>>   the inbound and outbound classifiers, since otherwise it would only see
>> half the conversation.
>> * It does assume that your ingress and egress CPUs are mapped to the same
>> interface; I do that anyway in BracketQoS. Not doing
>>   that opens up a potential world of pain, since writes to the shared
>> maps would require a locking scheme. Too much locking, and you lose all of
>> the benefit of using multiple CPUs to begin with.
>> * It is pretty wasteful of RAM, but most of the shaper systems I've
>> worked with have lots of it.
>> * I've been gradually removing features that I don't want for BracketQoS.
>> A hypothetical future "useful to everyone" version wouldn't do that.
>> * Rate limiting is working, but I removed the requirement for a shared
>> configuration provided from userland - so right now it's always set to
>> report at 1 second intervals per stream.
>>
>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world",
>> and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via
>> a cake/HTB queue setup) is around 5 gbit/s at present, on my
>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and fast
>> SSDs)
>>
>> Output currently consists of debug messages reading:
>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc)
>> Flow open event
>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc)
>> Send performance event (5,1), 374696
>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc)
>> Flow open event
>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc)
>> Send performance event (5,1), 247069
>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc)
>> Send performance event (5,1), 5217155
>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc)
>> Send performance event (5,1), 4515394
>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc)
>> Send performance event (5,1), 4481289
>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc)
>> Send performance event (5,1), 4255268
>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc)
>> Send performance event (5,1), 5249493
>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc)
>> Send performance event (5,1), 3795993
>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc)
>> Send performance event (5,1), 3949519
>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc)
>> Send performance event (5,1), 4365335
>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc)
>> Send performance event (5,1), 4154910
>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc)
>> Send performance event (5,1), 4405582
>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc)
>> Send flow event
>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc)
>> Send flow event
>>
>> The times haven't been tweaked yet. The (5,1) is tc handle major/minor,
>> allocated by the xdp-cpumap parent.
>> I get pretty low latency between VMs; I'll set up a test with some
>> real-world data very soon.
>>
>> I plan to keep hacking away, but feel free to take a peek.
>>
>> Thanks,
>> Herbert
>>
>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se>
>> wrote:
>>
>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>>> notes.
>>>
>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>> > [ Adding Simon to Cc ]
>>> >
>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
>>> writes:
>>> >
>>> > > Hey,
>>> > >
>>> > > I've had some pretty good success with merging xdp-pping (
>>> > >
>>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>>> > >
>>> > > I ported over most of the xdp-pping code, and then changed the entry
>>> point
>>> > > and packet parsing code to make use of the work already done in
>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no
>>> need to do
>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin
>>> them -
>>> > > otherwise the two tc instances don't properly share data.
>>> > >
>>>
>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>>> the same CPU core at both ingress or egress. Otherwise, if a flow may
>>> be processed by different cores on ingress and egress the per-CPU maps
>>> will not really work reliably as each core will have a different view
>>> on the state of the flow, if there's been a previous packet with a
>>> certain TSval from that flow etc.
>>>
>>> Furthermore, if a flow is always processed on the same core (on both
>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>> memory. From my understanding the keys for per-CPU maps are still
>>> shared across all CPUs, it's just that each CPU gets its own value. So
>>> all CPUs will then have their own data for each flow, but it's only the
>>> CPU processing the flow that will have any relevant data for the flow
>>> while the remaining CPUs will just have an empty state for that flow.
>>> Under the same assumption that packets within the same flow are always
>>> processed on the same core there should generally not be any
>>> concurrency issues with having a global (non-per-CPU) either as packets
>>> from the same flow cannot be processed concurrently then (and thus no
>>> concurrent access to the same value in the map). I am however still
>>> very unclear on if there's any considerable performance impact between
>>> global and per-CPU map versions if the same key is not accessed
>>> concurrently.
>>>
>>> > > Right now, output
>>> > > is just stubbed - I've still got to port the perfmap output code.
>>> Instead,
>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I can
>>> see
>>> > > roughly what the output would look like.
>>> > >
>>> > > With debug enabled and just logging I'm now getting about 4.9
>>> Gbits/sec on
>>> > > single-stream iperf between two VMs (with a shaper VM in the
>>> middle). :-)
>>> >
>>> > Just FYI, that "just logging" is probably the biggest source of
>>> > overhead, then. What Simon found was that sending the data from kernel
>>> > to userspace is one of the most expensive bits of epping, at least when
>>> > the number of data points goes up (which is does as additional flows
>>> are
>>> > added).
>>>
>>> Yhea, reporting individual RTTs when there's lots of them (you may get
>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>>> direct overhead from the tool itself, but also becomes demanding for
>>> whatever you use all those RTT samples for (i.e. need to log, parse,
>>> analyze etc. a very large amount of RTTs). One way to deal with that is
>>> of course to just apply some sort of sampling (the -r/--rate-limit and
>>> -R/--rtt-rate
>>> >
>>> > > So my question: how would you prefer to receive this data? I'll have
>>> to
>>> > > write a daemon that provides userspace control (periodic cleanup as
>>> well as
>>> > > reading the performance stream), so the world's kinda our oyster. I
>>> can
>>> > > stick to Kathie's original format (and dump it to a named pipe,
>>> perhaps?),
>>> > > a condensed format that only shows what you want to use, an efficient
>>> > > binary format if you feel like parsing that...
>>> >
>>> > It would be great if we could combine efforts a bit here so we don't
>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>>> > whatever daemon you end up writing can agree on data format etc that
>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>> >
>>> > Briefly what I've discussed before with Simon was to have the ability
>>> to
>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>>> > utility periodically pull them out. What we discussed was doing this
>>> > using an LPM map (which is not in that PR yet). The idea would be that
>>> > userspace would populate the LPM map with the keys (prefixes) they
>>> > wanted statistics for (in LibreQOS context that could be one key per
>>> > customer, for instance). Epping would then do a map lookup into the
>>> LPM,
>>> > and if it gets a match it would update the statistics in that map entry
>>> > (keeping a histogram of latency values seen, basically). Simon's PR
>>> > below uses this technique where userspace will "reset" the histogram
>>> > every time it loads it by swapping out two different map entries when
>>> it
>>> > does a read; this allows you to control the sampling rate from
>>> > userspace, and you'll just get the data since the last time you polled.
>>>
>>> Thank's Toke for summarzing both the current state and the plan going
>>> forward. I will just note that this PR (and all my other work with
>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>>> on hold for a couple of weeks right now as I'm trying to finish up a
>>> paper.
>>>
>>> > I was thinking that if we all can agree on the map format, then your
>>> > polling daemon could be one userspace "client" for that, and the epping
>>> > binary itself could be another; but we could keep compatibility between
>>> > the two, so we don't duplicate effort.
>>> >
>>> > Similarly, refactoring of the epping code itself so it can be plugged
>>> > into the cpumap-tc code would be a good goal...
>>>
>>> Should probably do that...at some point. In general I think it's a bit
>>> of an interesting problem to think about how to chain multiple XDP/tc
>>> programs together in an efficient way. Most XDP and tc programs will do
>>> some amount of packet parsing and when you have many chained programs
>>> parsing the same packets this obviously becomes a bit wasteful. At the
>>> same time it would be nice if one didn't need to manually merge
>>> multiple programs together into a single one like this to get rid of
>>> this duplicated parsing, or at least make that process of merging those
>>> programs as simple as possible.
>>>
>>>
>>> > -Toke
>>> >
>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>
>>> När du skickar e-post till Karlstads universitet behandlar vi dina
>>> personuppgifter<https://www.kau.se/gdpr>.
>>> When you send an e-mail to Karlstad University, we will process your
>>> personal data<https://www.kau.se/en/gdpr>.
>>>
>> _______________________________________________
>> LibreQoS mailing list
>> LibreQoS@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/libreqos
>>
>
>
> --
> Robert Chacón
> CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>
>

[-- Attachment #2: Type: text/html, Size: 16572 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-18 18:01         ` Herbert Wolverson
@ 2022-10-19 13:44           ` Herbert Wolverson
  2022-10-19 13:58             ` Dave Taht
  0 siblings, 1 reply; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-19 13:44 UTC (permalink / raw)
  Cc: libreqos

[-- Attachment #1: Type: text/plain, Size: 19841 bytes --]

Hey,

Testing the current version (
https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
than I hoped. This build has shared (not per-cpu) maps, and a userspace
daemon (xdp_pping) to extract and reset stats.

My testing environment has grown a bit:
* ShaperVM - running Ubuntu Server and LibreQoS, with the new
cpumap-pping-hackjob version of xdp-cpumap.
* ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf server.
* ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hosts an
iperf client.
* ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hosts an
iperf client.

ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a
virtual switch.
ExtTest and the other interface (WAN facing) of ShaperVM are on a different
virtual switch.

These are all on a host machine running Windows 11 with a 12th-gen Core i7,
32 GB of RAM, and a fast SSD.

TEST 1: DUAL STREAMS, LOW THROUGHPUT

For this test, LibreQoS is configured:
* Two APs, each with a 5 Gbit/s max.
* 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about 100 Mbit/s.
They map to 1:5 and 2:5 respectively (separate CPUs).
* Set to use Cake

On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for
a long run). Running xdp_pping yields correct results:

[
{"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
{"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
{}]

Or when I waited a while to gather/reset:

[
{"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
{"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
{}]

The ShaperVM shows no errors, just periodic logging that it is recording
data.  CPU is about 2-3% on two CPUs, zero on the others (as expected).

After 500 seconds of continual iperfing, each client reported a throughput
of 104 Mbit/sec and 6.06 GBytes of data transmitted.

So for smaller streams, I'd call this a success.
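
Side note, since the question earlier in the thread was how best to consume
this data: the output above is plain JSON, so pulling it into Python (e.g.
for LibreQoS graphing) only takes a few lines. A rough sketch - assuming the
xdp_pping binary is on the path and the output keeps the shape shown above;
the trailing empty {} just gets filtered out:

import json
import subprocess

def read_xdp_pping(binary="xdp_pping"):
    # One snapshot of per-tc-handle RTT stats. Running the tool also
    # resets the kernel-side maps, so each call is "stats since the
    # last poll".
    out = subprocess.run([binary], capture_output=True, text=True, check=True)
    return [entry for entry in json.loads(out.stdout) if entry]

for e in read_xdp_pping():
    print(e["tc"], e["avg"], e["min"], e["max"], e["samples"])

(That's just the snapshot side - a rough polling loop is sketched after the
conclusion below.)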

TEST 2: DUAL STREAMS, HIGH THROUGHPUT

For this test, LibreQoS is configured:
* Two APs, each with a 5 Gbit/s max.
* 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5 Gbit/s! Mapped
to 1:5 and 2:5 respectively (separate CPUs).

Run iperf -c 100.64.1.1 -t 500 on each client at the same time.

xdp_pping shows results, too:

[
{"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
{"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
{}]

[
{"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
{"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
{}]

The ShaperVM shows two CPUs pegging between 70 and 90 percent.

After 500 seconds of continual iperfing, the clients reported throughputs of
2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes).

Maxing out Hyper-V like this induces a bit of latency (which is to be
expected), but it's not bad. I also forgot to disable hyperthreading, and
looking at the host performance, it sometimes schedules the second virtual
CPU onto an underpowered "fake" hyperthreaded core.

So for two large streams, I think we're doing pretty well also!

TEST 3: DUAL STREAMS, SINGLE CPU

This test is designed to try and blow things up. It's the same as test 2,
but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.

ShaperVM CPU1 maxed out in the high 90s while the other CPUs were idle. The
pping stats start to show a bit of performance degradation from pounding it
so hard:

[
{"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
{"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
{}]

For whatever reason, it smoothed out over time:

[
{"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
{"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
{}]

Surprisingly (to me), I didn't encounter errors. Each client got 2.22
Gbit/s, transferring over 129 GBytes of data.

TEST 4: DUAL STREAMS, 50 SUB-STREAMS

This test is also designed to break things. Same as test 3, but using iperf
-c 100.64.1.1 -P 50 -t 120 (50 substreams) to try and really tax the flow
tracking. (Shorter time window because I really wanted to go and find
coffee.)

ShaperVM CPU sat at around 80-97%, tending towards 97%. The pping results
show that this torture test is degrading performance, and there are always
lots of samples in the buffer:

[
{"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
{"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
{}]

This test also ran better than I expected. Each VM showed around 2.4
Gbit/s in total at the end of the iperf session. There's definitely some
latency creeping in as I make the system work hard, which is expected - but
I'm not sure I expected quite that much.

WHAT'S NEXT & CONCLUSION

I noticed that I forgot to turn off efficient power management on my VMs
and host, and left Hyperthreading on by mistake. So that hurts overall
performance.

The base system seems to be working pretty solidly, at least for small
tests. Next up, I'll be removing the extraneous debug reporting code and the
code paths that exist only to report, and looking for any small
optimization opportunities. I'll then re-run these tests. Once that's
done, I hope to find a maintenance window on my WISP and try it with actual
traffic.

I also need to re-run these tests without the pping system to provide some
before/after analysis.
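
For the eventual daemon / LibreQoS integration, the usage I have in mind
looks roughly like the sketch below: poll xdp_pping on a timer, accumulate
per-handle totals in userspace (since each poll also resets the kernel
maps), and weight the averages by the reported sample counts. With the
60-sample ring buffer and roughly one report per second per stream (see the
quoted notes below), polling somewhere under a minute apart should be fine
for lightly loaded queues; busier queues will want faster polling. The
interval and binary path here are placeholders, not anything final:

import json
import subprocess
import time
from collections import defaultdict

def poll_xdp_pping(binary="xdp_pping", interval=30):
    # Accumulates per-tc-handle RTT stats across polls. Each run of the
    # tool returns a snapshot AND clears the kernel-side maps, so the
    # long-term view has to be kept here in userspace.
    totals = defaultdict(lambda: {"n": 0, "avg_sum": 0.0, "min": None, "max": 0})
    while True:
        out = subprocess.run([binary], capture_output=True, text=True,
                             check=True)
        for e in (x for x in json.loads(out.stdout) if x):
            t = totals[e["tc"]]
            t["n"] += e["samples"]
            t["avg_sum"] += e["avg"] * e["samples"]  # weight by sample count
            t["min"] = e["min"] if t["min"] is None else min(t["min"], e["min"])
            t["max"] = max(t["max"], e["max"])
        for tc, t in sorted(totals.items()):
            if t["n"]:
                print(f"{tc}: avg {t['avg_sum'] / t['n']:.1f} min {t['min']} "
                      f"max {t['max']} ({t['n']} samples)")
        time.sleep(interval)

if __name__ == "__main__":
    poll_xdp_pping()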

On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com>
wrote:

> It's probably not entirely thread-safe right now (ran into some issues
> reading per_cpu maps back from userspace; hopefully, I'll get that figured
> out) - but the commits I just pushed have it basically working on
> single-stream testing. :-)
>
> Setup cpumap as usual, and periodically run xdp-pping. This gives you
> per-connection RTT information in JSON:
>
> [
> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
> {}]
>
> (With the extra {} because I'm not tracking the tail and haven't done
> comma removal). The tool also empties the various maps used to gather data,
> acting as a "reset" point. There's a max of 60 samples per queue, in a
> ringbuffer setup (so newest will start to overwrite the oldest).
>
> I'll start trying to test on a larger scale now.
>
> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
> robert.chacon@jackrabbitwireless.com> wrote:
>
>> Hey Herbert,
>>
>> Fantastic work! Super exciting to see this coming together, especially so
>> quickly.
>> I'll test it soon.
>> I understand and agree with your decision to omit certain features (ICMP
>> tracking, DNS tracking, etc.) to optimize performance for our use case. Like
>> you said, in order to merge the functionality without a performance hit,
>> merging them is sort of the only way right now. Otherwise there would be a
>> lot of redundancy and lost throughput for an ISP's use. Though hopefully
>> long term there will be a way to keep all projects working independently
>> but interoperably with a plugin system of some kind.
>>
>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
>> optimizations for high sub counts (8000+ subs) as well as stateful changes
>> to the queue structure.
>> I'm working to set up a physical lab to test high throughput and high
>> client count scenarios.
>> When testing beyond ~32,000 filters we get "no space left on device" from
>> xdp-cpumap-tc, which I think relates to the bpf map size limitation you
>> mentioned. Maybe in the coming months we can take a look at that.
>>
>> Anyway great work on the cpumap-pping program! Excited to see more on
>> this.
>>
>> Thanks,
>> Robert
>>
>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
>> libreqos@lists.bufferbloat.net> wrote:
>>
>>> Hey,
>>>
>>> My current (unfinished) progress on this is now available here:
>>> https://github.com/thebracket/cpumap-pping-hackjob
>>>
>>> I mean it about the warnings, this isn't at all stable, debugged - and
>>> can't promise that it won't unleash the nasal demons
>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>>
>>> With that said, I'm pretty happy so far:
>>>
>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
>>> shunted onto a dedicated CPU. It has to run on both
>>>   the inbound and outbound classifiers, since otherwise it would only
>>> see half the conversation.
>>> * It does assume that your ingress and egress CPUs are mapped to the
>>> same interface; I do that anyway in BracketQoS. Not doing
>>>   that opens up a potential world of pain, since writes to the shared
>>> maps would require a locking scheme. Too much locking, and you lose all of
>>> the benefit of using multiple CPUs to begin with.
>>> * It is pretty wasteful of RAM, but most of the shaper systems I've
>>> worked with have lots of it.
>>> * I've been gradually removing features that I don't want for
>>> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do
>>> that.
>>> * Rate limiting is working, but I removed the requirement for a shared
>>> configuration provided from userland - so right now it's always set to
>>> report at 1 second intervals per stream.
>>>
>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world",
>>> and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max,
>>> via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and fast
>>> SSDs)
>>>
>>> Output currently consists of debug messages reading:
>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk:
>>> (tc) Flow open event
>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 374696
>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk:
>>> (tc) Flow open event
>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 247069
>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 5217155
>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 4515394
>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 4481289
>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 4255268
>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 5249493
>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 3795993
>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 3949519
>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 4365335
>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 4154910
>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk:
>>> (tc) Send performance event (5,1), 4405582
>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk:
>>> (tc) Send flow event
>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk:
>>> (tc) Send flow event
>>>
>>> The times haven't been tweaked yet. The (5,1) is tc handle major/minor,
>>> allocated by the xdp-cpumap parent.
>>> I get pretty low latency between VMs; I'll set up a test with some
>>> real-world data very soon.
>>>
>>> I plan to keep hacking away, but feel free to take a peek.
>>>
>>> Thanks,
>>> Herbert
>>>
>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se>
>>> wrote:
>>>
>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>>>> notes.
>>>>
>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>>> > [ Adding Simon to Cc ]
>>>> >
>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
>>>> writes:
>>>> >
>>>> > > Hey,
>>>> > >
>>>> > > I've had some pretty good success with merging xdp-pping (
>>>> > >
>>>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc
>>>> ).
>>>> > >
>>>> > > I ported over most of the xdp-pping code, and then changed the
>>>> entry point
>>>> > > and packet parsing code to make use of the work already done in
>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no
>>>> need to do
>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin
>>>> them -
>>>> > > otherwise the two tc instances don't properly share data.
>>>> > >
>>>>
>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>>>> the same CPU core at both ingress and egress. Otherwise, if a flow may
>>>> be processed by different cores on ingress and egress the per-CPU maps
>>>> will not really work reliably as each core will have a different view
>>>> on the state of the flow, if there's been a previous packet with a
>>>> certain TSval from that flow etc.
>>>>
>>>> Furthermore, if a flow is always processed on the same core (on both
>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>>> memory. From my understanding the keys for per-CPU maps are still
>>>> shared across all CPUs, it's just that each CPU gets its own value. So
>>>> all CPUs will then have their own data for each flow, but it's only the
>>>> CPU processing the flow that will have any relevant data for the flow
>>>> while the remaining CPUs will just have an empty state for that flow.
>>>> Under the same assumption that packets within the same flow are always
>>>> processed on the same core there should generally not be any
>>>> concurrency issues with having a global (non-per-CPU) either as packets
>>>> from the same flow cannot be processed concurrently then (and thus no
>>>> concurrent access to the same value in the map). I am however still
>>>> very unclear on if there's any considerable performance impact between
>>>> global and per-CPU map versions if the same key is not accessed
>>>> concurrently.
>>>>
>>>> > > Right now, output
>>>> > > is just stubbed - I've still got to port the perfmap output code.
>>>> Instead,
>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I
>>>> can see
>>>> > > roughly what the output would look like.
>>>> > >
>>>> > > With debug enabled and just logging I'm now getting about 4.9
>>>> Gbits/sec on
>>>> > > single-stream iperf between two VMs (with a shaper VM in the
>>>> middle). :-)
>>>> >
>>>> > Just FYI, that "just logging" is probably the biggest source of
>>>> > overhead, then. What Simon found was that sending the data from kernel
>>>> > to userspace is one of the most expensive bits of epping, at least
>>>> when
>>>> > the number of data points goes up (which is does as additional flows
>>>> are
>>>> > added).
>>>>
>>>> Yhea, reporting individual RTTs when there's lots of them (you may get
>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>>>> direct overhead from the tool itself, but also becomes demanding for
>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
>>>> analyze etc. a very large amount of RTTs). One way to deal with that is
>>>> of course to just apply some sort of sampling (the -r/--rate-limit and
>>>> -R/--rtt-rate
>>>> >
>>>> > > So my question: how would you prefer to receive this data? I'll
>>>> have to
>>>> > > write a daemon that provides userspace control (periodic cleanup as
>>>> well as
>>>> > > reading the performance stream), so the world's kinda our oyster. I
>>>> can
>>>> > > stick to Kathie's original format (and dump it to a named pipe,
>>>> perhaps?),
>>>> > > a condensed format that only shows what you want to use, an
>>>> efficient
>>>> > > binary format if you feel like parsing that...
>>>> >
>>>> > It would be great if we could combine efforts a bit here so we don't
>>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>>>> > whatever daemon you end up writing can agree on data format etc that
>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>>> >
>>>> > Briefly what I've discussed before with Simon was to have the ability
>>>> to
>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>>>> > utility periodically pull them out. What we discussed was doing this
>>>> > using an LPM map (which is not in that PR yet). The idea would be that
>>>> > userspace would populate the LPM map with the keys (prefixes) they
>>>> > wanted statistics for (in LibreQOS context that could be one key per
>>>> > customer, for instance). Epping would then do a map lookup into the
>>>> LPM,
>>>> > and if it gets a match it would update the statistics in that map
>>>> entry
>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
>>>> > below uses this technique where userspace will "reset" the histogram
>>>> > every time it loads it by swapping out two different map entries when
>>>> it
>>>> > does a read; this allows you to control the sampling rate from
>>>> > userspace, and you'll just get the data since the last time you
>>>> polled.
>>>>
>>>> Thank's Toke for summarzing both the current state and the plan going
>>>> forward. I will just note that this PR (and all my other work with
>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>>>> on hold for a couple of weeks right now as I'm trying to finish up a
>>>> paper.
>>>>
>>>> > I was thinking that if we all can agree on the map format, then your
>>>> > polling daemon could be one userspace "client" for that, and the
>>>> epping
>>>> > binary itself could be another; but we could keep compatibility
>>>> between
>>>> > the two, so we don't duplicate effort.
>>>> >
>>>> > Similarly, refactoring of the epping code itself so it can be plugged
>>>> > into the cpumap-tc code would be a good goal...
>>>>
>>>> Should probably do that...at some point. In general I think it's a bit
>>>> of an interesting problem to think about how to chain multiple XDP/tc
>>>> programs together in an efficent way. Most XDP and tc programs will do
>>>> some amount of packet parsing and when you have many chained programs
>>>> parsing the same packets this obviously becomes a bit wasteful. In the
>>>> same time it would be nice if one didn't need to manually merge
>>>> multiple programs together into a single one like this to get rid of
>>>> this duplicated parsing, or at least make that process of merging those
>>>> programs as simple as possible.
>>>>
>>>>
>>>> > -Toke
>>>> >
>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>>
>>>> När du skickar e-post till Karlstads universitet behandlar vi dina
>>>> personuppgifter<https://www.kau.se/gdpr>.
>>>> When you send an e-mail to Karlstad University, we will process your
>>>> personal data<https://www.kau.se/en/gdpr>.
>>>>
>>> _______________________________________________
>>> LibreQoS mailing list
>>> LibreQoS@lists.bufferbloat.net
>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>
>>
>>
>> --
>> Robert Chacón
>> CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>
>>
>

[-- Attachment #2: Type: text/html, Size: 24477 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 13:44           ` Herbert Wolverson
@ 2022-10-19 13:58             ` Dave Taht
  2022-10-19 14:01               ` Herbert Wolverson
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Taht @ 2022-10-19 13:58 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: libreqos

Could I coax you to adopt flent?

apt-get install flent netperf irtt fping

You sometimes have to compile netperf yourself with --enable-demo on
some systems.
There are a bunch of Python libs needed for the GUI, but only on the client.

Then you can run a really gnarly test series and plot the results over time.

flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
the_server_name rrul # 110 other tests
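
The test boxes don't need any of the GUI bits either - each run drops a
*.flent.gz data file that you can copy off and plot later on whatever
machine has the matplotlib stuff installed, something like (flags from
memory, double-check against flent --help):

flent -i the-run.flent.gz -p all_scaled -o the-run.png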


On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
<libreqos@lists.bufferbloat.net> wrote:
>
> Hey,
>
> Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
>
> My testing environment has grown a bit:
> * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
> * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf server.
> * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2. Hosts iperf client.
> * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3. Hosts iperf client.
>
> ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
> ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
>
> These are all on a host machine running Windows 11, a core i7 12th gen, 32 Gb RAM and fast SSD setup.
>
> TEST 1: DUAL STREAMS, LOW THROUGHPUT
>
> For this test, LibreQoS is configured:
> * Two APs, each with 5gbit/s max.
> * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> * Set to use Cake
>
> On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
>
> [
> {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> {}]
>
> Or when I waited a while to gather/reset:
>
> [
> {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> {}]
>
> The ShaperVM shows no errors, just periodic logging that it is recording data.  CPU is about 2-3% on two CPUs, zero on the others (as expected).
>
> After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>
> So for smaller streams, I'd call this a success.
>
> TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>
> For this test, LibreQoS is configured:
> * Two APs, each with 5gb/s max.
> * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>
> Run iperfc -c 100.64.1.1 -t 500 on each client at the same time.
>
> xdp_pping shows results, too:
>
> [
> {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> {}]
>
> [
> {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> {}]
>
> The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>
> After 500 seconds of continual iperfing, each client reported a throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec and 226 GBytes.
>
> Maxing out HyperV like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
>
> So for two large streams, I think we're doing pretty well also!
>
> TEST 3: DUAL STREAMS, SINGLE CPU
>
> This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>
> ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The pping stats start to show a bit of degradation in performance for pounding it so hard:
>
> [
> {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> {}]
>
> For whatever reason, it smoothed out over time:
>
> [
> {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> {}]
>
> Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s performance, over 129 Gbytes of data.
>
> TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>
> This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee)
>
> ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there's always lots of samples in the buffer:
>
> [
> {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
> {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
> {}]
>
> This test also ran better than I expected. You can definitely see some latency creeping in as I make the system work hard. Each VM showed around 2.4 Gbit/s in total performance at the end of the iperf session. There's definitely some latency creeping in, which is expected - but I'm not sure I expected quite that much.
>
> WHAT'S NEXT & CONCLUSION
>
> I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
>
> The base system seems to be working pretty solidly, at least for small tests.Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
>
> I also need to re-run these tests without the pping system to provide some before/after analysis.
>
> On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com> wrote:
>>
>> It's probably not entirely thread-safe right now (ran into some issues reading per_cpu maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working on single-stream testing. :-)
>>
>> Setup cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
>>
>> [
>> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>> {}]
>>
>> (With the extra {} because I'm not tracking the tail and haven't done comma removal). The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
>>
>> I'll start trying to test on a larger scale now.
>>
>> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <robert.chacon@jackrabbitwireless.com> wrote:
>>>
>>> Hey Herbert,
>>>
>>> Fantastic work! Super exciting to see this coming together, especially so quickly.
>>> I'll test it soon.
>>> I understand and agree with your decision to omit certain features (ICMP tracking,DNS tracking, etc) to optimize performance for our use case. Like you said, in order to merge the functionality without a performance hit, merging them is sort of the only way right now. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all projects working independently but interoperably with a plugin system of some kind.
>>>
>>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
>>> I'm working to set up a physical lab to test high throughput and high client count scenarios.
>>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the bpf map size limitation you mentioned. Maybe in the coming months we can take a look at that.
>>>
>>> Anyway great work on the cpumap-pping program! Excited to see more on this.
>>>
>>> Thanks,
>>> Robert
>>>
>>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>>>>
>>>> Hey,
>>>>
>>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
>>>>
>>>> I mean it about the warnings, this isn't at all stable, debugged - and can't promise that it won't unleash the nasal demons
>>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>>>
>>>> With that said, I'm pretty happy so far:
>>>>
>>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both
>>>>   the inbound and outbound classifiers, since otherwise it would only see half the conversation.
>>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing
>>>>   that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
>>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
>>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
>>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.
>>>>
>>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and fast SSDs)
>>>>
>>>> Output currently consists of debug messages reading:
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
>>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event
>>>>
>>>> The times haven't been tweaked yet. The (5,1) is tc handle major/minor, allocated by the xdp-cpumap parent.
>>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
>>>>
>>>> I plan to keep hacking away, but feel free to take a peek.
>>>>
>>>> Thanks,
>>>> Herbert
>>>>
>>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se> wrote:
>>>>>
>>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>>>>> notes.
>>>>>
>>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>>>> > [ Adding Simon to Cc ]
>>>>> >
>>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>>>>> >
>>>>> > > Hey,
>>>>> > >
>>>>> > > I've had some pretty good success with merging xdp-pping (
>>>>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>>>>> > >
>>>>> > > I ported over most of the xdp-pping code, and then changed the entry point
>>>>> > > and packet parsing code to make use of the work already done in
>>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
>>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin them -
>>>>> > > otherwise the two tc instances don't properly share data.
>>>>> > >
>>>>>
>>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>>>>> the same CPU core at both ingress or egress. Otherwise, if a flow may
>>>>> be processed by different cores on ingress and egress the per-CPU maps
>>>>> will not really work reliably as each core will have a different view
>>>>> on the state of the flow, if there's been a previous packet with a
>>>>> certain TSval from that flow etc.
>>>>>
>>>>> Furthermore, if a flow is always processed on the same core (on both
>>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>>>> memory. From my understanding the keys for per-CPU maps are still
>>>>> shared across all CPUs, it's just that each CPU gets its own value. So
>>>>> all CPUs will then have their own data for each flow, but it's only the
>>>>> CPU processing the flow that will have any relevant data for the flow
>>>>> while the remaining CPUs will just have an empty state for that flow.
>>>>> Under the same assumption that packets within the same flow are always
>>>>> processed on the same core there should generally not be any
>>>>> concurrency issues with having a global (non-per-CPU) either as packets
>>>>> from the same flow cannot be processed concurrently then (and thus no
>>>>> concurrent access to the same value in the map). I am however still
>>>>> very unclear on if there's any considerable performance impact between
>>>>> global and per-CPU map versions if the same key is not accessed
>>>>> concurrently.
>>>>>
>>>>> > > Right now, output
>>>>> > > is just stubbed - I've still got to port the perfmap output code. Instead,
>>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
>>>>> > > roughly what the output would look like.
>>>>> > >
>>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
>>>>> > > single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>>>>> >
>>>>> > Just FYI, that "just logging" is probably the biggest source of
>>>>> > overhead, then. What Simon found was that sending the data from kernel
>>>>> > to userspace is one of the most expensive bits of epping, at least when
>>>>> > the number of data points goes up (which is does as additional flows are
>>>>> > added).
>>>>>
>>>>> Yhea, reporting individual RTTs when there's lots of them (you may get
>>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>>>>> direct overhead from the tool itself, but also becomes demanding for
>>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
>>>>> analyze etc. a very large amount of RTTs). One way to deal with that is
>>>>> of course to just apply some sort of sampling (the -r/--rate-limit and
>>>>> -R/--rtt-rate
>>>>> >
>>>>> > > So my question: how would you prefer to receive this data? I'll have to
>>>>> > > write a daemon that provides userspace control (periodic cleanup as well as
>>>>> > > reading the performance stream), so the world's kinda our oyster. I can
>>>>> > > stick to Kathie's original format (and dump it to a named pipe, perhaps?),
>>>>> > > a condensed format that only shows what you want to use, an efficient
>>>>> > > binary format if you feel like parsing that...
>>>>> >
>>>>> > It would be great if we could combine efforts a bit here so we don't
>>>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>>>>> > whatever daemon you end up writing can agree on data format etc that
>>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>>>> >
>>>>> > Briefly what I've discussed before with Simon was to have the ability to
>>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>>>>> > utility periodically pull them out. What we discussed was doing this
>>>>> > using an LPM map (which is not in that PR yet). The idea would be that
>>>>> > userspace would populate the LPM map with the keys (prefixes) they
>>>>> > wanted statistics for (in LibreQOS context that could be one key per
>>>>> > customer, for instance). Epping would then do a map lookup into the LPM,
>>>>> > and if it gets a match it would update the statistics in that map entry
>>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
>>>>> > below uses this technique where userspace will "reset" the histogram
>>>>> > every time it loads it by swapping out two different map entries when it
>>>>> > does a read; this allows you to control the sampling rate from
>>>>> > userspace, and you'll just get the data since the last time you polled.
>>>>>
>>>>> Thank's Toke for summarzing both the current state and the plan going
>>>>> forward. I will just note that this PR (and all my other work with
>>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>>>>> on hold for a couple of weeks right now as I'm trying to finish up a
>>>>> paper.
>>>>>
>>>>> > I was thinking that if we all can agree on the map format, then your
>>>>> > polling daemon could be one userspace "client" for that, and the epping
>>>>> > binary itself could be another; but we could keep compatibility between
>>>>> > the two, so we don't duplicate effort.
>>>>> >
>>>>> > Similarly, refactoring of the epping code itself so it can be plugged
>>>>> > into the cpumap-tc code would be a good goal...
>>>>>
>>>>> Should probably do that...at some point. In general I think it's a bit
>>>>> of an interesting problem to think about how to chain multiple XDP/tc
>>>>> programs together in an efficent way. Most XDP and tc programs will do
>>>>> some amount of packet parsing and when you have many chained programs
>>>>> parsing the same packets this obviously becomes a bit wasteful. In the
>>>>> same time it would be nice if one didn't need to manually merge
>>>>> multiple programs together into a single one like this to get rid of
>>>>> this duplicated parsing, or at least make that process of merging those
>>>>> programs as simple as possible.
>>>>>
>>>>>
>>>>> > -Toke
>>>>> >
>>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>>>
>>>>> När du skickar e-post till Karlstads universitet behandlar vi dina personuppgifter<https://www.kau.se/gdpr>.
>>>>> When you send an e-mail to Karlstad University, we will process your personal data<https://www.kau.se/en/gdpr>.
>>>>
>>>> _______________________________________________
>>>> LibreQoS mailing list
>>>> LibreQoS@lists.bufferbloat.net
>>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>
>>>
>>>
>>> --
>>> Robert Chacón
>>> CEO | JackRabbit Wireless LLC
>
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos



-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 13:58             ` Dave Taht
@ 2022-10-19 14:01               ` Herbert Wolverson
  2022-10-19 14:05                 ` Herbert Wolverson
  0 siblings, 1 reply; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-19 14:01 UTC (permalink / raw)
  Cc: libreqos

[-- Attachment #1: Type: text/plain, Size: 22054 bytes --]

I'll definitely take a look - that does look interesting. I don't have X11
on any of my test VMs, but
it looks like it can work without the GUI.

Thanks!

On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:

> could I coax you to adopt flent?
>
> apt-get install flent netperf irtt fping
>
> You sometimes have to compile netperf yourself with --enable-demo on
> some systems.
> There are a bunch of python libs neede for the gui, but only on the client.
>
> Then you can run a really gnarly test series and plot the results over
> time.
>
> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
> the_server_name rrul # 110 other tests
>
>
> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
> <libreqos@lists.bufferbloat.net> wrote:
> >
> > Hey,
> >
> > Testing the current version (
> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
> than I hoped. This build has shared (not per-cpu) maps, and a userspace
> daemon (xdp_pping) to extract and reset stats.
> >
> > My testing environment has grown a bit:
> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
> cpumap-pping-hackjob version of xdp-cpumap.
> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf
> server.
> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2. Hosts
> iperf client.
> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3. Hosts
> iperf client.
> >
> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on
> a virtual switch.
> > ExtTest and the other interface (WAN facing) of ShaperVM are on a
> different virtual switch.
> >
> > These are all on a host machine running Windows 11, a core i7 12th gen,
> 32 Gb RAM and fast SSD setup.
> >
> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
> >
> > For this test, LibreQoS is configured:
> > * Two APs, each with 5gbit/s max.
> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about
> 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> > * Set to use Cake
> >
> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500
> (for a long run). Running xdp_pping yields correct results:
> >
> > [
> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> > {}]
> >
> > Or when I waited a while to gather/reset:
> >
> > [
> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> > {}]
> >
> > The ShaperVM shows no errors, just periodic logging that it is recording
> data.  CPU is about 2-3% on two CPUs, zero on the others (as expected).
> >
> > After 500 seconds of continual iperfing, each client reported a
> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
> >
> > So for smaller streams, I'd call this a success.
> >
> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
> >
> > For this test, LibreQoS is configured:
> > * Two APs, each with 5gb/s max.
> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s!
> Mapped to 1:5 and 2:5 respectively (separate CPUs).
> >
> > Run iperfc -c 100.64.1.1 -t 500 on each client at the same time.
> >
> > xdp_pping shows results, too:
> >
> > [
> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> > {}]
> >
> > [
> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> > {}]
> >
> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
> >
> > After 500 seconds of continual iperfing, each client reported a
> throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec and 226 GBytes.
> >
> > Maxing out HyperV like this is inducing a bit of latency (which is to be
> expected), but it's not bad. I also forgot to disable hyperthreading, and
> looking at the host performance it is sometimes running the second virtual
> CPU on an underpowered "fake" CPU.
> >
> > So for two large streams, I think we're doing pretty well also!
> >
> > TEST 3: DUAL STREAMS, SINGLE CPU
> >
> > This test is designed to try and blow things up. It's the same as test
> 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
> >
> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The
> pping stats start to show a bit of degradation in performance for pounding
> it so hard:
> >
> > [
> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> > {}]
> >
> > For whatever reason, it smoothed out over time:
> >
> > [
> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> > {}]
> >
> > Surprisingly (to me), I didn't encounter errors. Each client received
> 2.22 Gbit/s performance, over 129 Gbytes of data.
> >
> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
> >
> > This test is also designed to break things. Same as test 3, but using
> iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the
> flow tracking. (Shorter time window because I really wanted to go and find
> coffee)
> >
> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results
> show that this torture test is worsening performance, and there's always
> lots of samples in the buffer:
> >
> > [
> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
> > {}]
> >
> > This test also ran better than I expected. You can definitely see some
> latency creeping in as I make the system work hard. Each VM showed around
> 2.4 Gbit/s in total performance at the end of the iperf session. There's
> definitely some latency creeping in, which is expected - but I'm not sure I
> expected quite that much.
> >
> > WHAT'S NEXT & CONCLUSION
> >
> > I noticed that I forgot to turn off efficient power management on my VMs
> and host, and left Hyperthreading on by mistake. So that hurts overall
> performance.
> >
> > The base system seems to be working pretty solidly, at least for small
> tests.Next up, I'll be removing extraneous debug reporting code, removing
> some code paths that don't do anything but report, and looking for any
> small optimization opportunities. I'll then re-run these tests. Once that's
> done, I hope to find a maintenance window on my WISP and try it with actual
> traffic.
> >
> > I also need to re-run these tests without the pping system to provide
> some before/after analysis.
> >
> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com>
> wrote:
> >>
> >> It's probably not entirely thread-safe right now (ran into some issues
> reading per_cpu maps back from userspace; hopefully, I'll get that figured
> out) - but the commits I just pushed have it basically working on
> single-stream testing. :-)
> >>
> >> Setup cpumap as usual, and periodically run xdp-pping. This gives you
> per-connection RTT information in JSON:
> >>
> >> [
> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
> >> {}]
> >>
> >> (With the extra {} because I'm not tracking the tail and haven't done
> comma removal). The tool also empties the various maps used to gather data,
> acting as a "reset" point. There's a max of 60 samples per queue, in a
> ringbuffer setup (so newest will start to overwrite the oldest).
> >>
> >> I'll start trying to test on a larger scale now.
> >>
> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
> robert.chacon@jackrabbitwireless.com> wrote:
> >>>
> >>> Hey Herbert,
> >>>
> >>> Fantastic work! Super exciting to see this coming together, especially
> so quickly.
> >>> I'll test it soon.
> >>> I understand and agree with your decision to omit certain features
> (ICMP tracking,DNS tracking, etc) to optimize performance for our use case.
> Like you said, in order to merge the functionality without a performance
> hit, merging them is sort of the only way right now. Otherwise there would
> be a lot of redundancy and lost throughput for an ISP's use. Though
> hopefully long term there will be a way to keep all projects working
> independently but interoperably with a plugin system of some kind.
> >>>
> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
> optimizations for high sub counts (8000+ subs) as well as stateful changes
> to the queue structure.
> >>> I'm working to set up a physical lab to test high throughput and high
> client count scenarios.
> >>> When testing beyond ~32,000 filters we get "no space left on device"
> from xdp-cpumap-tc, which I think relates to the bpf map size limitation
> you mentioned. Maybe in the coming months we can take a look at that.
> >>>
> >>> Anyway great work on the cpumap-pping program! Excited to see more on
> this.
> >>>
> >>> Thanks,
> >>> Robert
> >>>
> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
> libreqos@lists.bufferbloat.net> wrote:
> >>>>
> >>>> Hey,
> >>>>
> >>>> My current (unfinished) progress on this is now available here:
> https://github.com/thebracket/cpumap-pping-hackjob
> >>>>
> >>>> I mean it about the warnings, this isn't at all stable, debugged -
> and can't promise that it won't unleash the nasal demons
> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
> >>>>
> >>>> With that said, I'm pretty happy so far:
> >>>>
> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
> shunted onto a dedicated CPU. It has to run on both
> >>>>   the inbound and outbound classifiers, since otherwise it would only
> see half the conversation.
> >>>> * It does assume that your ingress and egress CPUs are mapped to the
> same interface; I do that anyway in BracketQoS. Not doing
> >>>>   that opens up a potential world of pain, since writes to the shared
> maps would require a locking scheme. Too much locking, and you lose all of
> the benefit of using multiple CPUs to begin with.
> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've
> worked with have lots of it.
> >>>> * I've been gradually removing features that I don't want for
> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do
> that.
> >>>> * Rate limiting is working, but I removed the requirement for a
> shared configuration provided from userland - so right now it's always set
> to report at 1 second intervals per stream.
> >>>>
> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and
> "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max,
> via a cake/HTB queue setup) is around 5 gbit/s at present, on my
> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and
> fast SSDs)
> >>>>
> >>>> Output currently consists of debug messages reading:
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk:
> (tc) Flow open event
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk:
> (tc) Send performance event (5,1), 374696
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk:
> (tc) Flow open event
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk:
> (tc) Send performance event (5,1), 247069
> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk:
> (tc) Send performance event (5,1), 5217155
> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk:
> (tc) Send performance event (5,1), 4515394
> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk:
> (tc) Send performance event (5,1), 4481289
> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk:
> (tc) Send performance event (5,1), 4255268
> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk:
> (tc) Send performance event (5,1), 5249493
> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk:
> (tc) Send performance event (5,1), 3795993
> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk:
> (tc) Send performance event (5,1), 3949519
> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk:
> (tc) Send performance event (5,1), 4365335
> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk:
> (tc) Send performance event (5,1), 4154910
> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk:
> (tc) Send performance event (5,1), 4405582
> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk:
> (tc) Send flow event
> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk:
> (tc) Send flow event
> >>>>
> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
> major/minor, allocated by the xdp-cpumap parent.
> >>>> I get pretty low latency between VMs; I'll set up a test with some
> real-world data very soon.
> >>>>
> >>>> I plan to keep hacking away, but feel free to take a peek.
> >>>>
> >>>> Thanks,
> >>>> Herbert
> >>>>
> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
> Simon.Sundberg@kau.se> wrote:
> >>>>>
> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
> >>>>> notes.
> >>>>>
> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> >>>>> > [ Adding Simon to Cc ]
> >>>>> >
> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
> writes:
> >>>>> >
> >>>>> > > Hey,
> >>>>> > >
> >>>>> > > I've had some pretty good success with merging xdp-pping (
> >>>>> > >
> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> >>>>> > > into xdp-cpumap-tc (
> https://github.com/xdp-project/xdp-cpumap-tc ).
> >>>>> > >
> >>>>> > > I ported over most of the xdp-pping code, and then changed the
> entry point
> >>>>> > > and packet parsing code to make use of the work already done in
> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no
> need to do
> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to
> pin them -
> >>>>> > > otherwise the two tc instances don't properly share data.
> >>>>> > >
> >>>>>
> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow may
> >>>>> be processed by different cores on ingress and egress, the per-CPU maps
> >>>>> will not really work reliably, as each core will have a different view
> >>>>> on the state of the flow, if there's been a previous packet with a
> >>>>> certain TSval from that flow etc.
> >>>>>
> >>>>> Furthermore, if a flow is always processed on the same core (on both
> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
> >>>>> memory. From my understanding the keys for per-CPU maps are still
> >>>>> shared across all CPUs, it's just that each CPU gets its own value.
> So
> >>>>> all CPUs will then have their own data for each flow, but it's only
> the
> >>>>> CPU processing the flow that will have any relevant data for the flow
> >>>>> while the remaining CPUs will just have an empty state for that flow.
> >>>>> Under the same assumption that packets within the same flow are
> always
> >>>>> processed on the same core, there should generally not be any
> >>>>> concurrency issues with having a global (non-per-CPU) map either, as
> >>>>> packets from the same flow cannot be processed concurrently then (and thus no
> >>>>> concurrent access to the same value in the map). I am however still
> >>>>> very unclear on if there's any considerable performance impact
> between
> >>>>> global and per-CPU map versions if the same key is not accessed
> >>>>> concurrently.
> >>>>>
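
To make the per-CPU vs. global distinction above concrete, here is a minimal
sketch of the two map flavours in libbpf's BTF map syntax. The struct layouts
and names are invented for illustration; they are not the actual
pping/cpumap-pping definitions.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key {
    __u32 saddr, daddr;
    __u16 sport, dport;
    __u8  proto;
};

struct flow_state {
    __u32 last_tsval;
    __u64 last_seen_ns;
};

/* Per-CPU variant: keys are shared, but every CPU gets its own value slot,
 * so if a flow only ever hits one CPU the other slots sit empty. */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
} flow_state_percpu SEC(".maps");

/* Global variant: one value per flow, shared by all CPUs. Pinning it lets
 * separately loaded tc programs (ingress and egress) open the same map. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
    __uint(pinning, LIBBPF_PIN_BY_NAME);   /* under /sys/fs/bpf by default */
} flow_state_shared SEC(".maps");
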
> >>>>> > > Right now, output
> >>>>> > > is just stubbed - I've still got to port the perfmap output
> code. Instead,
> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I
> can see
> >>>>> > > roughly what the output would look like.
> >>>>> > >
> >>>>> > > With debug enabled and just logging I'm now getting about 4.9
> Gbits/sec on
> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the
> middle). :-)
> >>>>> >
> >>>>> > Just FYI, that "just logging" is probably the biggest source of
> >>>>> > overhead, then. What Simon found was that sending the data from
> kernel
> >>>>> > to userspace is one of the most expensive bits of epping, at least
> when
> >>>>> > the number of data points goes up (which it does as additional
> >>>>> > flows are added).
> >>>>>
> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may get
> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
> >>>>> direct overhead from the tool itself, but also becomes demanding for
> >>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
> >>>>> analyze etc. a very large amount of RTTs). One way to deal with that
> is
> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit
> and
> >>>>> -R/--rtt-rate
> >>>>> >
> >>>>> > > So my question: how would you prefer to receive this data? I'll
> have to
> >>>>> > > write a daemon that provides userspace control (periodic cleanup
> as well as
> >>>>> > > reading the performance stream), so the world's kinda our
> oyster. I can
> >>>>> > > stick to Kathie's original format (and dump it to a named pipe,
> perhaps?),
> >>>>> > > a condensed format that only shows what you want to use, an
> efficient
> >>>>> > > binary format if you feel like parsing that...
> >>>>> >
> >>>>> > It would be great if we could combine efforts a bit here so we
> don't
> >>>>> > fork the codebase more than we have to. I.e., if "upstream" epping
> and
> >>>>> > whatever daemon you end up writing can agree on data format etc
> that
> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
> >>>>> >
> >>>>> > Briefly what I've discussed before with Simon was to have the
> ability to
> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a
> userspace
> >>>>> > utility periodically pull them out. What we discussed was doing
> this
> >>>>> > using an LPM map (which is not in that PR yet). The idea would be
> that
> >>>>> > userspace would populate the LPM map with the keys (prefixes) they
> >>>>> > wanted statistics for (in LibreQOS context that could be one key
> per
> >>>>> > customer, for instance). Epping would then do a map lookup into
> the LPM,
> >>>>> > and if it gets a match it would update the statistics in that map
> entry
> >>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
> >>>>> > below uses this technique where userspace will "reset" the
> histogram
> >>>>> > every time it loads it by swapping out two different map entries
> when it
> >>>>> > does a read; this allows you to control the sampling rate from
> >>>>> > userspace, and you'll just get the data since the last time you
> polled.
> >>>>>
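
To sketch that idea in code: this is only an illustration of the approach
described above, not the code in the WiP PR, and the names and histogram
scheme are made up.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define RTT_BUCKETS 32

struct rtt_hist {
    __u64 bucket[RTT_BUCKETS];     /* rough log2(ms) histogram */
};

/* LPM trie keys must start with the prefix length. */
struct ipv4_lpm_key {
    __u32 prefixlen;
    __u32 addr;
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
    __uint(max_entries, 16384);
    __type(key, struct ipv4_lpm_key);
    __type(value, struct rtt_hist);
} rtt_by_prefix SEC(".maps");

/* Userspace inserts one entry per prefix (e.g. per customer). The BPF side
 * just looks up the destination IP and bumps a bucket if a prefix matches. */
static __always_inline void record_rtt(__u32 dst_ip, __u64 rtt_ns)
{
    struct ipv4_lpm_key key = { .prefixlen = 32, .addr = dst_ip };
    struct rtt_hist *hist = bpf_map_lookup_elem(&rtt_by_prefix, &key);
    __u64 ms = rtt_ns / 1000000;
    __u32 b = 0;

    if (!hist)
        return;                    /* no prefix configured for this IP */

    while (ms > 1 && b < RTT_BUCKETS - 1) {  /* bounded loop, recent kernels */
        ms >>= 1;
        b++;
    }
    __sync_fetch_and_add(&hist->bucket[b], 1);
}
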
> >>>>> Thanks, Toke, for summarizing both the current state and the plan going
> >>>>> forward. I will just note that this PR (and all my other work with
> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or
> less
> >>>>> on hold for a couple of weeks right now as I'm trying to finish up a
> >>>>> paper.
> >>>>>
> >>>>> > I was thinking that if we all can agree on the map format, then
> your
> >>>>> > polling daemon could be one userspace "client" for that, and the
> epping
> >>>>> > binary itself could be another; but we could keep compatibility
> between
> >>>>> > the two, so we don't duplicate effort.
> >>>>> >
> >>>>> > Similarly, refactoring of the epping code itself so it can be
> plugged
> >>>>> > into the cpumap-tc code would be a good goal...
> >>>>>
> >>>>> Should probably do that...at some point. In general I think it's a
> bit
> >>>>> of an interesting problem to think about how to chain multiple XDP/tc
> >>>>> programs together in an efficient way. Most XDP and tc programs will do
> >>>>> some amount of packet parsing, and when you have many chained programs
> >>>>> parsing the same packets this obviously becomes a bit wasteful. At the
> >>>>> same time it would be nice if one didn't need to manually merge
> >>>>> multiple programs together into a single one like this to get rid of
> >>>>> this duplicated parsing, or at least make that process of merging
> those
> >>>>> programs as simple as possible.
> >>>>>
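
For completeness, the standard chaining primitive today is a program array
plus bpf_tail_call(); a bare-bones sketch follows. It hands the packet to the
next program but does nothing about the duplicated parsing, which is exactly
the pain point above.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, __u32);
    __type(value, __u32);
} next_prog SEC(".maps");

SEC("tc")
int first_pass(struct __sk_buff *skb)
{
    /* ... parse headers and do this program's own work here ... */

    /* Jump to slot 0 of the array (populated from userspace). On success
     * this never returns; the next program sees the same skb but has to
     * re-parse it unless parse results are shared via skb->cb or a map. */
    bpf_tail_call(skb, &next_prog, 0);

    /* Fallthrough when no program is installed in slot 0. */
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
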
> >>>>>
> >>>>> > -Toke
> >>>>> >
> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
> >>>>>
> >>>>
> >>>> _______________________________________________
> >>>> LibreQoS mailing list
> >>>> LibreQoS@lists.bufferbloat.net
> >>>> https://lists.bufferbloat.net/listinfo/libreqos
> >>>
> >>>
> >>>
> >>> --
> >>> Robert Chacón
> >>> CEO | JackRabbit Wireless LLC
> >
> > _______________________________________________
> > LibreQoS mailing list
> > LibreQoS@lists.bufferbloat.net
> > https://lists.bufferbloat.net/listinfo/libreqos
>
>
>
> --
> This song goes out to all the folk that thought Stadia would work:
>
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> Dave Täht CEO, TekLibre, LLC
>

[-- Attachment #2: Type: text/html, Size: 29001 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 14:01               ` Herbert Wolverson
@ 2022-10-19 14:05                 ` Herbert Wolverson
  2022-10-19 14:48                   ` Robert Chacón
  0 siblings, 1 reply; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-19 14:05 UTC (permalink / raw)
  Cc: libreqos

[-- Attachment #1: Type: text/plain, Size: 22990 bytes --]

Also, I forgot to mention that I *think* the current version has removed
the requirement that the inbound
and outbound classifiers be placed on the same CPU. I know interduo was
particularly keen on packing
upload into fewer cores. I'll add that to my list of things to test.
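
If ingress and egress can now land on different CPUs, the timestamp state
presumably has to live in one shared map with some protection against
concurrent updates. Purely as an illustration (not necessarily how
cpumap-pping handles it), one option is a bpf_spin_lock embedded in the map
value:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_val {
    struct bpf_spin_lock lock;     /* serializes ingress/egress updates */
    __u32 last_tsval;
    __u64 last_tsval_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u64);            /* some flow hash */
    __type(value, struct flow_val);
    __uint(pinning, LIBBPF_PIN_BY_NAME);
} flows SEC(".maps");

static __always_inline void note_tsval(__u64 flow_hash, __u32 tsval, __u64 now_ns)
{
    struct flow_val *v = bpf_map_lookup_elem(&flows, &flow_hash);

    if (!v)
        return;

    bpf_spin_lock(&v->lock);       /* no helper calls allowed while held */
    if (tsval != v->last_tsval) {  /* only timestamp the first sighting */
        v->last_tsval = tsval;
        v->last_tsval_ns = now_ns;
    }
    bpf_spin_unlock(&v->lock);
}
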

On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com>
wrote:

> I'll definitely take a look - that does look interesting. I don't have X11
> on any of my test VMs, but
> it looks like it can work without the GUI.
>
> Thanks!
>
> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>
>> could I coax you to adopt flent?
>>
>> apt-get install flent netperf irtt fping
>>
>> You sometimes have to compile netperf yourself with --enable-demo on
>> some systems.
>> There are a bunch of python libs needed for the gui, but only on the
>> client.
>>
>> Then you can run a really gnarly test series and plot the results over
>> time.
>>
>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>> the_server_name rrul # 110 other tests
>>
>>
>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>> <libreqos@lists.bufferbloat.net> wrote:
>> >
>> > Hey,
>> >
>> > Testing the current version (
>> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
>> than I hoped. This build has shared (not per-cpu) maps, and a userspace
>> daemon (xdp_pping) to extract and reset stats.
>> >
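
For reference, the extract-and-reset pattern from userspace with libbpf looks
roughly like this. The map name, pin path, key encoding and value layout below
are invented for illustration - check the xdp_pping source for the real ones:

/* cc -o xdp_pping_sketch sketch.c -lbpf */
#include <stdio.h>
#include <bpf/bpf.h>

struct rtt_summary { __u32 avg, min, max, samples; };   /* invented layout */

int main(void)
{
    int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/rtt_summaries");
    __u32 key, next;
    int err;

    if (fd < 0) { perror("bpf_obj_get"); return 1; }

    printf("[\n");
    err = bpf_map_get_next_key(fd, NULL, &key);
    while (!err) {
        struct rtt_summary s;

        err = bpf_map_get_next_key(fd, &key, &next);   /* look ahead first */
        if (bpf_map_lookup_elem(fd, &key, &s) == 0)
            printf("{\"tc\":\"%u:%u\", \"avg\" : %u, \"min\" : %u, "
                   "\"max\" : %u, \"samples\" : %u},\n",
                   key >> 16, key & 0xffff, s.avg, s.min, s.max, s.samples);
        bpf_map_delete_elem(fd, &key);                 /* the "reset" part */
        key = next;
    }
    printf("{}]\n");
    return 0;
}
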
>> > My testing environment has grown a bit:
>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
>> cpumap-pping-hackjob version of xdp-cpumap.
>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf
>> server.
>> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hosts
>> iperf client.
>> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hosts
>> iperf client.
>> >
>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are
>> on a virtual switch.
>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a
>> different virtual switch.
>> >
>> > These are all on a host machine running Windows 11, a core i7 12th gen,
>> 32 Gb RAM and fast SSD setup.
>> >
>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>> >
>> > For this test, LibreQoS is configured:
>> > * Two APs, each with 5gbit/s max.
>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about
>> 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>> > * Set to use Cake
>> >
>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500
>> (for a long run). Running xdp_pping yields correct results:
>> >
>> > [
>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > {}]
>> >
>> > Or when I waited a while to gather/reset:
>> >
>> > [
>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>> > {}]
>> >
>> > The ShaperVM shows no errors, just periodic logging that it is
>> recording data.  CPU is about 2-3% on two CPUs, zero on the others (as
>> expected).
>> >
>> > After 500 seconds of continual iperfing, each client reported a
>> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>> >
>> > So for smaller streams, I'd call this a success.
>> >
>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>> >
>> > For this test, LibreQoS is configured:
>> > * Two APs, each with 5gb/s max.
>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s!
>> Mapped to 1:5 and 2:5 respectively (separate CPUs).
>> >
>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>> >
>> > xdp_pping shows results, too:
>> >
>> > [
>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>> > {}]
>> >
>> > [
>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>> > {}]
>> >
>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>> >
>> > After 500 seconds of continual iperfing, each client reported a
>> throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec and 226 GBytes.
>> >
>> > Maxing out HyperV like this is inducing a bit of latency (which is to
>> be expected), but it's not bad. I also forgot to disable hyperthreading,
>> and looking at the host performance it is sometimes running the second
>> virtual CPU on an underpowered "fake" CPU.
>> >
>> > So for two large streams, I think we're doing pretty well also!
>> >
>> > TEST 3: DUAL STREAMS, SINGLE CPU
>> >
>> > This test is designed to try and blow things up. It's the same as test
>> 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>> >
>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The
>> pping stats start to show a bit of degradation in performance for pounding
>> it so hard:
>> >
>> > [
>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>> > {}]
>> >
>> > For whatever reason, it smoothed out over time:
>> >
>> > [
>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>> > {}]
>> >
>> > Surprisingly (to me), I didn't encounter errors. Each client received
>> 2.22 Gbit/s performance, over 129 Gbytes of data.
>> >
>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>> >
>> > This test is also designed to break things. Same as test 3, but using
>> iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the
>> flow tracking. (Shorter time window because I really wanted to go and find
>> coffee)
>> >
>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results
>> show that this torture test is worsening performance, and there's always
>> lots of samples in the buffer:
>> >
>> > [
>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>> > {}]
>> >
>> > This test also ran better than I expected. You can definitely see some
>> latency creeping in as I make the system work hard. Each VM showed around
>> 2.4 Gbit/s in total performance at the end of the iperf session. There's
>> definitely some latency creeping in, which is expected - but I'm not sure I
>> expected quite that much.
>> >
>> > WHAT'S NEXT & CONCLUSION
>> >
>> > I noticed that I forgot to turn off efficient power management on my
>> VMs and host, and left Hyperthreading on by mistake. So that hurts overall
>> performance.
>> >
>> > The base system seems to be working pretty solidly, at least for small
>> tests. Next up, I'll be removing extraneous debug reporting code, removing
>> some code paths that don't do anything but report, and looking for any
>> small optimization opportunities. I'll then re-run these tests. Once that's
>> done, I hope to find a maintenance window on my WISP and try it with actual
>> traffic.
>> >
>> > I also need to re-run these tests without the pping system to provide
>> some before/after analysis.
>> >
>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <
>> herberticus@gmail.com> wrote:
>> >>
>> >> It's probably not entirely thread-safe right now (ran into some issues
>> reading per_cpu maps back from userspace; hopefully, I'll get that figured
>> out) - but the commits I just pushed have it basically working on
>> single-stream testing. :-)
>> >>
>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives you
>> per-connection RTT information in JSON:
>> >>
>> >> [
>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>> >> {}]
>> >>
>> >> (With the extra {} because I'm not tracking the tail and haven't done
>> comma removal). The tool also empties the various maps used to gather data,
>> acting as a "reset" point. There's a max of 60 samples per queue, in a
>> ringbuffer setup (so newest will start to overwrite the oldest).
>> >>
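
For anyone picturing the data structure: a fixed 60-slot ring per tc handle
could look roughly like the sketch below. Field names are invented; this is
not the actual code.

#include <linux/types.h>
#include <bpf/bpf_helpers.h>

#define MAX_SAMPLES 60    /* per-queue cap mentioned above */

struct rtt_ring {
    __u32 rtt_ms[MAX_SAMPLES];
    __u32 next;           /* next slot to overwrite */
    __u32 count;          /* capped at MAX_SAMPLES */
};

static __always_inline void ring_push(struct rtt_ring *r, __u32 rtt_ms)
{
    __u32 idx = r->next;

    if (idx >= MAX_SAMPLES)          /* explicit bound for the verifier */
        idx = 0;
    r->rtt_ms[idx] = rtt_ms;         /* newest overwrites oldest on wrap */
    r->next = (idx + 1 < MAX_SAMPLES) ? idx + 1 : 0;
    if (r->count < MAX_SAMPLES)
        r->count++;
}
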
>> >> I'll start trying to test on a larger scale now.
>> >>
>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
>> robert.chacon@jackrabbitwireless.com> wrote:
>> >>>
>> >>> Hey Herbert,
>> >>>
>> >>> Fantastic work! Super exciting to see this coming together,
>> especially so quickly.
>> >>> I'll test it soon.
>> >>> I understand and agree with your decision to omit certain features
>> (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case.
>> Like you said, in order to merge the functionality without a performance
>> hit, merging them is sort of the only way right now. Otherwise there would
>> be a lot of redundancy and lost throughput for an ISP's use. Though
>> hopefully long term there will be a way to keep all projects working
>> independently but interoperably with a plugin system of some kind.
>> >>>
>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
>> optimizations for high sub counts (8000+ subs) as well as stateful changes
>> to the queue structure.
>> >>> I'm working to set up a physical lab to test high throughput and high
>> client count scenarios.
>> >>> When testing beyond ~32,000 filters we get "no space left on device"
>> from xdp-cpumap-tc, which I think relates to the bpf map size limitation
>> you mentioned. Maybe in the coming months we can take a look at that.
>> >>>
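
On the "no space left on device" error: if it really is a BPF map filling up,
the ceiling is whatever max_entries the map was created with, so the fix may
be as simple as raising it, or setting it at load time. The map below is an
invented stand-in, not the actual xdp-cpumap-tc map:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Invented stand-in for an IP -> CPU/handle lookup map. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 32768);    /* bump this, or override it before
                                    * loading with bpf_map__set_max_entries()
                                    * from the userspace loader */
    __type(key, __u32);            /* IPv4 address */
    __type(value, __u32);          /* target CPU / tc handle */
} ip_to_cpu SEC(".maps");
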
>> >>> Anyway great work on the cpumap-pping program! Excited to see more on
>> this.
>> >>>
>> >>> Thanks,
>> >>> Robert
>> >>>
>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
>> libreqos@lists.bufferbloat.net> wrote:
>> >>>>
>> >>>> Hey,
>> >>>>
>> >>>> My current (unfinished) progress on this is now available here:
>> https://github.com/thebracket/cpumap-pping-hackjob
>> >>>>
>> >>>> I mean it about the warnings, this isn't at all stable, debugged -
>> and can't promise that it won't unleash the nasal demons
>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>> >>>>
>> >>>> With that said, I'm pretty happy so far:
>> >>>>
>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
>> shunted onto a dedicated CPU. It has to run on both
>> >>>>   the inbound and outbound classifiers, since otherwise it would
>> only see half the conversation.
>> >>>> * It does assume that your ingress and egress CPUs are mapped to the
>> same interface; I do that anyway in BracketQoS. Not doing
>> >>>>   that opens up a potential world of pain, since writes to the
>> shared maps would require a locking scheme. Too much locking, and you lose
>> all of the benefit of using multiple CPUs to begin with.
>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've
>> worked with have lots of it.
>> >>>> * I've been gradually removing features that I don't want for
>> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do
>> that.
>> >>>> * Rate limiting is working, but I removed the requirement for a
>> shared configuration provided from userland - so right now it's always set
>> to report at 1 second intervals per stream.
>> >>>>
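
The fixed 1-second-per-stream reporting interval is easy to picture as a
per-flow timestamp check; a hedged sketch with invented names, not the actual
code:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define REPORT_INTERVAL_NS 1000000000ULL   /* hard-coded 1 second */

struct flow_report { __u64 last_report_ns; };

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u64);                    /* flow hash */
    __type(value, struct flow_report);
} report_times SEC(".maps");

/* Returns 1 if this flow may emit a performance event now, 0 otherwise. */
static __always_inline int may_report(__u64 flow_hash)
{
    __u64 now = bpf_ktime_get_ns();
    struct flow_report *fr = bpf_map_lookup_elem(&report_times, &flow_hash);

    if (!fr) {
        struct flow_report init = { .last_report_ns = now };

        bpf_map_update_elem(&report_times, &flow_hash, &init, BPF_NOEXIST);
        return 1;
    }
    if (now - fr->last_report_ns < REPORT_INTERVAL_NS)
        return 0;
    fr->last_report_ns = now;
    return 1;
}
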
>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and
>> "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s
>> max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and
>> fast SSDs)
>> >>>>
>> >>>> Output currently consists of debug messages reading:
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk:
>> (tc) Flow open event
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk:
>> (tc) Send performance event (5,1), 374696
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk:
>> (tc) Flow open event
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk:
>> (tc) Send performance event (5,1), 247069
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk:
>> (tc) Send performance event (5,1), 5217155
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk:
>> (tc) Send performance event (5,1), 4515394
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk:
>> (tc) Send performance event (5,1), 4481289
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk:
>> (tc) Send performance event (5,1), 4255268
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk:
>> (tc) Send performance event (5,1), 5249493
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk:
>> (tc) Send performance event (5,1), 3795993
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk:
>> (tc) Send performance event (5,1), 3949519
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk:
>> (tc) Send performance event (5,1), 4365335
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk:
>> (tc) Send performance event (5,1), 4154910
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk:
>> (tc) Send performance event (5,1), 4405582
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk:
>> (tc) Send flow event
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk:
>> (tc) Send flow event
>> >>>>
>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
>> major/minor, allocated by the xdp-cpumap parent.
>> >>>> I get pretty low latency between VMs; I'll set up a test with some
>> real-world data very soon.
>> >>>>
>> >>>> I plan to keep hacking away, but feel free to take a peek.
>> >>>>
>> >>>> Thanks,
>> >>>> Herbert
>> >>>>
>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
>> Simon.Sundberg@kau.se> wrote:
>> >>>>>
>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>> >>>>> notes.
>> >>>>>
>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>> >>>>> > [ Adding Simon to Cc ]
>> >>>>> >
>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
>> writes:
>> >>>>> >
>> >>>>> > > Hey,
>> >>>>> > >
>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>> >>>>> > >
>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>> >>>>> > > into xdp-cpumap-tc (
>> https://github.com/xdp-project/xdp-cpumap-tc ).
>> >>>>> > >
>> >>>>> > > I ported over most of the xdp-pping code, and then changed the
>> entry point
>> >>>>> > > and packet parsing code to make use of the work already done in
>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet,
>> no need to do
>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to
>> pin them -
>> >>>>> > > otherwise the two tc instances don't properly share data.
>> >>>>> > >
>> >>>>>
>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow
>> may
>> >>>>> be processed by different cores on ingress and egress the per-CPU
>> maps
>> >>>>> will not really work reliably as each core will have a different
>> view
>> >>>>> on the state of the flow, if there's been a previous packet with a
>> >>>>> certain TSval from that flow etc.
>> >>>>>
>> >>>>> Furthermore, if a flow is always processed on the same core (on both
>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>> >>>>> shared across all CPUs, it's just that each CPU gets its own value.
>> So
>> >>>>> all CPUs will then have their own data for each flow, but it's only
>> the
>> >>>>> CPU processing the flow that will have any relevant data for the
>> flow
>> >>>>> while the remaining CPUs will just have an empty state for that
>> flow.
>> >>>>> Under the same assumption that packets within the same flow are
>> always
>> >>>>> processed on the same core there should generally not be any
>> >>>>> concurrency issues with having a global (non-per-CPU) map either as
>> packets
>> >>>>> from the same flow cannot be processed concurrently then (and thus
>> no
>> >>>>> concurrent access to the same value in the map). I am however still
>> >>>>> very unclear on if there's any considerable performance impact
>> between
>> >>>>> global and per-CPU map versions if the same key is not accessed
>> >>>>> concurrently.
>> >>>>>
>> >>>>> > > Right now, output
>> >>>>> > > is just stubbed - I've still got to port the perfmap output
>> code. Instead,
>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so
>> I can see
>> >>>>> > > roughly what the output would look like.
>> >>>>> > >
>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9
>> Gbits/sec on
>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the
>> middle). :-)
>> >>>>> >
>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>> >>>>> > overhead, then. What Simon found was that sending the data from
>> kernel
>> >>>>> > to userspace is one of the most expensive bits of epping, at
>> least when
>> >>>>> > the number of data points goes up (which it does as additional
>> flows are
>> >>>>> > added).
>> >>>>>
>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may
>> get
>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>> >>>>> direct overhead from the tool itself, but also becomes demanding for
>> >>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with
>> that is
>> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit
>> and
>> >>>>> -R/--rtt-rate
>> >>>>> >
>> >>>>> > > So my question: how would you prefer to receive this data? I'll
>> have to
>> >>>>> > > write a daemon that provides userspace control (periodic
>> cleanup as well as
>> >>>>> > > reading the performance stream), so the world's kinda our
>> oyster. I can
>> >>>>> > > stick to Kathie's original format (and dump it to a named pipe,
>> perhaps?),
>> >>>>> > > a condensed format that only shows what you want to use, an
>> efficient
>> >>>>> > > binary format if you feel like parsing that...
>> >>>>> >
>> >>>>> > It would be great if we could combine efforts a bit here so we
>> don't
>> >>>>> > fork the codebase more than we have to. I.e., if "upstream"
>> epping and
>> >>>>> > whatever daemon you end up writing can agree on data format etc
>> that
>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>> >>>>> >
>> >>>>> > Briefly what I've discussed before with Simon was to have the
>> ability to
>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a
>> userspace
>> >>>>> > utility periodically pull them out. What we discussed was doing
>> this
>> >>>>> > using an LPM map (which is not in that PR yet). The idea would be
>> that
>> >>>>> > userspace would populate the LPM map with the keys (prefixes) they
>> >>>>> > wanted statistics for (in LibreQOS context that could be one key
>> per
>> >>>>> > customer, for instance). Epping would then do a map lookup into
>> the LPM,
>> >>>>> > and if it gets a match it would update the statistics in that map
>> entry
>> >>>>> > (keeping a histogram of latency values seen, basically). Simon's
>> PR
>> >>>>> > below uses this technique where userspace will "reset" the
>> histogram
>> >>>>> > every time it loads it by swapping out two different map entries
>> when it
>> >>>>> > does a read; this allows you to control the sampling rate from
>> >>>>> > userspace, and you'll just get the data since the last time you
>> polled.
>> >>>>>
>> >>>>> Thanks, Toke, for summarizing both the current state and the plan
>> going
>> >>>>> forward. I will just note that this PR (and all my other work with
>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or
>> less
>> >>>>> on hold for a couple of weeks right now as I'm trying to finish up a
>> >>>>> paper.
>> >>>>>
>> >>>>> > I was thinking that if we all can agree on the map format, then
>> your
>> >>>>> > polling daemon could be one userspace "client" for that, and the
>> epping
>> >>>>> > binary itself could be another; but we could keep compatibility
>> between
>> >>>>> > the two, so we don't duplicate effort.
>> >>>>> >
>> >>>>> > Similarly, refactoring of the epping code itself so it can be
>> plugged
>> >>>>> > into the cpumap-tc code would be a good goal...
>> >>>>>
>> >>>>> Should probably do that...at some point. In general I think it's a
>> bit
>> >>>>> of an interesting problem to think about how to chain multiple
>> XDP/tc
>> >>>>> programs together in an efficient way. Most XDP and tc programs will do
>> >>>>> some amount of packet parsing, and when you have many chained programs
>> >>>>> parsing the same packets this obviously becomes a bit wasteful. At the
>> >>>>> same time it would be nice if one didn't need to manually merge
>> >>>>> multiple programs together into a single one like this to get rid of
>> >>>>> this duplicated parsing, or at least make that process of merging
>> those
>> >>>>> programs as simple as possible.
>> >>>>>
>> >>>>>
>> >>>>> > -Toke
>> >>>>> >
>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>> >>>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> LibreQoS mailing list
>> >>>> LibreQoS@lists.bufferbloat.net
>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Robert Chacón
>> >>> CEO | JackRabbit Wireless LLC
>> >
>> > _______________________________________________
>> > LibreQoS mailing list
>> > LibreQoS@lists.bufferbloat.net
>> > https://lists.bufferbloat.net/listinfo/libreqos
>>
>>
>>
>> --
>> This song goes out to all the folk that thought Stadia would work:
>>
>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> Dave Täht CEO, TekLibre, LLC
>>
>

[-- Attachment #2: Type: text/html, Size: 29706 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 14:05                 ` Herbert Wolverson
@ 2022-10-19 14:48                   ` Robert Chacón
  2022-10-19 15:49                     ` dan
  0 siblings, 1 reply; 20+ messages in thread
From: Robert Chacón @ 2022-10-19 14:48 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: libreqos

[-- Attachment #1: Type: text/plain, Size: 24446 bytes --]

Awesome work on this!
I suspect there should be a slight performance bump once Hyperthreading is
disabled and efficient power management is off.
Hyperthreading/SMT always messes with HTB performance when I leave it on.
Thank you for mentioning that - I now went ahead and added instructions on
disabling hyperthreading on the Wiki for new users.
Super promising results!
Interested to see how throughput compares between plain xdp-cpumap-tc and
cpumap-pping. So far in your VM setup it seems to be doing very well.

On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <
libreqos@lists.bufferbloat.net> wrote:

> Also, I forgot to mention that I *think* the current version has removed
> the requirement that the inbound
> and outbound classifiers be placed on the same CPU. I know interduo was
> particularly keen on packing
> upload into fewer cores. I'll add that to my list of things to test.
>
> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com>
> wrote:
>
>> I'll definitely take a look - that does look interesting. I don't have
>> X11 on any of my test VMs, but
>> it looks like it can work without the GUI.
>>
>> Thanks!
>>
>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>>
>>> could I coax you to adopt flent?
>>>
>>> apt-get install flent netperf irtt fping
>>>
>>> You sometimes have to compile netperf yourself with --enable-demo on
>>> some systems.
>>> There are a bunch of python libs needed for the gui, but only on the
>>> client.
>>>
>>> Then you can run a really gnarly test series and plot the results over
>>> time.
>>>
>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>>> the_server_name rrul # 110 other tests
>>>
>>>
>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>>> <libreqos@lists.bufferbloat.net> wrote:
>>> >
>>> > Hey,
>>> >
>>> > Testing the current version (
>>> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
>>> than I hoped. This build has shared (not per-cpu) maps, and a userspace
>>> daemon (xdp_pping) to extract and reset stats.
>>> >
>>> > My testing environment has grown a bit:
>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
>>> cpumap-pping-hackjob version of xdp-cpumap.
>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf
>>> server.
>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2.
>>> Hosts iperf client.
>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3.
>>> Hosts iperf client.
>>> >
>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are
>>> on a virtual switch.
>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a
>>> different virtual switch.
>>> >
>>> > These are all on a host machine running Windows 11, a core i7 12th
>>> gen, 32 Gb RAM and fast SSD setup.
>>> >
>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>>> >
>>> > For this test, LibreQoS is configured:
>>> > * Two APs, each with 5gbit/s max.
>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about
>>> 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>>> > * Set to use Cake
>>> >
>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500
>>> (for a long run). Running xdp_pping yields correct results:
>>> >
>>> > [
>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>> > {}]
>>> >
>>> > Or when I waited a while to gather/reset:
>>> >
>>> > [
>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>>> > {}]
>>> >
>>> > The ShaperVM shows no errors, just periodic logging that it is
>>> recording data.  CPU is about 2-3% on two CPUs, zero on the others (as
>>> expected).
>>> >
>>> > After 500 seconds of continual iperfing, each client reported a
>>> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>>> >
>>> > So for smaller streams, I'd call this a success.
>>> >
>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>>> >
>>> > For this test, LibreQoS is configured:
>>> > * Two APs, each with 5gb/s max.
>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s!
>>> Mapped to 1:5 and 2:5 respectively (separate CPUs).
>>> >
>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>>> >
>>> > xdp_pping shows results, too:
>>> >
>>> > [
>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>>> > {}]
>>> >
>>> > [
>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>>> > {}]
>>> >
>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>>> >
>>> > After 500 seconds of continual iperfing, each client reported a
>>> throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec and 226 GBytes.
>>> >
>>> > Maxing out HyperV like this is inducing a bit of latency (which is to
>>> be expected), but it's not bad. I also forgot to disable hyperthreading,
>>> and looking at the host performance it is sometimes running the second
>>> virtual CPU on an underpowered "fake" CPU.
>>> >
>>> > So for two large streams, I think we're doing pretty well also!
>>> >
>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>>> >
>>> > This test is designed to try and blow things up. It's the same as test
>>> 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>>> >
>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The
>>> pping stats start to show a bit of degradation in performance for pounding
>>> it so hard:
>>> >
>>> > [
>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>>> > {}]
>>> >
>>> > For whatever reason, it smoothed out over time:
>>> >
>>> > [
>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>>> > {}]
>>> >
>>> > Surprisingly (to me), I didn't encounter errors. Each client received
>>> 2.22 Gbit/s performance, over 129 Gbytes of data.
>>> >
>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>>> >
>>> > This test is also designed to break things. Same as test 3, but using
>>> iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the
>>> flow tracking. (Shorter time window because I really wanted to go and find
>>> coffee)
>>> >
>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results
>>> show that this torture test is worsening performance, and there's always
>>> lots of samples in the buffer:
>>> >
>>> > [
>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>>> > {}]
>>> >
>>> > This test also ran better than I expected. You can definitely see some
>>> latency creeping in as I make the system work hard. Each VM showed around
>>> 2.4 Gbit/s in total performance at the end of the iperf session. There's
>>> definitely some latency creeping in, which is expected - but I'm not sure I
>>> expected quite that much.
>>> >
>>> > WHAT'S NEXT & CONCLUSION
>>> >
>>> > I noticed that I forgot to turn off efficient power management on my
>>> VMs and host, and left Hyperthreading on by mistake. So that hurts overall
>>> performance.
>>> >
>>> > The base system seems to be working pretty solidly, at least for small
>>> tests. Next up, I'll be removing extraneous debug reporting code, removing
>>> some code paths that don't do anything but report, and looking for any
>>> small optimization opportunities. I'll then re-run these tests. Once that's
>>> done, I hope to find a maintenance window on my WISP and try it with actual
>>> traffic.
>>> >
>>> > I also need to re-run these tests without the pping system to provide
>>> some before/after analysis.
>>> >
>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <
>>> herberticus@gmail.com> wrote:
>>> >>
>>> >> It's probably not entirely thread-safe right now (ran into some
>>> issues reading per_cpu maps back from userspace; hopefully, I'll get that
>>> figured out) - but the commits I just pushed have it basically working on
>>> single-stream testing. :-)
>>> >>
>>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives you
>>> per-connection RTT information in JSON:
>>> >>
>>> >> [
>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>>> >> {}]
>>> >>
>>> >> (With the extra {} because I'm not tracking the tail and haven't done
>>> comma removal). The tool also empties the various maps used to gather data,
>>> acting as a "reset" point. There's a max of 60 samples per queue, in a
>>> ringbuffer setup (so newest will start to overwrite the oldest).
>>> >>
>>> >> I'll start trying to test on a larger scale now.
>>> >>
>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
>>> robert.chacon@jackrabbitwireless.com> wrote:
>>> >>>
>>> >>> Hey Herbert,
>>> >>>
>>> >>> Fantastic work! Super exciting to see this coming together,
>>> especially so quickly.
>>> >>> I'll test it soon.
>>> >>> I understand and agree with your decision to omit certain features
>>> (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case.
>>> Like you said, in order to merge the functionality without a performance
>>> hit, merging them is sort of the only way right now. Otherwise there would
>>> be a lot of redundancy and lost throughput for an ISP's use. Though
>>> hopefully long term there will be a way to keep all projects working
>>> independently but interoperably with a plugin system of some kind.
>>> >>>
>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
>>> optimizations for high sub counts (8000+ subs) as well as stateful changes
>>> to the queue structure.
>>> >>> I'm working to set up a physical lab to test high throughput and
>>> high client count scenarios.
>>> >>> When testing beyond ~32,000 filters we get "no space left on device"
>>> from xdp-cpumap-tc, which I think relates to the bpf map size limitation
>>> you mentioned. Maybe in the coming months we can take a look at that.
>>> >>>
>>> >>> Anyway great work on the cpumap-pping program! Excited to see more
>>> on this.
>>> >>>
>>> >>> Thanks,
>>> >>> Robert
>>> >>>
>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
>>> libreqos@lists.bufferbloat.net> wrote:
>>> >>>>
>>> >>>> Hey,
>>> >>>>
>>> >>>> My current (unfinished) progress on this is now available here:
>>> https://github.com/thebracket/cpumap-pping-hackjob
>>> >>>>
>>> >>>> I mean it about the warnings, this isn't at all stable, debugged -
>>> and can't promise that it won't unleash the nasal demons
>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>> >>>>
>>> >>>> With that said, I'm pretty happy so far:
>>> >>>>
>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
>>> shunted onto a dedicated CPU. It has to run on both
>>> >>>>   the inbound and outbound classifiers, since otherwise it would
>>> only see half the conversation.
>>> >>>> * It does assume that your ingress and egress CPUs are mapped to
>>> the same interface; I do that anyway in BracketQoS. Not doing
>>> >>>>   that opens up a potential world of pain, since writes to the
>>> shared maps would require a locking scheme. Too much locking, and you lose
>>> all of the benefit of using multiple CPUs to begin with.
>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've
>>> worked with have lots of it.
>>> >>>> * I've been gradually removing features that I don't want for
>>> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do
>>> that.
>>> >>>> * Rate limiting is working, but I removed the requirement for a
>>> shared configuration provided from userland - so right now it's always set
>>> to report at 1 second intervals per stream.
>>> >>>>
>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and
>>> "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s
>>> max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and
>>> fast SSDs)
>>> >>>>
>>> >>>> Output currently consists of debug messages reading:
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222:
>>> bpf_trace_printk: (tc) Flow open event
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 374696
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466:
>>> bpf_trace_printk: (tc) Flow open event
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 247069
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048:
>>> bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080:
>>> bpf_trace_printk: (tc) Send flow event
>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714:
>>> bpf_trace_printk: (tc) Send flow event
>>> >>>>
>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
>>> major/minor, allocated by the xdp-cpumap parent.
>>> >>>> I get pretty low latency between VMs; I'll set up a test with some
>>> real-world data very soon.
>>> >>>>
>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Herbert
>>> >>>>
>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
>>> Simon.Sundberg@kau.se> wrote:
>>> >>>>>
>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of
>>> quick
>>> >>>>> notes.
>>> >>>>>
>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>> >>>>> > [ Adding Simon to Cc ]
>>> >>>>> >
>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
>>> writes:
>>> >>>>> >
>>> >>>>> > > Hey,
>>> >>>>> > >
>>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>>> >>>>> > >
>>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>> >>>>> > > into xdp-cpumap-tc (
>>> https://github.com/xdp-project/xdp-cpumap-tc ).
>>> >>>>> > >
>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the
>>> entry point
>>> >>>>> > > and packet parsing code to make use of the work already done in
>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet,
>>> no need to do
>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had
>>> to pin them -
>>> >>>>> > > otherwise the two tc instances don't properly share data.
>>> >>>>> > >
>>> >>>>>
>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed
>>> on
>>> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow
>>> may
>>> >>>>> be processed by different cores on ingress and egress the per-CPU
>>> maps
>>> >>>>> will not really work reliably as each core will have a different
>>> view
>>> >>>>> on the state of the flow, if there's been a previous packet with a
>>> >>>>> certain TSval from that flow etc.
>>> >>>>>
>>> >>>>> Furthermore, if a flow is always processed on the same core (on
>>> both
>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>>> >>>>> shared across all CPUs, it's just that each CPU gets its own
>>> value. So
>>> >>>>> all CPUs will then have their own data for each flow, but it's
>>> only the
>>> >>>>> CPU processing the flow that will have any relevant data for the
>>> flow
>>> >>>>> while the remaining CPUs will just have an empty state for that
>>> flow.
>>> >>>>> Under the same assumption that packets within the same flow are
>>> always
>>> >>>>> processed on the same core there should generally not be any
>>> >>>>> concurrency issues with having a global (non-per-CPU) map either as
>>> packets
>>> >>>>> from the same flow cannot be processed concurrently then (and thus
>>> no
>>> >>>>> concurrent access to the same value in the map). I am however still
>>> >>>>> very unclear on if there's any considerable performance impact
>>> between
>>> >>>>> global and per-CPU map versions if the same key is not accessed
>>> >>>>> concurrently.
>>> >>>>>
>>> >>>>> > > Right now, output
>>> >>>>> > > is just stubbed - I've still got to port the perfmap output
>>> code. Instead,
>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so
>>> I can see
>>> >>>>> > > roughly what the output would look like.
>>> >>>>> > >
>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9
>>> Gbits/sec on
>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the
>>> middle). :-)
>>> >>>>> >
>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>>> >>>>> > overhead, then. What Simon found was that sending the data from
>>> kernel
>>> >>>>> > to userspace is one of the most expensive bits of epping, at
>>> least when
>>> >>>>> > the number of data points goes up (which it does as additional
>>> flows are
>>> >>>>> > added).
>>> >>>>>
>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may
>>> get
>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms
>>> of
>>> >>>>> direct overhead from the tool itself, but also becomes demanding
>>> for
>>> >>>>> whatever you use all those RTT samples for (i.e. need to log,
>>> parse,
>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with
>>> that is
>>> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit
>>> and
>>> >>>>> -R/--rtt-rate
>>> >>>>> >
>>> >>>>> > > So my question: how would you prefer to receive this data?
>>> I'll have to
>>> >>>>> > > write a daemon that provides userspace control (periodic
>>> cleanup as well as
>>> >>>>> > > reading the performance stream), so the world's kinda our
>>> oyster. I can
>>> >>>>> > > stick to Kathie's original format (and dump it to a named
>>> pipe, perhaps?),
>>> >>>>> > > a condensed format that only shows what you want to use, an
>>> efficient
>>> >>>>> > > binary format if you feel like parsing that...
>>> >>>>> >
>>> >>>>> > It would be great if we could combine efforts a bit here so we
>>> don't
>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream"
>>> epping and
>>> >>>>> > whatever daemon you end up writing can agree on data format etc
>>> that
>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>> >>>>> >
>>> >>>>> > Briefly what I've discussed before with Simon was to have the
>>> ability to
>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a
>>> userspace
>>> >>>>> > utility periodically pull them out. What we discussed was doing
>>> this
>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would
>>> be that
>>> >>>>> > userspace would populate the LPM map with the keys (prefixes)
>>> they
>>> >>>>> > wanted statistics for (in LibreQOS context that could be one key
>>> per
>>> >>>>> > customer, for instance). Epping would then do a map lookup into
>>> the LPM,
>>> >>>>> > and if it gets a match it would update the statistics in that
>>> map entry
>>> >>>>> > (keeping a histogram of latency values seen, basically). Simon's
>>> PR
>>> >>>>> > below uses this technique where userspace will "reset" the
>>> histogram
>>> >>>>> > every time it loads it by swapping out two different map entries
>>> when it
>>> >>>>> > does a read; this allows you to control the sampling rate from
>>> >>>>> > userspace, and you'll just get the data since the last time you
>>> polled.
>>> >>>>>
>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan
>>> going
>>> >>>>> forward. I will just note that this PR (and all my other work with
>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or
>>> less
>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish up
>>> a
>>> >>>>> paper.
>>> >>>>>
>>> >>>>> > I was thinking that if we all can agree on the map format, then
>>> your
>>> >>>>> > polling daemon could be one userspace "client" for that, and the
>>> epping
>>> >>>>> > binary itself could be another; but we could keep compatibility
>>> between
>>> >>>>> > the two, so we don't duplicate effort.
>>> >>>>> >
>>> >>>>> > Similarly, refactoring of the epping code itself so it can be
>>> plugged
>>> >>>>> > into the cpumap-tc code would be a good goal...
>>> >>>>>
>>> >>>>> Should probably do that...at some point. In general I think it's a
>>> bit
>>> >>>>> of an interesting problem to think about how to chain multiple
>>> XDP/tc
>>> >>>>> programs together in an efficient way. Most XDP and tc programs will do
>>> >>>>> some amount of packet parsing, and when you have many chained programs
>>> >>>>> parsing the same packets this obviously becomes a bit wasteful. At the
>>> >>>>> same time it would be nice if one didn't need to manually merge
>>> >>>>> multiple programs together into a single one like this to get rid
>>> of
>>> >>>>> this duplicated parsing, or at least make that process of merging
>>> those
>>> >>>>> programs as simple as possible.
>>> >>>>>
>>> >>>>>
>>> >>>>> > -Toke
>>> >>>>> >
>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>> >>>>>
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> LibreQoS mailing list
>>> >>>> LibreQoS@lists.bufferbloat.net
>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Robert Chacón
>>> >>> CEO | JackRabbit Wireless LLC
>>> >
>>> > _______________________________________________
>>> > LibreQoS mailing list
>>> > LibreQoS@lists.bufferbloat.net
>>> > https://lists.bufferbloat.net/listinfo/libreqos
>>>
>>>
>>>
>>> --
>>> This song goes out to all the folk that thought Stadia would work:
>>>
>>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>>> Dave Täht CEO, TekLibre, LLC
>>>
>> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos
>


-- 
Robert Chacón
CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>

[-- Attachment #2: Type: text/html, Size: 31212 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 14:48                   ` Robert Chacón
@ 2022-10-19 15:49                     ` dan
  2022-10-19 16:10                       ` Herbert Wolverson
  0 siblings, 1 reply; 20+ messages in thread
From: dan @ 2022-10-19 15:49 UTC (permalink / raw)
  To: Robert Chacón; +Cc: Herbert Wolverson, libreqos

[-- Attachment #1: Type: text/plain, Size: 25440 bytes --]

Those 'efficiency' cores (E-cores) in Intel 12th gen should probably be
addressed as well. You can't turn them off in BIOS.

On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <
libreqos@lists.bufferbloat.net> wrote:

> Awesome work on this!
> I suspect there should be a slight performance bump once Hyperthreading is
> disabled and efficient power management is off.
> Hyperthreading/SMT always messes with HTB performance when I leave it on.
> Thank you for mentioning that - I now went ahead and added instructions on
> disabling hyperthreading on the Wiki for new users.
> Super promising results!
> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping.
> So far in your VM setup it seems to be doing very well.
>
> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <
> libreqos@lists.bufferbloat.net> wrote:
>
>> Also, I forgot to mention that I *think* the current version has removed
>> the requirement that the inbound
>> and outbound classifiers be placed on the same CPU. I know interduo was
>> particularly keen on packing
>> upload into fewer cores. I'll add that to my list of things to test.
>>
>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com>
>> wrote:
>>
>>> I'll definitely take a look - that does look interesting. I don't have
>>> X11 on any of my test VMs, but
>>> it looks like it can work without the GUI.
>>>
>>> Thanks!
>>>
>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>>>
>>>> could I coax you to adopt flent?
>>>>
>>>> apt-get install flent netperf irtt fping
>>>>
>>>> You sometimes have to compile netperf yourself with --enable-demo on
>>>> some systems.
>>>> There are a bunch of python libs needed for the gui, but only on the
>>>> client.
>>>>
>>>> Then you can run a really gnarly test series and plot the results over
>>>> time.
>>>>
>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>>>> the_server_name rrul # 110 other tests
>>>>
>>>>
>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>>>> <libreqos@lists.bufferbloat.net> wrote:
>>>> >
>>>> > Hey,
>>>> >
>>>> > Testing the current version (
>>>> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing
>>>> better than I hoped. This build has shared (not per-cpu) maps, and a
>>>> userspace daemon (xdp_pping) to extract and reset stats.
>>>> >
>>>> > My testing environment has grown a bit:
>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
>>>> cpumap-pping-hackjob version of xdp-cpumap.
>>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf
>>>> server.
>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2.
>>>> Hosts iperf client.
>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3.
>>>> Hosts iperf client.
>>>> >
>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are
>>>> on a virtual switch.
>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a
>>>> different virtual switch.
>>>> >
>>>> > These are all on a host machine running Windows 11, a core i7 12th
>>>> gen, 32 Gb RAM and fast SSD setup.
>>>> >
>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>>>> >
>>>> > For this test, LibreQoS is configured:
>>>> > * Two APs, each with 5gbit/s max.
>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about
>>>> 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>>>> > * Set to use Cake
>>>> >
>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t
>>>> 500 (for a long run). Running xdp_pping yields correct results:
>>>> >
>>>> > [
>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>> > {}]
>>>> >
>>>> > Or when I waited a while to gather/reset:
>>>> >
>>>> > [
>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>>>> > {}]
>>>> >
>>>> > The ShaperVM shows no errors, just periodic logging that it is
>>>> recording data.  CPU is about 2-3% on two CPUs, zero on the others (as
>>>> expected).
>>>> >
>>>> > After 500 seconds of continual iperfing, each client reported a
>>>> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>>>> >
>>>> > So for smaller streams, I'd call this a success.
>>>> >
>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>>>> >
>>>> > For this test, LibreQoS is configured:
>>>> > * Two APs, each with 5gbit/s max.
>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s!
>>>> Mapped to 1:5 and 2:5 respectively (separate CPUs).
>>>> >
>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>>>> >
>>>> > xdp_pping shows results, too:
>>>> >
>>>> > [
>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>>>> > {}]
>>>> >
>>>> > [
>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>>>> > {}]
>>>> >
>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>>>> >
>>>> > After 500 seconds of continual iperfing, each client reported a
>>>> throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec and 226 GBytes.
>>>> >
>>>> > Maxing out HyperV like this is inducing a bit of latency (which is to
>>>> be expected), but it's not bad. I also forgot to disable hyperthreading,
>>>> and looking at the host performance it is sometimes running the second
>>>> virtual CPU on an underpowered "fake" CPU.
>>>> >
>>>> > So for two large streams, I think we're doing pretty well also!
>>>> >
>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>>>> >
>>>> > This test is designed to try and blow things up. It's the same as
>>>> test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and
>>>> 1:6.
>>>> >
>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle.
>>>> The pping stats start to show a bit of degradation in performance for
>>>> pounding it so hard:
>>>> >
>>>> > [
>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>>>> > {}]
>>>> >
>>>> > For whatever reason, it smoothed out over time:
>>>> >
>>>> > [
>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>>>> > {}]
>>>> >
>>>> > Surprisingly (to me), I didn't encounter errors. Each client received
>>>> 2.22 Gbit/s performance, over 129 Gbytes of data.
>>>> >
>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>>>> >
>>>> > This test is also designed to break things. Same as test 3, but using
>>>> iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the
>>>> flow tracking. (Shorter time window because I really wanted to go and find
>>>> coffee)
>>>> >
>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results
>>>> show that this torture test is worsening performance, and there's always
>>>> lots of samples in the buffer:
>>>> >
>>>> > [
>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>>>> > {}]
>>>> >
>>>> > This test also ran better than I expected. You can definitely see
>>>> some latency creeping in as I make the system work hard. Each VM showed
>>>> around 2.4 Gbit/s in total performance at the end of the iperf session.
>>>> There's definitely some latency creeping in, which is expected - but I'm
>>>> not sure I expected quite that much.
>>>> >
>>>> > WHAT'S NEXT & CONCLUSION
>>>> >
>>>> > I noticed that I forgot to turn off efficient power management on my
>>>> VMs and host, and left Hyperthreading on by mistake. So that hurts overall
>>>> performance.
>>>> >
>>>> > The base system seems to be working pretty solidly, at least for
>>>> small tests. Next up, I'll be removing extraneous debug reporting code,
>>>> removing some code paths that don't do anything but report, and looking for
>>>> any small optimization opportunities. I'll then re-run these tests. Once
>>>> that's done, I hope to find a maintenance window on my WISP and try it with
>>>> actual traffic.
>>>> >
>>>> > I also need to re-run these tests without the pping system to provide
>>>> some before/after analysis.
>>>> >
>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <
>>>> herberticus@gmail.com> wrote:
>>>> >>
>>>> >> It's probably not entirely thread-safe right now (ran into some
>>>> issues reading per_cpu maps back from userspace; hopefully, I'll get that
>>>> figured out) - but the commits I just pushed have it basically working on
>>>> single-stream testing. :-)
>>>> >>
>>>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives
>>>> you per-connection RTT information in JSON:
>>>> >>
>>>> >> [
>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>>>> >> {}]
>>>> >>
>>>> >> (With the extra {} because I'm not tracking the tail and haven't
>>>> done comma removal). The tool also empties the various maps used to gather
>>>> data, acting as a "reset" point. There's a max of 60 samples per queue, in
>>>> a ringbuffer setup (so newest will start to overwrite the oldest).
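
The extract-and-reset cycle described above boils down to draining a pinned map
from userspace. A rough sketch of what that loop can look like with libbpf - the
pin path, the key layout (tc major/minor packed into a u32) and the value struct
are assumptions for illustration, not the actual xdp_pping code:

#include <stdio.h>
#include <stdint.h>
#include <bpf/bpf.h>

/* Assumed per-tc-handle summary; the real layout will differ. */
struct rtt_stats {
	uint32_t avg_ms, min_ms, max_ms, samples;
};

int main(void)
{
	/* Assumed pin path; tc pins "global" maps under this directory. */
	int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/rtt_stats_map");
	uint32_t key;   /* assumed key: tc handle, major << 16 | minor */
	struct rtt_stats st;

	if (fd < 0) {
		perror("bpf_obj_get");
		return 1;
	}

	printf("[\n");
	/* Drain the map: always take the current first key, print it, delete it.
	 * Deleting each entry is what makes every invocation a "reset" point. */
	while (bpf_map_get_next_key(fd, NULL, &key) == 0) {
		if (bpf_map_lookup_elem(fd, &key, &st) == 0)
			printf("{\"tc\":\"%u:%u\", \"avg\" : %u, \"min\" : %u, \"max\" : %u, \"samples\" : %u},\n",
			       key >> 16, key & 0xffff,
			       st.avg_ms, st.min_ms, st.max_ms, st.samples);
		if (bpf_map_delete_elem(fd, &key) != 0)
			break;  /* don't spin if the entry can't be removed */
	}
	printf("{}]\n");        /* trailing {} exactly as in the output above */
	return 0;
}

(Built with something like gcc -O2 -o xdp_pping dump.c -lbpf.)
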
>>>> >>
>>>> >> I'll start trying to test on a larger scale now.
>>>> >>
>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
>>>> robert.chacon@jackrabbitwireless.com> wrote:
>>>> >>>
>>>> >>> Hey Herbert,
>>>> >>>
>>>> >>> Fantastic work! Super exciting to see this coming together,
>>>> especially so quickly.
>>>> >>> I'll test it soon.
>>>> >>> I understand and agree with your decision to omit certain features
>>>> (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case.
>>>> Like you said, in order to merge the functionality without a performance
>>>> hit, merging them is sort of the only way right now. Otherwise there would
>>>> be a lot of redundancy and lost throughput for an ISP's use. Though
>>>> hopefully long term there will be a way to keep all projects working
>>>> independently but interoperably with a plugin system of some kind.
>>>> >>>
>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
>>>> optimizations for high sub counts (8000+ subs) as well as stateful changes
>>>> to the queue structure.
>>>> >>> I'm working to set up a physical lab to test high throughput and
>>>> high client count scenarios.
>>>> >>> When testing beyond ~32,000 filters we get "no space left on
>>>> device" from xdp-cpumap-tc, which I think relates to the bpf map size
>>>> limitation you mentioned. Maybe in the coming months we can take a look at
>>>> that.
>>>> >>>
>>>> >>> Anyway great work on the cpumap-pping program! Excited to see more
>>>> on this.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Robert
>>>> >>>
>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
>>>> libreqos@lists.bufferbloat.net> wrote:
>>>> >>>>
>>>> >>>> Hey,
>>>> >>>>
>>>> >>>> My current (unfinished) progress on this is now available here:
>>>> https://github.com/thebracket/cpumap-pping-hackjob
>>>> >>>>
>>>> >>>> I mean it about the warnings, this isn't at all stable, debugged -
>>>> and can't promise that it won't unleash the nasal demons
>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>>> >>>>
>>>> >>>> With that said, I'm pretty happy so far:
>>>> >>>>
>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
>>>> shunted onto a dedicated CPU. It has to run on both
>>>> >>>>   the inbound and outbound classifiers, since otherwise it would
>>>> only see half the conversation.
>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to
>>>> the same interface; I do that anyway in BracketQoS. Not doing
>>>> >>>>   that opens up a potential world of pain, since writes to the
>>>> shared maps would require a locking scheme. Too much locking, and you lose
>>>> all of the benefit of using multiple CPUs to begin with.
>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems
>>>> I've worked with have lots of it.
>>>> >>>> * I've been gradually removing features that I don't want for
>>>> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do
>>>> that.
>>>> >>>> * Rate limiting is working, but I removed the requirement for a
>>>> shared configuration provided from userland - so right now it's always set
>>>> to report at 1 second intervals per stream.
>>>> >>>>
>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and
>>>> "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s
>>>> max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and
>>>> fast SSDs)
>>>> >>>>
>>>> >>>> Output currently consists of debug messages reading:
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222:
>>>> bpf_trace_printk: (tc) Flow open event
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 374696
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466:
>>>> bpf_trace_printk: (tc) Flow open event
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 247069
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048:
>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080:
>>>> bpf_trace_printk: (tc) Send flow event
>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714:
>>>> bpf_trace_printk: (tc) Send flow event
>>>> >>>>
>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
>>>> major/minor, allocated by the xdp-cpumap parent.
>>>> >>>> I get pretty low latency between VMs; I'll set up a test with some
>>>> real-world data very soon.
>>>> >>>>
>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Herbert
>>>> >>>>
>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
>>>> Simon.Sundberg@kau.se> wrote:
>>>> >>>>>
>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of
>>>> quick
>>>> >>>>> notes.
>>>> >>>>>
>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>>> >>>>> > [ Adding Simon to Cc ]
>>>> >>>>> >
>>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
>>>> writes:
>>>> >>>>> >
>>>> >>>>> > > Hey,
>>>> >>>>> > >
>>>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>>>> >>>>> > >
>>>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>>> >>>>> > > into xdp-cpumap-tc (
>>>> https://github.com/xdp-project/xdp-cpumap-tc ).
>>>> >>>>> > >
>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed
>>>> the entry point
>>>> >>>>> > > and packet parsing code to make use of the work already done
>>>> in
>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet,
>>>> no need to do
>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had
>>>> to pin them -
>>>> >>>>> > > otherwise the two tc instances don't properly share data.
>>>> >>>>> > >
>>>> >>>>>
>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed
>>>> on
>>> >>>>> the same CPU core on both ingress and egress. Otherwise, if a flow
>>>> may
>>>> >>>>> be processed by different cores on ingress and egress the per-CPU
>>>> maps
>>>> >>>>> will not really work reliably as each core will have a different
>>>> view
>>>> >>>>> on the state of the flow, if there's been a previous packet with a
>>>> >>>>> certain TSval from that flow etc.
>>>> >>>>>
>>>> >>>>> Furthermore, if a flow is always processed on the same core (on
>>>> both
>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>>>> >>>>> shared across all CPUs, it's just that each CPU gets its own
>>>> value. So
>>>> >>>>> all CPUs will then have their own data for each flow, but it's
>>>> only the
>>>> >>>>> CPU processing the flow that will have any relevant data for the
>>>> flow
>>>> >>>>> while the remaining CPUs will just have an empty state for that
>>>> flow.
>>>> >>>>> Under the same assumption that packets within the same flow are
>>>> always
>>>> >>>>> processed on the same core there should generally not be any
>>> >>>>> concurrency issues with having a global (non-per-CPU) map either, as
>>>> packets
>>>> >>>>> from the same flow cannot be processed concurrently then (and
>>>> thus no
>>>> >>>>> concurrent access to the same value in the map). I am however
>>>> still
>>>> >>>>> very unclear on if there's any considerable performance impact
>>>> between
>>>> >>>>> global and per-CPU map versions if the same key is not accessed
>>>> >>>>> concurrently.
>>>> >>>>>
>>>> >>>>> > > Right now, output
>>>> >>>>> > > is just stubbed - I've still got to port the perfmap output
>>>> code. Instead,
>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe,
>>>> so I can see
>>>> >>>>> > > roughly what the output would look like.
>>>> >>>>> > >
>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9
>>>> Gbits/sec on
>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the
>>>> middle). :-)
>>>> >>>>> >
>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>>>> >>>>> > overhead, then. What Simon found was that sending the data from
>>>> kernel
>>>> >>>>> > to userspace is one of the most expensive bits of epping, at
>>>> least when
>>>> >>>>> > the number of data points goes up (which is does as additional
>>>> flows are
>>>> >>>>> > added).
>>>> >>>>>
>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you
>>>> may get
>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms
>>>> of
>>>> >>>>> direct overhead from the tool itself, but also becomes demanding
>>>> for
>>>> >>>>> whatever you use all those RTT samples for (i.e. need to log,
>>>> parse,
>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with
>>>> that is
>>>> >>>>> of course to just apply some sort of sampling (the
>>>> -r/--rate-limit and
>>> >>>>> -R/--rtt-rate options).
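
A very cheap way to get that kind of sampling on the BPF side is to keep a
per-flow "last reported" timestamp and only emit a sample once per interval,
roughly what the fixed 1-second-per-stream reporting mentioned earlier amounts
to. Sketch only; the struct and field names are made up:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative flow record carrying a report-throttling timestamp. */
struct flow_rec {
	__u64 last_report_ns;
	/* ... RTT samples / stats would live here ... */
};

#define REPORT_INTERVAL_NS (1000ULL * 1000 * 1000)  /* 1 second per flow */

/* Returns 1 at most once per interval for a given flow. */
static __always_inline int should_report(struct flow_rec *fr)
{
	__u64 now = bpf_ktime_get_ns();

	if (now - fr->last_report_ns < REPORT_INTERVAL_NS)
		return 0;
	fr->last_report_ns = now;
	return 1;
}

char _license[] SEC("license") = "GPL";
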
>>>> >>>>> >
>>>> >>>>> > > So my question: how would you prefer to receive this data?
>>>> I'll have to
>>>> >>>>> > > write a daemon that provides userspace control (periodic
>>>> cleanup as well as
>>>> >>>>> > > reading the performance stream), so the world's kinda our
>>>> oyster. I can
>>>> >>>>> > > stick to Kathie's original format (and dump it to a named
>>>> pipe, perhaps?),
>>>> >>>>> > > a condensed format that only shows what you want to use, an
>>>> efficient
>>>> >>>>> > > binary format if you feel like parsing that...
>>>> >>>>> >
>>>> >>>>> > It would be great if we could combine efforts a bit here so we
>>>> don't
>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream"
>>>> epping and
>>>> >>>>> > whatever daemon you end up writing can agree on data format etc
>>>> that
>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>>> >>>>> >
>>>> >>>>> > Briefly what I've discussed before with Simon was to have the
>>>> ability to
>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a
>>>> userspace
>>>> >>>>> > utility periodically pull them out. What we discussed was doing
>>>> this
>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would
>>>> be that
>>>> >>>>> > userspace would populate the LPM map with the keys (prefixes)
>>>> they
>>>> >>>>> > wanted statistics for (in LibreQOS context that could be one
>>>> key per
>>>> >>>>> > customer, for instance). Epping would then do a map lookup into
>>>> the LPM,
>>>> >>>>> > and if it gets a match it would update the statistics in that
>>>> map entry
>>>> >>>>> > (keeping a histogram of latency values seen, basically).
>>>> Simon's PR
>>>> >>>>> > below uses this technique where userspace will "reset" the
>>>> histogram
>>>> >>>>> > every time it loads it by swapping out two different map
>>>> entries when it
>>>> >>>>> > does a read; this allows you to control the sampling rate from
>>>> >>>>> > userspace, and you'll just get the data since the last time you
>>>> polled.
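
As a rough illustration of that LPM-plus-histogram idea (not the code from the
WiP PR; the key/value layouts and sizes here are invented): userspace inserts
one entry per customer prefix, and the BPF side bumps a latency bucket on a
longest-prefix match.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define RTT_BUCKETS 32

struct lpm_v4_key {
	__u32 prefixlen;  /* LPM trie keys must start with the prefix length */
	__u32 addr;       /* IPv4 address, network byte order */
};

struct rtt_hist {
	__u64 bucket[RTT_BUCKETS];  /* e.g. 1 ms per bucket, last is overflow */
};

struct {
	__uint(type, BPF_MAP_TYPE_LPM_TRIE);
	__uint(max_entries, 16384);            /* roughly one entry per customer */
	__type(key, struct lpm_v4_key);
	__type(value, struct rtt_hist);
	__uint(map_flags, BPF_F_NO_PREALLOC);  /* required for LPM tries */
} rtt_by_prefix SEC(".maps");

/* Called once an RTT sample (in ms) has been computed for a packet. */
static __always_inline void record_rtt(__u32 saddr, __u32 rtt_ms)
{
	struct lpm_v4_key key = { .prefixlen = 32, .addr = saddr };
	struct rtt_hist *hist = bpf_map_lookup_elem(&rtt_by_prefix, &key);

	if (!hist)
		return;  /* address not covered by any configured prefix */

	if (rtt_ms >= RTT_BUCKETS)
		rtt_ms = RTT_BUCKETS - 1;
	__sync_fetch_and_add(&hist->bucket[rtt_ms], 1);
}

char _license[] SEC("license") = "GPL";

Userspace would then periodically pull these histograms out (or swap map
entries, as in the PR, to get just the data since the last poll).
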
>>>> >>>>>
>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan
>>>> going
>>>> >>>>> forward. I will just note that this PR (and all my other work with
>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or
>>>> less
>>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish
>>>> up a
>>>> >>>>> paper.
>>>> >>>>>
>>>> >>>>> > I was thinking that if we all can agree on the map format, then
>>>> your
>>>> >>>>> > polling daemon could be one userspace "client" for that, and
>>>> the epping
>>>> >>>>> > binary itself could be another; but we could keep compatibility
>>>> between
>>>> >>>>> > the two, so we don't duplicate effort.
>>>> >>>>> >
>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be
>>>> plugged
>>>> >>>>> > into the cpumap-tc code would be a good goal...
>>>> >>>>>
>>>> >>>>> Should probably do that...at some point. In general I think it's
>>>> a bit
>>>> >>>>> of an interesting problem to think about how to chain multiple
>>>> XDP/tc
>>> >>>>> programs together in an efficient way. Most XDP and tc programs
>>>> will do
>>>> >>>>> some amount of packet parsing and when you have many chained
>>>> programs
>>>> >>>>> parsing the same packets this obviously becomes a bit wasteful.
>>>> In the
>>>> >>>>> same time it would be nice if one didn't need to manually merge
>>>> >>>>> multiple programs together into a single one like this to get rid
>>>> of
>>>> >>>>> this duplicated parsing, or at least make that process of merging
>>>> those
>>>> >>>>> programs as simple as possible.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> > -Toke
>>>> >>>>> >
>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>> >>>>>
>>>> >>>>> När du skickar e-post till Karlstads universitet behandlar vi
>>>> dina personuppgifter<https://www.kau.se/gdpr>.
>>>> >>>>> When you send an e-mail to Karlstad University, we will process
>>>> your personal data<https://www.kau.se/en/gdpr>.
>>>> >>>>
>>>> >>>> _______________________________________________
>>>> >>>> LibreQoS mailing list
>>>> >>>> LibreQoS@lists.bufferbloat.net
>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Robert Chacón
>>>> >>> CEO | JackRabbit Wireless LLC
>>>> >
>>>> > _______________________________________________
>>>> > LibreQoS mailing list
>>>> > LibreQoS@lists.bufferbloat.net
>>>> > https://lists.bufferbloat.net/listinfo/libreqos
>>>>
>>>>
>>>>
>>>> --
>>>> This song goes out to all the folk that thought Stadia would work:
>>>>
>>>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>>>> Dave Täht CEO, TekLibre, LLC
>>>>
>>> _______________________________________________
>> LibreQoS mailing list
>> LibreQoS@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/libreqos
>>
>
>
> --
> Robert Chacón
> CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos
>

[-- Attachment #2: Type: text/html, Size: 32058 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 15:49                     ` dan
@ 2022-10-19 16:10                       ` Herbert Wolverson
  2022-10-19 16:13                         ` Dave Taht
  0 siblings, 1 reply; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-19 16:10 UTC (permalink / raw)
  Cc: libreqos

[-- Attachment #1: Type: text/plain, Size: 26464 bytes --]

That's true. The 12th gen does seem to have some "special" features...
makes for a nice writing platform
(this box is primarily my "write books and articles" machine). I'll be
doing a wider test on a more normal
platform, probably at the weekend (with real traffic, hence the delay -
have to find a time in which I
minimize disruption)

On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:

> Those 'efficiency' threads in Intel 12th gen should probably be addressed
> as well.  You can't turn them off in BIOS.
>
> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <
> libreqos@lists.bufferbloat.net> wrote:
>
>> Awesome work on this!
>> I suspect there should be a slight performance bump once Hyperthreading
>> is disabled and efficient power management is off.
>> Hyperthreading/SMT always messes with HTB performance when I leave it on.
>> Thank you for mentioning that - I now went ahead and added instructions on
>> disabling hyperthreading on the Wiki for new users.
>> Super promising results!
>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping.
>> So far in your VM setup it seems to be doing very well.
>>
>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <
>> libreqos@lists.bufferbloat.net> wrote:
>>
>>> Also, I forgot to mention that I *think* the current version has removed
>>> the requirement that the inbound
>>> and outbound classifiers be placed on the same CPU. I know interduo was
>>> particularly keen on packing
>>> upload into fewer cores. I'll add that to my list of things to test.
>>>
>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com>
>>> wrote:
>>>
>>>> I'll definitely take a look - that does look interesting. I don't have
>>>> X11 on any of my test VMs, but
>>>> it looks like it can work without the GUI.
>>>>
>>>> Thanks!
>>>>
>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>>>>
>>>>> could I coax you to adopt flent?
>>>>>
>>>>> apt-get install flent netperf irtt fping
>>>>>
>>>>> You sometimes have to compile netperf yourself with --enable-demo on
>>>>> some systems.
>>>>> There are a bunch of python libs needed for the gui, but only on the
>>>>> client.
>>>>>
>>>>> Then you can run a really gnarly test series and plot the results over
>>>>> time.
>>>>>
>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>>>>> the_server_name rrul # 110 other tests
>>>>>
>>>>>
>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>>>>> <libreqos@lists.bufferbloat.net> wrote:
>>>>> >
>>>>> > Hey,
>>>>> >
>>>>> > Testing the current version (
>>>>> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing
>>>>> better than I hoped. This build has shared (not per-cpu) maps, and a
>>>>> userspace daemon (xdp_pping) to extract and reset stats.
>>>>> >
>>>>> > My testing environment has grown a bit:
>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
>>>>> cpumap-pping-hackjob version of xdp-cpumap.
>>>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf
>>>>> server.
>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2.
>>>>> Hosts iperf client.
>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3.
>>>>> Hosts iperf client.
>>>>> >
>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM
>>>>> are on a virtual switch.
>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a
>>>>> different virtual switch.
>>>>> >
>>>>> > These are all on a host machine running Windows 11, a core i7 12th
>>>>> gen, 32 Gb RAM and fast SSD setup.
>>>>> >
>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>>>>> >
>>>>> > For this test, LibreQoS is configured:
>>>>> > * Two APs, each with 5gbit/s max.
>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about
>>>>> 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>>>>> > * Set to use Cake
>>>>> >
>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t
>>>>> 500 (for a long run). Running xdp_pping yields correct results:
>>>>> >
>>>>> > [
>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>>> > {}]
>>>>> >
>>>>> > Or when I waited a while to gather/reset:
>>>>> >
>>>>> > [
>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>>>>> > {}]
>>>>> >
>>>>> > The ShaperVM shows no errors, just periodic logging that it is
>>>>> recording data.  CPU is about 2-3% on two CPUs, zero on the others (as
>>>>> expected).
>>>>> >
>>>>> > After 500 seconds of continual iperfing, each client reported a
>>>>> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>>>>> >
>>>>> > So for smaller streams, I'd call this a success.
>>>>> >
>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>>>>> >
>>>>> > For this test, LibreQoS is configured:
>>>>> > * Two APs, each with 5gbit/s max.
>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s!
>>>>> Mapped to 1:5 and 2:5 respectively (separate CPUs).
>>>>> >
>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>>>>> >
>>>>> > xdp_pping shows results, too:
>>>>> >
>>>>> > [
>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>>>>> > {}]
>>>>> >
>>>>> > [
>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>>>>> > {}]
>>>>> >
>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>>>>> >
>>>>> > After 500 seconds of continual iperfing, each client reported a
>>>>> throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec and 226 GBytes.
>>>>> >
>>>>> > Maxing out HyperV like this is inducing a bit of latency (which is
>>>>> to be expected), but it's not bad. I also forgot to disable hyperthreading,
>>>>> and looking at the host performance it is sometimes running the second
>>>>> virtual CPU on an underpowered "fake" CPU.
>>>>> >
>>>>> > So for two large streams, I think we're doing pretty well also!
>>>>> >
>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>>>>> >
>>>>> > This test is designed to try and blow things up. It's the same as
>>>>> test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and
>>>>> 1:6.
>>>>> >
>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle.
>>>>> The pping stats start to show a bit of degradation in performance for
>>>>> pounding it so hard:
>>>>> >
>>>>> > [
>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>>>>> > {}]
>>>>> >
>>>>> > For whatever reason, it smoothed out over time:
>>>>> >
>>>>> > [
>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>>>>> > {}]
>>>>> >
>>>>> > Surprisingly (to me), I didn't encounter errors. Each client
>>>>> received 2.22 Gbit/s performance, over 129 Gbytes of data.
>>>>> >
>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>>>>> >
>>>>> > This test is also designed to break things. Same as test 3, but
>>>>> using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really
>>>>> tax the flow tracking. (Shorter time window because I really wanted to go
>>>>> and find coffee)
>>>>> >
>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping
>>>>> results show that this torture test is worsening performance, and there's
>>>>> always lots of samples in the buffer:
>>>>> >
>>>>> > [
>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>>>>> > {}]
>>>>> >
>>>>> > This test also ran better than I expected. You can definitely see
>>>>> some latency creeping in as I make the system work hard. Each VM showed
>>>>> around 2.4 Gbit/s in total performance at the end of the iperf session.
>>>>> There's definitely some latency creeping in, which is expected - but I'm
>>>>> not sure I expected quite that much.
>>>>> >
>>>>> > WHAT'S NEXT & CONCLUSION
>>>>> >
>>>>> > I noticed that I forgot to turn off efficient power management on my
>>>>> VMs and host, and left Hyperthreading on by mistake. So that hurts overall
>>>>> performance.
>>>>> >
>>>>> > The base system seems to be working pretty solidly, at least for
>>>>> small tests. Next up, I'll be removing extraneous debug reporting code,
>>>>> removing some code paths that don't do anything but report, and looking for
>>>>> any small optimization opportunities. I'll then re-run these tests. Once
>>>>> that's done, I hope to find a maintenance window on my WISP and try it with
>>>>> actual traffic.
>>>>> >
>>>>> > I also need to re-run these tests without the pping system to
>>>>> provide some before/after analysis.
>>>>> >
>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <
>>>>> herberticus@gmail.com> wrote:
>>>>> >>
>>>>> >> It's probably not entirely thread-safe right now (ran into some
>>>>> issues reading per_cpu maps back from userspace; hopefully, I'll get that
>>>>> figured out) - but the commits I just pushed have it basically working on
>>>>> single-stream testing. :-)
>>>>> >>
>>>>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives
>>>>> you per-connection RTT information in JSON:
>>>>> >>
>>>>> >> [
>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>>>>> >> {}]
>>>>> >>
>>>>> >> (With the extra {} because I'm not tracking the tail and haven't
>>>>> done comma removal). The tool also empties the various maps used to gather
>>>>> data, acting as a "reset" point. There's a max of 60 samples per queue, in
>>>>> a ringbuffer setup (so newest will start to overwrite the oldest).
>>>>> >>
>>>>> >> I'll start trying to test on a larger scale now.
>>>>> >>
>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
>>>>> robert.chacon@jackrabbitwireless.com> wrote:
>>>>> >>>
>>>>> >>> Hey Herbert,
>>>>> >>>
>>>>> >>> Fantastic work! Super exciting to see this coming together,
>>>>> especially so quickly.
>>>>> >>> I'll test it soon.
>>>>> >>> I understand and agree with your decision to omit certain features
>>>>> (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case.
>>>>> Like you said, in order to merge the functionality without a performance
>>>>> hit, merging them is sort of the only way right now. Otherwise there would
>>>>> be a lot of redundancy and lost throughput for an ISP's use. Though
>>>>> hopefully long term there will be a way to keep all projects working
>>>>> independently but interoperably with a plugin system of some kind.
>>>>> >>>
>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
>>>>> optimizations for high sub counts (8000+ subs) as well as stateful changes
>>>>> to the queue structure.
>>>>> >>> I'm working to set up a physical lab to test high throughput and
>>>>> high client count scenarios.
>>>>> >>> When testing beyond ~32,000 filters we get "no space left on
>>>>> device" from xdp-cpumap-tc, which I think relates to the bpf map size
>>>>> limitation you mentioned. Maybe in the coming months we can take a look at
>>>>> that.
>>>>> >>>
>>>>> >>> Anyway great work on the cpumap-pping program! Excited to see more
>>>>> on this.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> Robert
>>>>> >>>
>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
>>>>> libreqos@lists.bufferbloat.net> wrote:
>>>>> >>>>
>>>>> >>>> Hey,
>>>>> >>>>
>>>>> >>>> My current (unfinished) progress on this is now available here:
>>>>> https://github.com/thebracket/cpumap-pping-hackjob
>>>>> >>>>
>>>>> >>>> I mean it about the warnings, this isn't at all stable, debugged
>>>>> - and can't promise that it won't unleash the nasal demons
>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>>>> >>>>
>>>>> >>>> With that said, I'm pretty happy so far:
>>>>> >>>>
>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
>>>>> shunted onto a dedicated CPU. It has to run on both
>>>>> >>>>   the inbound and outbound classifiers, since otherwise it would
>>>>> only see half the conversation.
>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to
>>>>> the same interface; I do that anyway in BracketQoS. Not doing
>>>>> >>>>   that opens up a potential world of pain, since writes to the
>>>>> shared maps would require a locking scheme. Too much locking, and you lose
>>>>> all of the benefit of using multiple CPUs to begin with.
>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems
>>>>> I've worked with have lots of it.
>>>>> >>>> * I've been gradually removing features that I don't want for
>>>>> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do
>>>>> that.
>>>>> >>>> * Rate limiting is working, but I removed the requirement for a
>>>>> shared configuration provided from userland - so right now it's always set
>>>>> to report at 1 second intervals per stream.
>>>>> >>>>
>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and
>>>>> "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s
>>>>> max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>>>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and
>>>>> fast SSDs)
>>>>> >>>>
>>>>> >>>> Output currently consists of debug messages reading:
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222:
>>>>> bpf_trace_printk: (tc) Flow open event
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 374696
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466:
>>>>> bpf_trace_printk: (tc) Flow open event
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 247069
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048:
>>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080:
>>>>> bpf_trace_printk: (tc) Send flow event
>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714:
>>>>> bpf_trace_printk: (tc) Send flow event
>>>>> >>>>
>>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
>>>>> major/minor, allocated by the xdp-cpumap parent.
>>>>> >>>> I get pretty low latency between VMs; I'll set up a test with
>>>>> some real-world data very soon.
>>>>> >>>>
>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> Herbert
>>>>> >>>>
>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
>>>>> Simon.Sundberg@kau.se> wrote:
>>>>> >>>>>
>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of
>>>>> quick
>>>>> >>>>> notes.
>>>>> >>>>>
>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>>>> >>>>> > [ Adding Simon to Cc ]
>>>>> >>>>> >
>>>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
>>>>> writes:
>>>>> >>>>> >
>>>>> >>>>> > > Hey,
>>>>> >>>>> > >
>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>>>>> >>>>> > >
>>>>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h
>>>>> )
>>>>> >>>>> > > into xdp-cpumap-tc (
>>>>> https://github.com/xdp-project/xdp-cpumap-tc ).
>>>>> >>>>> > >
>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed
>>>>> the entry point
>>>>> >>>>> > > and packet parsing code to make use of the work already done
>>>>> in
>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the
>>>>> packet, no need to do
>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had
>>>>> to pin them -
>>>>> >>>>> > > otherwise the two tc instances don't properly share data.
>>>>> >>>>> > >
>>>>> >>>>>
>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is
>>>>> processed on
>>>>> >>>>> the same CPU core on both ingress and egress. Otherwise, if a
>>>>> flow may
>>>>> >>>>> be processed by different cores on ingress and egress the
>>>>> per-CPU maps
>>>>> >>>>> will not really work reliably as each core will have a different
>>>>> view
>>>>> >>>>> on the state of the flow, if there's been a previous packet with
>>>>> a
>>>>> >>>>> certain TSval from that flow etc.
>>>>> >>>>>
>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on
>>>>> both
>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>>>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its own
>>>>> value. So
>>>>> >>>>> all CPUs will then have their own data for each flow, but it's
>>>>> only the
>>>>> >>>>> CPU processing the flow that will have any relevant data for the
>>>>> flow
>>>>> >>>>> while the remaining CPUs will just have an empty state for that
>>>>> flow.
>>>>> >>>>> Under the same assumption that packets within the same flow are
>>>>> always
>>>>> >>>>> processed on the same core there should generally not be any
>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map either, as
>>>>> packets
>>>>> >>>>> from the same flow cannot be processed concurrently then (and
>>>>> thus no
>>>>> >>>>> concurrent access to the same value in the map). I am however
>>>>> still
>>>>> >>>>> very unclear on if there's any considerable performance impact
>>>>> between
>>>>> >>>>> global and per-CPU map versions if the same key is not accessed
>>>>> >>>>> concurrently.
>>>>> >>>>>
>>>>> >>>>> > > Right now, output
>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap output
>>>>> code. Instead,
>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe,
>>>>> so I can see
>>>>> >>>>> > > roughly what the output would look like.
>>>>> >>>>> > >
>>>>> >>>>> > > With debug enabled and just logging I'm now getting about
>>>>> 4.9 Gbits/sec on
>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the
>>>>> middle). :-)
>>>>> >>>>> >
>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>>>>> >>>>> > overhead, then. What Simon found was that sending the data
>>>>> from kernel
>>>>> >>>>> > to userspace is one of the most expensive bits of epping, at
>>>>> least when
>>>>> >>>>> > the number of data points goes up (which is does as additional
>>>>> flows are
>>>>> >>>>> > added).
>>>>> >>>>>
>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you
>>>>> may get
>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in
>>>>> terms of
>>>>> >>>>> direct overhead from the tool itself, but also becomes demanding
>>>>> for
>>>>> >>>>> whatever you use all those RTT samples for (i.e. need to log,
>>>>> parse,
>>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with
>>>>> that is
>>>>> >>>>> of course to just apply some sort of sampling (the
>>>>> -r/--rate-limit and
>>>>> >>>>> -R/--rtt-rate options).
>>>>> >>>>> >
>>>>> >>>>> > > So my question: how would you prefer to receive this data?
>>>>> I'll have to
>>>>> >>>>> > > write a daemon that provides userspace control (periodic
>>>>> cleanup as well as
>>>>> >>>>> > > reading the performance stream), so the world's kinda our
>>>>> oyster. I can
>>>>> >>>>> > > stick to Kathie's original format (and dump it to a named
>>>>> pipe, perhaps?),
>>>>> >>>>> > > a condensed format that only shows what you want to use, an
>>>>> efficient
>>>>> >>>>> > > binary format if you feel like parsing that...
>>>>> >>>>> >
>>>>> >>>>> > It would be great if we could combine efforts a bit here so we
>>>>> don't
>>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream"
>>>>> epping and
>>>>> >>>>> > whatever daemon you end up writing can agree on data format
>>>>> etc that
>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>>>> >>>>> >
>>>>> >>>>> > Briefly what I've discussed before with Simon was to have the
>>>>> ability to
>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a
>>>>> userspace
>>>>> >>>>> > utility periodically pull them out. What we discussed was
>>>>> doing this
>>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would
>>>>> be that
>>>>> >>>>> > userspace would populate the LPM map with the keys (prefixes)
>>>>> they
>>>>> >>>>> > wanted statistics for (in LibreQOS context that could be one
>>>>> key per
>>>>> >>>>> > customer, for instance). Epping would then do a map lookup
>>>>> into the LPM,
>>>>> >>>>> > and if it gets a match it would update the statistics in that
>>>>> map entry
>>>>> >>>>> > (keeping a histogram of latency values seen, basically).
>>>>> Simon's PR
>>>>> >>>>> > below uses this technique where userspace will "reset" the
>>>>> histogram
>>>>> >>>>> > every time it loads it by swapping out two different map
>>>>> entries when it
>>>>> >>>>> > does a read; this allows you to control the sampling rate from
>>>>> >>>>> > userspace, and you'll just get the data since the last time
>>>>> you polled.
>>>>> >>>>>
>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan
>>>>> going
>>>>> >>>>> forward. I will just note that this PR (and all my other work
>>>>> with
>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more
>>>>> or less
>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish
>>>>> up a
>>>>> >>>>> paper.
>>>>> >>>>>
>>>>> >>>>> > I was thinking that if we all can agree on the map format,
>>>>> then your
>>>>> >>>>> > polling daemon could be one userspace "client" for that, and
>>>>> the epping
>>>>> >>>>> > binary itself could be another; but we could keep
>>>>> compatibility between
>>>>> >>>>> > the two, so we don't duplicate effort.
>>>>> >>>>> >
>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be
>>>>> plugged
>>>>> >>>>> > into the cpumap-tc code would be a good goal...
>>>>> >>>>>
>>>>> >>>>> Should probably do that...at some point. In general I think it's
>>>>> a bit
>>>>> >>>>> of an interesting problem to think about how to chain multiple
>>>>> XDP/tc
>>>>> >>>>> programs together in an efficient way. Most XDP and tc programs
>>>>> will do
>>>>> >>>>> some amount of packet parsing and when you have many chained
>>>>> programs
>>>>> >>>>> parsing the same packets this obviously becomes a bit wasteful.
>>>>> At the
>>>>> >>>>> same time it would be nice if one didn't need to manually merge
>>>>> >>>>> multiple programs together into a single one like this to get
>>>>> rid of
>>>>> >>>>> this duplicated parsing, or at least make that process of
>>>>> merging those
>>>>> >>>>> programs as simple as possible.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> > -Toke
>>>>> >>>>> >
>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>>> >>>>>
>>>>> >>>>> När du skickar e-post till Karlstads universitet behandlar vi
>>>>> dina personuppgifter<https://www.kau.se/gdpr>.
>>>>> >>>>> When you send an e-mail to Karlstad University, we will process
>>>>> your personal data<https://www.kau.se/en/gdpr>.
>>>>> >>>>
>>>>> >>>> _______________________________________________
>>>>> >>>> LibreQoS mailing list
>>>>> >>>> LibreQoS@lists.bufferbloat.net
>>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Robert Chacón
>>>>> >>> CEO | JackRabbit Wireless LLC
>>>>> >
>>>>> > _______________________________________________
>>>>> > LibreQoS mailing list
>>>>> > LibreQoS@lists.bufferbloat.net
>>>>> > https://lists.bufferbloat.net/listinfo/libreqos
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> This song goes out to all the folk that thought Stadia would work:
>>>>>
>>>>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>>>>> Dave Täht CEO, TekLibre, LLC
>>>>>
>>>> _______________________________________________
>>> LibreQoS mailing list
>>> LibreQoS@lists.bufferbloat.net
>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>
>>
>>
>> --
>> Robert Chacón
>> CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>
>> _______________________________________________
>> LibreQoS mailing list
>> LibreQoS@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/libreqos
>>
>

[-- Attachment #2: Type: text/html, Size: 32830 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 16:10                       ` Herbert Wolverson
@ 2022-10-19 16:13                         ` Dave Taht
  2022-10-19 16:13                           ` Dave Taht
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Taht @ 2022-10-19 16:13 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: libreqos

flent outputs a flent.gz file that I can parse and plot 20 different
ways. Also the graphing tools work on OS X.

On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS
<libreqos@lists.bufferbloat.net> wrote:
>
> That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform
> (this box is primarily my "write books and articles" machine). I'll be doing a wider test on a more normal
> platform, probably at the weekend (with real traffic, hence the delay - have to find a time in which I
> minimize disruption)
>
> On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:
>>
>> Those 'efficiency' threads in Intel 12th gen should probably be addressed as well.  You can't turn them off in BIOS.
>>
>> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>>>
>>> Awesome work on this!
>>> I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off.
>>> Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I now went ahead and added instructions on disabling hyperthreading on the Wiki for new users.
>>> Super promising results!
>>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
>>>
>>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>>>>
>>>> Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound
>>>> and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing
>>>> upload into fewer cores. I'll add that to my list of things to test.
>>>>
>>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com> wrote:
>>>>>
>>>>> I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but
>>>>> it looks like it can work without the GUI.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>>>>>>
>>>>>> could I coax you to adopt flent?
>>>>>>
>>>>>> apt-get install flent netperf irtt fping
>>>>>>
>>>>>> You sometimes have to compile netperf yourself with --enable-demo on
>>>>>> some systems.
>>>>>> There are a bunch of python libs needed for the gui, but only on the client.
>>>>>>
>>>>>> Then you can run a really gnarly test series and plot the results over time.
>>>>>>
>>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>>>>>> the_server_name rrul # 110 other tests
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>>>>>> <libreqos@lists.bufferbloat.net> wrote:
>>>>>> >
>>>>>> > Hey,
>>>>>> >
>>>>>> > Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
>>>>>> >
>>>>>> > My testing environment has grown a bit:
>>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
>>>>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf server.
>>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2. Hosts iperf client.
>>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3. Hosts iperf client.
>>>>>> >
>>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
>>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
>>>>>> >
>>>>>> > These are all on a host machine running Windows 11, a core i7 12th gen, 32 Gb RAM and fast SSD setup.
>>>>>> >
>>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>>>>>> >
>>>>>> > For this test, LibreQoS is configured:
>>>>>> > * Two APs, each with 5gbit/s max.
>>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>>>>>> > * Set to use Cake
>>>>>> >
>>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>>>> > {}]
>>>>>> >
>>>>>> > Or when I waited a while to gather/reset:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>>>>>> > {}]
>>>>>> >
>>>>>> > The ShaperVM shows no errors, just periodic logging that it is recording data.  CPU is about 2-3% on two CPUs, zero on the others (as expected).
>>>>>> >
>>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>>>>>> >
>>>>>> > So for smaller streams, I'd call this a success.
>>>>>> >
>>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>>>>>> >
>>>>>> > For this test, LibreQoS is configured:
>>>>>> > * Two APs, each with 5gbit/s max.
>>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>>>>>> >
>>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>>>>>> >
>>>>>> > xdp_pping shows results, too:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>>>>>> > {}]
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>>>>>> > {}]
>>>>>> >
>>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>>>>>> >
>>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec and 226 GBytes.
>>>>>> >
>>>>>> > Maxing out HyperV like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
>>>>>> >
>>>>>> > So for two large streams, I think we're doing pretty well also!
>>>>>> >
>>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>>>>>> >
>>>>>> > This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>>>>>> >
>>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The pping stats start to show a bit of degradation in performance for pounding it so hard:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>>>>>> > {}]
>>>>>> >
>>>>>> > For whatever reason, it smoothed out over time:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>>>>>> > {}]
>>>>>> >
>>>>>> > Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s performance, over 129 Gbytes of data.
>>>>>> >
>>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>>>>>> >
>>>>>> > This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee)
>>>>>> >
>>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there's always lots of samples in the buffer:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>>>>>> > {}]
>>>>>> >
>>>>>> > This test also ran better than I expected. Each VM showed around 2.4 Gbit/s in total performance at the end of the iperf session. There's definitely some latency creeping in as I make the system work this hard - which is expected - but I'm not sure I expected quite that much.
>>>>>> >
>>>>>> > WHAT'S NEXT & CONCLUSION
>>>>>> >
>>>>>> > I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
>>>>>> >
>>>>>> > The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
>>>>>> >
>>>>>> > I also need to re-run these tests without the pping system to provide some before/after analysis.
>>>>>> >
>>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com> wrote:
>>>>>> >>
>>>>>> >> It's probably not entirely thread-safe right now (ran into some issues reading per_cpu maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working on single-stream testing. :-)
>>>>>> >>
>>>>>> >> Setup cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
>>>>>> >>
>>>>>> >> [
>>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>>>>>> >> {}]
>>>>>> >>
>>>>>> >> (With the extra {} because I'm not tracking the tail and haven't done comma removal). The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
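>>>>>> >>
>>>>>> >> For illustration only, a per-queue ring along these lines would behave that way (a minimal sketch; the real cpumap-pping structs and field names differ):
>>>>>> >>
>>>>>> >>   #include <linux/types.h>
>>>>>> >>
>>>>>> >>   #define MAX_SAMPLES 60
>>>>>> >>
>>>>>> >>   struct rtt_ring {
>>>>>> >>       __u32 rtt_ms[MAX_SAMPLES]; /* most recent RTT samples */
>>>>>> >>       __u32 next;                /* write index, wraps around */
>>>>>> >>   };
>>>>>> >>
>>>>>> >>   static void ring_push(struct rtt_ring *r, __u32 rtt_ms)
>>>>>> >>   {
>>>>>> >>       /* newest sample overwrites the oldest once 60 are stored */
>>>>>> >>       r->rtt_ms[r->next % MAX_SAMPLES] = rtt_ms;
>>>>>> >>       r->next++;
>>>>>> >>   }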
>>>>>> >>
>>>>>> >> I'll start trying to test on a larger scale now.
>>>>>> >>
>>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <robert.chacon@jackrabbitwireless.com> wrote:
>>>>>> >>>
>>>>>> >>> Hey Herbert,
>>>>>> >>>
>>>>>> >>> Fantastic work! Super exciting to see this coming together, especially so quickly.
>>>>>> >>> I'll test it soon.
>>>>>> >>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. Like you said, merging the projects is sort of the only way to get the combined functionality without a performance hit right now. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all projects working independently but interoperably with a plugin system of some kind.
>>>>>> >>>
>>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
>>>>>> >>> I'm working to set up a physical lab to test high throughput and high client count scenarios.
>>>>>> >>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the bpf map size limitation you mentioned. Maybe in the coming months we can take a look at that.
>>>>>> >>>
>>>>>> >>> Anyway great work on the cpumap-pping program! Excited to see more on this.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> Robert
>>>>>> >>>
>>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>>>>>> >>>>
>>>>>> >>>> Hey,
>>>>>> >>>>
>>>>>> >>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
>>>>>> >>>>
>>>>>> >>>> I mean it about the warnings, this isn't at all stable, debugged - and can't promise that it won't unleash the nasal demons
>>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>>>>> >>>>
>>>>>> >>>> With that said, I'm pretty happy so far:
>>>>>> >>>>
>>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both
>>>>>> >>>>   the inbound and outbound classifiers, since otherwise it would only see half the conversation.
>>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing
>>>>>> >>>>   that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
>>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
>>>>>> >>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
>>>>>> >>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.
>>>>>> >>>>
>>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>>>>> >>>> test PC (the host is a 12th-gen Core i7 with 12 cores, 64 GB RAM and fast SSDs)
>>>>>> >>>>
>>>>>> >>>> Output currently consists of debug messages reading:
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
>>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event
>>>>>> >>>>
>>>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle major/minor, allocated by the xdp-cpumap parent.
>>>>>> >>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
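>>>>>> >>>>
>>>>>> >>>> (For anyone following along: splitting a tc handle into that major/minor pair is just bit masking. A tiny sketch using the standard kernel macros - the handle value here is only an example:)
>>>>>> >>>>
>>>>>> >>>>   #include <stdio.h>
>>>>>> >>>>   #include <linux/types.h>
>>>>>> >>>>   #include <linux/pkt_sched.h>   /* TC_H_MAJ(), TC_H_MIN() */
>>>>>> >>>>
>>>>>> >>>>   int main(void)
>>>>>> >>>>   {
>>>>>> >>>>       __u32 handle = 0x00010005;  /* major 1, minor 5 -> the "1:5" seen in xdp_pping output */
>>>>>> >>>>       printf("%x:%x\n", TC_H_MAJ(handle) >> 16, TC_H_MIN(handle));
>>>>>> >>>>       return 0;
>>>>>> >>>>   }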
>>>>>> >>>>
>>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>> Herbert
>>>>>> >>>>
>>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>>>>>> >>>>> notes.
>>>>>> >>>>>
>>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>>>>> >>>>> > [ Adding Simon to Cc ]
>>>>>> >>>>> >
>>>>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>>>>>> >>>>> >
>>>>>> >>>>> > > Hey,
>>>>>> >>>>> > >
>>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>>>>>> >>>>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>>>>> >>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>>>>>> >>>>> > >
>>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the entry point
>>>>>> >>>>> > > and packet parsing code to make use of the work already done in
>>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
>>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin them -
>>>>>> >>>>> > > otherwise the two tc instances don't properly share data.
>>>>>> >>>>> > >
>>>>>> >>>>>
>>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>>>>>> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow may
>>>>>> >>>>> be processed by different cores on ingress and egress the per-CPU maps
>>>>>> >>>>> will not really work reliably as each core will have a different view
>>>>>> >>>>> on the state of the flow, if there's been a previous packet with a
>>>>>> >>>>> certain TSval from that flow etc.
>>>>>> >>>>>
>>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on both
>>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>>>>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its own value. So
>>>>>> >>>>> all CPUs will then have their own data for each flow, but it's only the
>>>>>> >>>>> CPU processing the flow that will have any relevant data for the flow
>>>>>> >>>>> while the remaining CPUs will just have an empty state for that flow.
>>>>>> >>>>> Under the same assumption that packets within the same flow are always
>>>>>> >>>>> processed on the same core there should generally not be any
>>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map either, as packets
>>>>>> >>>>> from the same flow cannot be processed concurrently then (and thus no
>>>>>> >>>>> concurrent access to the same value in the map). I am however still
>>>>>> >>>>> very unclear on if there's any considerable performance impact between
>>>>>> >>>>> global and per-CPU map versions if the same key is not accessed
>>>>>> >>>>> concurrently.
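>>>>>> >>>>>
>>>>>> >>>>> To make the comparison concrete, the two layouts would be declared roughly like this (a sketch with made-up key/value structs, not the actual pping maps):
>>>>>> >>>>>
>>>>>> >>>>>   #include <linux/bpf.h>
>>>>>> >>>>>   #include <bpf/bpf_helpers.h>
>>>>>> >>>>>
>>>>>> >>>>>   struct flow_key   { __u32 saddr, daddr; __u16 sport, dport; }; /* placeholder */
>>>>>> >>>>>   struct flow_state { __u64 last_seen_ns; __u32 min_rtt; };      /* placeholder */
>>>>>> >>>>>
>>>>>> >>>>>   /* Per-CPU: every CPU gets its own copy of the value for each key. */
>>>>>> >>>>>   struct {
>>>>>> >>>>>       __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
>>>>>> >>>>>       __uint(max_entries, 16384);
>>>>>> >>>>>       __type(key, struct flow_key);
>>>>>> >>>>>       __type(value, struct flow_state);
>>>>>> >>>>>   } flow_state_percpu SEC(".maps");
>>>>>> >>>>>
>>>>>> >>>>>   /* Global: one shared value per key; safe without locking only if all
>>>>>> >>>>>    * packets of a flow are steered to the same CPU, as described above. */
>>>>>> >>>>>   struct {
>>>>>> >>>>>       __uint(type, BPF_MAP_TYPE_HASH);
>>>>>> >>>>>       __uint(max_entries, 16384);
>>>>>> >>>>>       __type(key, struct flow_key);
>>>>>> >>>>>       __type(value, struct flow_state);
>>>>>> >>>>>   } flow_state_global SEC(".maps");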
>>>>>> >>>>>
>>>>>> >>>>> > > Right now, output
>>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap output code. Instead,
>>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
>>>>>> >>>>> > > roughly what the output would look like.
>>>>>> >>>>> > >
>>>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
>>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>>>>>> >>>>> >
>>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>>>>>> >>>>> > overhead, then. What Simon found was that sending the data from kernel
>>>>>> >>>>> > to userspace is one of the most expensive bits of epping, at least when
>>>>>> >>>>> > the number of data points goes up (which it does as additional flows are
>>>>>> >>>>> > added).
>>>>>> >>>>>
>>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may get
>>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>>>>>> >>>>> direct overhead from the tool itself, but also becomes demanding for
>>>>>> >>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
>>>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with that is
>>>>>> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit and
>>>>>> >>>>> -R/--rtt-rate options).
>>>>>> >>>>> >
>>>>>> >>>>> > > So my question: how would you prefer to receive this data? I'll have to
>>>>>> >>>>> > > write a daemon that provides userspace control (periodic cleanup as well as
>>>>>> >>>>> > > reading the performance stream), so the world's kinda our oyster. I can
>>>>>> >>>>> > > stick to Kathie's original format (and dump it to a named pipe, perhaps?),
>>>>>> >>>>> > > a condensed format that only shows what you want to use, an efficient
>>>>>> >>>>> > > binary format if you feel like parsing that...
>>>>>> >>>>> >
>>>>>> >>>>> > It would be great if we could combine efforts a bit here so we don't
>>>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>>>>>> >>>>> > whatever daemon you end up writing can agree on data format etc that
>>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>>>>> >>>>> >
>>>>>> >>>>> > Briefly what I've discussed before with Simon was to have the ability to
>>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>>>>>> >>>>> > utility periodically pull them out. What we discussed was doing this
>>>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would be that
>>>>>> >>>>> > userspace would populate the LPM map with the keys (prefixes) they
>>>>>> >>>>> > wanted statistics for (in LibreQOS context that could be one key per
>>>>>> >>>>> > customer, for instance). Epping would then do a map lookup into the LPM,
>>>>>> >>>>> > and if it gets a match it would update the statistics in that map entry
>>>>>> >>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
>>>>>> >>>>> > below uses this technique where userspace will "reset" the histogram
>>>>>> >>>>> > every time it loads it by swapping out two different map entries when it
>>>>>> >>>>> > does a read; this allows you to control the sampling rate from
>>>>>> >>>>> > userspace, and you'll just get the data since the last time you polled.
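>>>>>> >>>>> >
>>>>>> >>>>> > In rough code, the BPF side of that idea could look like the sketch below (illustrative only - the map name, bucket scheme and sizes are made up, and the WiP PR does it differently):
>>>>>> >>>>> >
>>>>>> >>>>> >   #include <linux/bpf.h>
>>>>>> >>>>> >   #include <bpf/bpf_helpers.h>
>>>>>> >>>>> >
>>>>>> >>>>> >   #define RTT_BUCKETS 32
>>>>>> >>>>> >
>>>>>> >>>>> >   struct lpm_v4_key {
>>>>>> >>>>> >       __u32 prefixlen;   /* LPM_TRIE keys must start with the prefix length */
>>>>>> >>>>> >       __u32 addr;        /* IPv4 address, network byte order */
>>>>>> >>>>> >   };
>>>>>> >>>>> >
>>>>>> >>>>> >   struct rtt_hist { __u64 bucket[RTT_BUCKETS]; };  /* log2-spaced buckets */
>>>>>> >>>>> >
>>>>>> >>>>> >   struct {
>>>>>> >>>>> >       __uint(type, BPF_MAP_TYPE_LPM_TRIE);
>>>>>> >>>>> >       __uint(map_flags, BPF_F_NO_PREALLOC);  /* required for LPM_TRIE */
>>>>>> >>>>> >       __uint(max_entries, 65536);
>>>>>> >>>>> >       __type(key, struct lpm_v4_key);
>>>>>> >>>>> >       __type(value, struct rtt_hist);
>>>>>> >>>>> >   } rtt_by_prefix SEC(".maps");  /* userspace fills in one prefix per customer */
>>>>>> >>>>> >
>>>>>> >>>>> >   static __always_inline void record_rtt(__u32 addr, __u64 rtt_us)
>>>>>> >>>>> >   {
>>>>>> >>>>> >       struct lpm_v4_key key = { .prefixlen = 32, .addr = addr };
>>>>>> >>>>> >       struct rtt_hist *hist = bpf_map_lookup_elem(&rtt_by_prefix, &key);
>>>>>> >>>>> >       if (!hist)
>>>>>> >>>>> >           return;            /* no customer prefix matched */
>>>>>> >>>>> >
>>>>>> >>>>> >       __u32 b = 0;           /* crude log2 bucketing of the RTT */
>>>>>> >>>>> >       while (rtt_us > 1 && b < RTT_BUCKETS - 1) {
>>>>>> >>>>> >           rtt_us >>= 1;
>>>>>> >>>>> >           b++;
>>>>>> >>>>> >       }
>>>>>> >>>>> >       __sync_fetch_and_add(&hist->bucket[b], 1);
>>>>>> >>>>> >   }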
>>>>>> >>>>>
>>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan going
>>>>>> >>>>> forward. I will just note that this PR (and all my other work with
>>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish up a
>>>>>> >>>>> paper.
>>>>>> >>>>>
>>>>>> >>>>> > I was thinking that if we all can agree on the map format, then your
>>>>>> >>>>> > polling daemon could be one userspace "client" for that, and the epping
>>>>>> >>>>> > binary itself could be another; but we could keep compatibility between
>>>>>> >>>>> > the two, so we don't duplicate effort.
>>>>>> >>>>> >
>>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be plugged
>>>>>> >>>>> > into the cpumap-tc code would be a good goal...
>>>>>> >>>>>
>>>>>> >>>>> Should probably do that...at some point. In general I think it's a bit
>>>>>> >>>>> of an interesting problem to think about how to chain multiple XDP/tc
>>>>>> >>>>> programs together in an efficient way. Most XDP and tc programs will do
>>>>>> >>>>> some amount of packet parsing and when you have many chained programs
>>>>>> >>>>> parsing the same packets this obviously becomes a bit wasteful. At the
>>>>>> >>>>> same time it would be nice if one didn't need to manually merge
>>>>>> >>>>> multiple programs together into a single one like this to get rid of
>>>>>> >>>>> this duplicated parsing, or at least make that process of merging those
>>>>>> >>>>> programs as simple as possible.
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> > -Toke
>>>>>> >>>>> >
>>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>> _______________________________________________
>>>>>> >>>> LibreQoS mailing list
>>>>>> >>>> LibreQoS@lists.bufferbloat.net
>>>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> Robert Chacón
>>>>>> >>> CEO | JackRabbit Wireless LLC
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > LibreQoS mailing list
>>>>>> > LibreQoS@lists.bufferbloat.net
>>>>>> > https://lists.bufferbloat.net/listinfo/libreqos
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> This song goes out to all the folk that thought Stadia would work:
>>>>>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>>>>>> Dave Täht CEO, TekLibre, LLC
>>>>
>>>> _______________________________________________
>>>> LibreQoS mailing list
>>>> LibreQoS@lists.bufferbloat.net
>>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>
>>>
>>>
>>> --
>>> Robert Chacón
>>> CEO | JackRabbit Wireless LLC
>>> _______________________________________________
>>> LibreQoS mailing list
>>> LibreQoS@lists.bufferbloat.net
>>> https://lists.bufferbloat.net/listinfo/libreqos
>
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos



-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 16:13                         ` Dave Taht
@ 2022-10-19 16:13                           ` Dave Taht
  2022-10-22 14:32                             ` Herbert Wolverson
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Taht @ 2022-10-19 16:13 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: libreqos

PS - today's (free) p99 conference is *REALLY AWESOME*. https://www.p99conf.io/

On Wed, Oct 19, 2022 at 9:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>
> flent outputs a flent.gz file that I can parse and plot 20 different
> ways. Also the graphing tools work on osx
>
> On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS
> <libreqos@lists.bufferbloat.net> wrote:
> >
> > That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform
> > (this box is primarily my "write books and articles" machine). I'll be doing a wider test on a more normal
> > platform, probably at the weekend (with real traffic, hence the delay - have to find a time in which I
> > minimize disruption)
> >
> > On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:
> >>
> >> Those 'efficiency' threads in Intel 12th gen should probably be addressed as well.  You can't turn them off in BIOS.
> >>
> >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
> >>>
> >>> Awesome work on this!
> >>> I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off.
> >>> Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I now went ahead and added instructions on disabling hyperthreading on the Wiki for new users.
> >>> Super promising results!
> >>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
> >>>
> >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
> >>>>
> >>>> Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound
> >>>> and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing
> >>>> upload into fewer cores. I'll add that to my list of things to test.
> >>>>
> >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com> wrote:
> >>>>>
> >>>>> I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but
> >>>>> it looks like it can work without the GUI.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
> >>>>>>
> >>>>>> could I coax you to adopt flent?
> >>>>>>
> >>>>>> apt-get install flent netperf irtt fping
> >>>>>>
> >>>>>> You sometimes have to compile netperf yourself with --enable-demo on
> >>>>>> some systems.
> >>>>>> There are a bunch of Python libs needed for the GUI, but only on the client.
> >>>>>>
> >>>>>> Then you can run a really gnarly test series and plot the results over time.
> >>>>>>
> >>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
> >>>>>> the_server_name rrul # 110 other tests
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
> >>>>>> <libreqos@lists.bufferbloat.net> wrote:
> >>>>>> >
> >>>>>> > Hey,
> >>>>>> >
> >>>>>> > Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
> >>>>>> >
> >>>>>> > My testing environment has grown a bit:
> >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
> >>>>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf server.
> >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2. Hosts iperf client.
> >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3. Hosts iperf client.
> >>>>>> >
> >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
> >>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
> >>>>>> >
> >>>>>> > These are all on a host machine running Windows 11: a 12th-gen Core i7, 32 GB RAM, and a fast SSD setup.
> >>>>>> >
> >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
> >>>>>> >
> >>>>>> > For this test, LibreQoS is configured:
> >>>>>> > * Two APs, each with 5gbit/s max.
> >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> >>>>>> > * Set to use Cake
> >>>>>> >
> >>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > Or when I waited a while to gather/reset:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > The ShaperVM shows no errors, just periodic logging that it is recording data.  CPU is about 2-3% on two CPUs, zero on the others (as expected).
> >>>>>> >
> >>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
> >>>>>> >
> >>>>>> > So for smaller streams, I'd call this a success.
> >>>>>> >
> >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
> >>>>>> >
> >>>>>> > For this test, LibreQoS is configured:
> >>>>>> > * Two APs, each with 5gb/s max.
> >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
> >>>>>> >
> >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
> >>>>>> >
> >>>>>> > xdp_pping shows results, too:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
> >>>>>> >
> >>>>>> > After 500 seconds of continual iperfing, the two clients reported throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes) respectively.
> >>>>>> >
> >>>>>> > Maxing out HyperV like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
> >>>>>> >
> >>>>>> > So for two large streams, I think we're doing pretty well also!
> >>>>>> >
> >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
> >>>>>> >
> >>>>>> > This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
> >>>>>> >
> >>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The pping stats start to show a bit of degradation in performance for pounding it so hard:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > For whatever reason, it smoothed out over time:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s performance, over 129 Gbytes of data.
> >>>>>> >
> >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
> >>>>>> >
> >>>>>> > This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee)
> >>>>>> >
> >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there's always lots of samples in the buffer:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
> >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > This test also ran better than I expected. Each VM showed around 2.4 Gbit/s in total performance at the end of the iperf session. There's definitely some latency creeping in as I make the system work this hard - which is expected - but I'm not sure I expected quite that much.
> >>>>>> >
> >>>>>> > WHAT'S NEXT & CONCLUSION
> >>>>>> >
> >>>>>> > I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
> >>>>>> >
> >>>>>> > The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
> >>>>>> >
> >>>>>> > I also need to re-run these tests without the pping system to provide some before/after analysis.
> >>>>>> >
> >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com> wrote:
> >>>>>> >>
> >>>>>> >> It's probably not entirely thread-safe right now (ran into some issues reading per_cpu maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working on single-stream testing. :-)
> >>>>>> >>
> >>>>>> >> Setup cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
> >>>>>> >>
> >>>>>> >> [
> >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
> >>>>>> >> {}]
> >>>>>> >>
> >>>>>> >> (With the extra {} because I'm not tracking the tail and haven't done comma removal). The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
> >>>>>> >>
> >>>>>> >> I'll start trying to test on a larger scale now.
> >>>>>> >>
> >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <robert.chacon@jackrabbitwireless.com> wrote:
> >>>>>> >>>
> >>>>>> >>> Hey Herbert,
> >>>>>> >>>
> >>>>>> >>> Fantastic work! Super exciting to see this coming together, especially so quickly.
> >>>>>> >>> I'll test it soon.
> >>>>>> >>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. Like you said, merging the projects is sort of the only way to get the combined functionality without a performance hit right now. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all projects working independently but interoperably with a plugin system of some kind.
> >>>>>> >>>
> >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
> >>>>>> >>> I'm working to set up a physical lab to test high throughput and high client count scenarios.
> >>>>>> >>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the bpf map size limitation you mentioned. Maybe in the coming months we can take a look at that.
> >>>>>> >>>
> >>>>>> >>> Anyway great work on the cpumap-pping program! Excited to see more on this.
> >>>>>> >>>
> >>>>>> >>> Thanks,
> >>>>>> >>> Robert
> >>>>>> >>>
> >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
> >>>>>> >>>>
> >>>>>> >>>> Hey,
> >>>>>> >>>>
> >>>>>> >>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
> >>>>>> >>>>
> >>>>>> >>>> I mean it about the warnings, this isn't at all stable, debugged - and can't promise that it won't unleash the nasal demons
> >>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
> >>>>>> >>>>
> >>>>>> >>>> With that said, I'm pretty happy so far:
> >>>>>> >>>>
> >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both
> >>>>>> >>>>   the inbound and outbound classifiers, since otherwise it would only see half the conversation.
> >>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing
> >>>>>> >>>>   that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
> >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
> >>>>>> >>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
> >>>>>> >>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.
> >>>>>> >>>>
> >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
> >>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
> >>>>>> >>>> test PC (the host is a 12th-gen Core i7 with 12 cores, 64 GB RAM and fast SSDs)
> >>>>>> >>>>
> >>>>>> >>>> Output currently consists of debug messages reading:
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
> >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event
> >>>>>> >>>>
> >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle major/minor, allocated by the xdp-cpumap parent.
> >>>>>> >>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
> >>>>>> >>>>
> >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
> >>>>>> >>>>
> >>>>>> >>>> Thanks,
> >>>>>> >>>> Herbert
> >>>>>> >>>>
> >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se> wrote:
> >>>>>> >>>>>
> >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
> >>>>>> >>>>> notes.
> >>>>>> >>>>>
> >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> >>>>>> >>>>> > [ Adding Simon to Cc ]
> >>>>>> >>>>> >
> >>>>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
> >>>>>> >>>>> >
> >>>>>> >>>>> > > Hey,
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping (
> >>>>>> >>>>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> >>>>>> >>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the entry point
> >>>>>> >>>>> > > and packet parsing code to make use of the work already done in
> >>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
> >>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin them -
> >>>>>> >>>>> > > otherwise the two tc instances don't properly share data.
> >>>>>> >>>>> > >
> >>>>>> >>>>>
> >>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
> >>>>>> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow may
> >>>>>> >>>>> be processed by different cores on ingress and egress the per-CPU maps
> >>>>>> >>>>> will not really work reliably as each core will have a different view
> >>>>>> >>>>> on the state of the flow, if there's been a previous packet with a
> >>>>>> >>>>> certain TSval from that flow etc.
> >>>>>> >>>>>
> >>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on both
> >>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
> >>>>>> >>>>> memory. From my understanding the keys for per-CPU maps are still
> >>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its own value. So
> >>>>>> >>>>> all CPUs will then have their own data for each flow, but it's only the
> >>>>>> >>>>> CPU processing the flow that will have any relevant data for the flow
> >>>>>> >>>>> while the remaining CPUs will just have an empty state for that flow.
> >>>>>> >>>>> Under the same assumption that packets within the same flow are always
> >>>>>> >>>>> processed on the same core there should generally not be any
> >>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map either, as packets
> >>>>>> >>>>> from the same flow cannot be processed concurrently then (and thus no
> >>>>>> >>>>> concurrent access to the same value in the map). I am however still
> >>>>>> >>>>> very unclear on if there's any considerable performance impact between
> >>>>>> >>>>> global and per-CPU map versions if the same key is not accessed
> >>>>>> >>>>> concurrently.
> >>>>>> >>>>>
> >>>>>> >>>>> > > Right now, output
> >>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap output code. Instead,
> >>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
> >>>>>> >>>>> > > roughly what the output would look like.
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
> >>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the middle). :-)
> >>>>>> >>>>> >
> >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
> >>>>>> >>>>> > overhead, then. What Simon found was that sending the data from kernel
> >>>>>> >>>>> > to userspace is one of the most expensive bits of epping, at least when
> >>>>>> >>>>> > the number of data points goes up (which it does as additional flows are
> >>>>>> >>>>> > added).
> >>>>>> >>>>>
> >>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may get
> >>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
> >>>>>> >>>>> direct overhead from the tool itself, but also becomes demanding for
> >>>>>> >>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
> >>>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with that is
> >>>>>> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit and
> >>>>>> >>>>> -R/--rtt-rate options).
> >>>>>> >>>>> >
> >>>>>> >>>>> > > So my question: how would you prefer to receive this data? I'll have to
> >>>>>> >>>>> > > write a daemon that provides userspace control (periodic cleanup as well as
> >>>>>> >>>>> > > reading the performance stream), so the world's kinda our oyster. I can
> >>>>>> >>>>> > > stick to Kathie's original format (and dump it to a named pipe, perhaps?),
> >>>>>> >>>>> > > a condensed format that only shows what you want to use, an efficient
> >>>>>> >>>>> > > binary format if you feel like parsing that...
> >>>>>> >>>>> >
> >>>>>> >>>>> > It would be great if we could combine efforts a bit here so we don't
> >>>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
> >>>>>> >>>>> > whatever daemon you end up writing can agree on data format etc that
> >>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
> >>>>>> >>>>> >
> >>>>>> >>>>> > Briefly what I've discussed before with Simon was to have the ability to
> >>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
> >>>>>> >>>>> > utility periodically pull them out. What we discussed was doing this
> >>>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would be that
> >>>>>> >>>>> > userspace would populate the LPM map with the keys (prefixes) they
> >>>>>> >>>>> > wanted statistics for (in LibreQOS context that could be one key per
> >>>>>> >>>>> > customer, for instance). Epping would then do a map lookup into the LPM,
> >>>>>> >>>>> > and if it gets a match it would update the statistics in that map entry
> >>>>>> >>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
> >>>>>> >>>>> > below uses this technique where userspace will "reset" the histogram
> >>>>>> >>>>> > every time it loads it by swapping out two different map entries when it
> >>>>>> >>>>> > does a read; this allows you to control the sampling rate from
> >>>>>> >>>>> > userspace, and you'll just get the data since the last time you polled.
> >>>>>> >>>>>
> >>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan going
> >>>>>> >>>>> forward. I will just note that this PR (and all my other work with
> >>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
> >>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish up a
> >>>>>> >>>>> paper.
> >>>>>> >>>>>
> >>>>>> >>>>> > I was thinking that if we all can agree on the map format, then your
> >>>>>> >>>>> > polling daemon could be one userspace "client" for that, and the epping
> >>>>>> >>>>> > binary itself could be another; but we could keep compatibility between
> >>>>>> >>>>> > the two, so we don't duplicate effort.
> >>>>>> >>>>> >
> >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be plugged
> >>>>>> >>>>> > into the cpumap-tc code would be a good goal...
> >>>>>> >>>>>
> >>>>>> >>>>> Should probably do that...at some point. In general I think it's a bit
> >>>>>> >>>>> of an interesting problem to think about how to chain multiple XDP/tc
> >>>>>> >>>>> programs together in an efficient way. Most XDP and tc programs will do
> >>>>>> >>>>> some amount of packet parsing and when you have many chained programs
> >>>>>> >>>>> parsing the same packets this obviously becomes a bit wasteful. At the
> >>>>>> >>>>> same time it would be nice if one didn't need to manually merge
> >>>>>> >>>>> multiple programs together into a single one like this to get rid of
> >>>>>> >>>>> this duplicated parsing, or at least make that process of merging those
> >>>>>> >>>>> programs as simple as possible.
> >>>>>> >>>>>
> >>>>>> >>>>>
> >>>>>> >>>>> > -Toke
> >>>>>> >>>>> >
> >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
> >>>>>> >>>>>
> >>>>>> >>>>
> >>>>>> >>>> _______________________________________________
> >>>>>> >>>> LibreQoS mailing list
> >>>>>> >>>> LibreQoS@lists.bufferbloat.net
> >>>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>> --
> >>>>>> >>> Robert Chacón
> >>>>>> >>> CEO | JackRabbit Wireless LLC
> >>>>>> >
> >>>>>> > _______________________________________________
> >>>>>> > LibreQoS mailing list
> >>>>>> > LibreQoS@lists.bufferbloat.net
> >>>>>> > https://lists.bufferbloat.net/listinfo/libreqos
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> This song goes out to all the folk that thought Stadia would work:
> >>>>>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> >>>>>> Dave Täht CEO, TekLibre, LLC
> >>>>
> >>>> _______________________________________________
> >>>> LibreQoS mailing list
> >>>> LibreQoS@lists.bufferbloat.net
> >>>> https://lists.bufferbloat.net/listinfo/libreqos
> >>>
> >>>
> >>>
> >>> --
> >>> Robert Chacón
> >>> CEO | JackRabbit Wireless LLC
> >>> _______________________________________________
> >>> LibreQoS mailing list
> >>> LibreQoS@lists.bufferbloat.net
> >>> https://lists.bufferbloat.net/listinfo/libreqos
> >
> > _______________________________________________
> > LibreQoS mailing list
> > LibreQoS@lists.bufferbloat.net
> > https://lists.bufferbloat.net/listinfo/libreqos
>
>
>
> --
> This song goes out to all the folk that thought Stadia would work:
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> Dave Täht CEO, TekLibre, LLC



-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-19 16:13                           ` Dave Taht
@ 2022-10-22 14:32                             ` Herbert Wolverson
  2022-10-22 14:44                               ` Dave Taht
  2022-10-22 14:47                               ` Robert Chacón
  0 siblings, 2 replies; 20+ messages in thread
From: Herbert Wolverson @ 2022-10-22 14:32 UTC (permalink / raw)
  Cc: libreqos

[-- Attachment #1: Type: text/plain, Size: 30607 bytes --]

This morning I tested cpumap-pping with live customers!
A little over 1,200 mapped IP addresses, about 600 Mbps of real traffic
flowing through a big hierarchy of 52 sites. (600 Mbps is our "quiet time" traffic.)

It started very well: the updated xdp-cpumap system dropped in place and
the system worked as
before. xdp_pping started to show data with correct mappings. CPU load from
the mapping
system is within 1% of where it was before.

After about 20 minutes of continuous execution, it started to run into some
scaling issues.
The shaping system continued to run wonderfully, and CPU load was still
fine. However,
it stopped reporting latency data! A bit of debugging showed that once you
exceed
16,384 in-flight TCP streams it isn't handling the "map full" situation
gracefully - and
clearing the map from userspace isn't working correctly. So I hacked away
and hacked
away.

Anyway, it turns out that it does in fact work fine at that scale. There's
just a one-line
bug in the xdp_pping.c file. I forgot to actually *call* one line of packet
cleanup code.
Adding that, and everything was awesome.

The entire patch that fixed it consists of adding one line:
cleanup_packet_ts(packet_ts);

Oops.
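
For anyone who wants to do something similar from userspace, the general shape
of that kind of periodic cleanup pass is below. This is a minimal libbpf sketch,
not the actual xdp_pping.c code - the pinned map path, key/value layouts and the
timeout are assumptions:

  #include <stdbool.h>
  #include <time.h>
  #include <unistd.h>
  #include <linux/types.h>
  #include <bpf/bpf.h>

  struct ts_key   { __u8 id[32]; };  /* placeholder packet/flow identifier */
  struct ts_value { __u64 ns; };     /* timestamp recorded when the packet left */

  /* Walk a pinned hash map and delete entries older than max_age_ns. */
  static void sweep_stale(const char *pin_path, __u64 max_age_ns)
  {
      int fd = bpf_obj_get(pin_path);   /* e.g. "/sys/fs/bpf/packet_ts" (assumed) */
      if (fd < 0)
          return;

      struct timespec now;
      clock_gettime(CLOCK_MONOTONIC, &now);
      __u64 now_ns = (__u64)now.tv_sec * 1000000000ULL + now.tv_nsec;

      struct ts_key key, next;
      struct ts_value val;
      bool have = (bpf_map_get_next_key(fd, NULL, &key) == 0);

      while (have) {
          /* fetch the next key first, so deleting the current one is safe */
          have = (bpf_map_get_next_key(fd, &key, &next) == 0);
          if (bpf_map_lookup_elem(fd, &key, &val) == 0 &&
              now_ns - val.ns > max_age_ns)
              bpf_map_delete_elem(fd, &key);
          key = next;
      }
      close(fd);
  }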

Anyway, with that in place it's running superbly. I did identify a couple
of places in
which it's being overly verbose with debug information, so I've patched
that also.

After reducing the overly eager warning about not being able to read a TCP
header,
CPU performance improved by another 2% on average.

Longer-term (i.e. not on a Saturday morning, when I'd rather be playing
with my
daughter!), I think I'll look at raising some of the buffer sizes.

Thanks,
Herbert

On Wed, Oct 19, 2022 at 11:13 AM Dave Taht <dave.taht@gmail.com> wrote:

> PS - today's (free) p99 conference is *REALLY AWESOME*.
> https://www.p99conf.io/
>
> On Wed, Oct 19, 2022 at 9:13 AM Dave Taht <dave.taht@gmail.com> wrote:
> >
> > flent outputs a flent.gz file that I can parse and plot 20 different
> > ways. Also the graphing tools work on osx
> >
> > On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS
> > <libreqos@lists.bufferbloat.net> wrote:
> > >
> > > That's true. The 12th gen does seem to have some "special" features...
> makes for a nice writing platform
> > > (this box is primarily my "write books and articles" machine). I'll be
> doing a wider test on a more normal
> > > platform, probably at the weekend (with real traffic, hence the delay
> - have to find a time in which I
> > > minimize disruption)
> > >
> > > On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:
> > >>
> > >> Those 'efficiency' threads in Intel 12th gen should probably be
> addressed as well.  You can't turn them off in BIOS.
> > >>
> > >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <
> libreqos@lists.bufferbloat.net> wrote:
> > >>>
> > >>> Awesome work on this!
> > >>> I suspect there should be a slight performance bump once
> Hyperthreading is disabled and efficient power management is off.
> > >>> Hyperthreading/SMT always messes with HTB performance when I leave
> it on. Thank you for mentioning that - I now went ahead and added
> instructions on disabling hyperthreading on the Wiki for new users.
> > >>> Super promising results!
> > >>> Interested to see what throughput is with xdp-cpumap-tc vs
> cpumap-pping. So far in your VM setup it seems to be doing very well.
> > >>>
> > >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <
> libreqos@lists.bufferbloat.net> wrote:
> > >>>>
> > >>>> Also, I forgot to mention that I *think* the current version has
> removed the requirement that the inbound
> > >>>> and outbound classifiers be placed on the same CPU. I know interduo
> was particularly keen on packing
> > >>>> upload into fewer cores. I'll add that to my list of things to test.
> > >>>>
> > >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <
> herberticus@gmail.com> wrote:
> > >>>>>
> > >>>>> I'll definitely take a look - that does look interesting. I don't
> have X11 on any of my test VMs, but
> > >>>>> it looks like it can work without the GUI.
> > >>>>>
> > >>>>> Thanks!
> > >>>>>
> > >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com>
> wrote:
> > >>>>>>
> > >>>>>> could I coax you to adopt flent?
> > >>>>>>
> > >>>>>> apt-get install flent netperf irtt fping
> > >>>>>>
> > >>>>>> You sometimes have to compile netperf yourself with --enable-demo
> on
> > >>>>>> some systems.
> > >>>>>> There are a bunch of Python libs needed for the GUI, but only on
> the client.
> > >>>>>>
> > >>>>>> Then you can run a really gnarly test series and plot the results
> over time.
> > >>>>>>
> > >>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
> > >>>>>> the_server_name rrul # 110 other tests
> > >>>>>>
> > >>>>>>
> > >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
> > >>>>>> <libreqos@lists.bufferbloat.net> wrote:
> > >>>>>> >
> > >>>>>> > Hey,
> > >>>>>> >
> > >>>>>> > Testing the current version (
> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
> than I hoped. This build has shared (not per-cpu) maps, and a userspace
> daemon (xdp_pping) to extract and reset stats.
> > >>>>>> >
> > >>>>>> > My testing environment has grown a bit:
> > >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
> cpumap-pping-hackjob version of xdp-cpumap.
> > >>>>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an
> iperf server.
> > >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as
> 10.64.1.2. Hosts iperf client.
> > >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as
> 10.64.1.3. Hosts iperf client.
> > >>>>>> >
> > >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of
> ShaperVM are on a virtual switch.
> > >>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on
> a different virtual switch.
> > >>>>>> >
> > >>>>>> > These are all on a host machine running Windows 11: a 12th-gen
> Core i7, 32 GB RAM, and a fast SSD setup.
> > >>>>>> >
> > >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
> > >>>>>> >
> > >>>>>> > For this test, LibreQoS is configured:
> > >>>>>> > * Two APs, each with 5gbit/s max.
> > >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to
> about 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> > >>>>>> > * Set to use Cake
> > >>>>>> >
> > >>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1
> -t 500 (for a long run). Running xdp_pping yields correct results:
> > >>>>>> >
> > >>>>>> > [
> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> > >>>>>> > {}]
> > >>>>>> >
> > >>>>>> > Or when I waited a while to gather/reset:
> > >>>>>> >
> > >>>>>> > [
> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> > >>>>>> > {}]
> > >>>>>> >
> > >>>>>> > The ShaperVM shows no errors, just periodic logging that it is
> recording data.  CPU is about 2-3% on two CPUs, zero on the others (as
> expected).
> > >>>>>> >
> > >>>>>> > After 500 seconds of continual iperfing, each client reported a
> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
> > >>>>>> >
> > >>>>>> > So for smaller streams, I'd call this a success.
> > >>>>>> >
> > >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
> > >>>>>> >
> > >>>>>> > For this test, LibreQoS is configured:
> > >>>>>> > * Two APs, each with 5gb/s max.
> > >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to
> 5Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
> > >>>>>> >
> > >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
> > >>>>>> >
> > >>>>>> > xdp_pping shows results, too:
> > >>>>>> >
> > >>>>>> > [
> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> > >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> > >>>>>> > {}]
> > >>>>>> >
> > >>>>>> > [
> > >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> > >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> > >>>>>> > {}]
> > >>>>>> >
> > >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
> > >>>>>> >
> > >>>>>> > After 500 seconds of continual iperfing, the two clients reported
> throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes) respectively.
> > >>>>>> >
> > >>>>>> > Maxing out HyperV like this is inducing a bit of latency (which
> is to be expected), but it's not bad. I also forgot to disable
> hyperthreading, and looking at the host performance it is sometimes running
> the second virtual CPU on an underpowered "fake" CPU.
> > >>>>>> >
> > >>>>>> > So for two large streams, I think we're doing pretty well also!
> > >>>>>> >
> > >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
> > >>>>>> >
> > >>>>>> > This test is designed to try and blow things up. It's the same
> as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5
> and 1:6.
> > >>>>>> >
> > >>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were
> idle. The pping stats start to show a bit of degradation in performance for
> pounding it so hard:
> > >>>>>> >
> > >>>>>> > [
> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> > >>>>>> > {}]
> > >>>>>> >
> > >>>>>> > For whatever reason, it smoothed out over time:
> > >>>>>> >
> > >>>>>> > [
> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> > >>>>>> > {}]
> > >>>>>> >
> > >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client
> received 2.22 Gbit/s performance, over 129 Gbytes of data.
> > >>>>>> >
> > >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
> > >>>>>> >
> > >>>>>> > This test is also designed to break things. Same as test 3, but
> using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really
> tax the flow tracking. (Shorter time window because I really wanted to go
> and find coffee)
> > >>>>>> >
> > >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping
> results show that this torture test is worsening performance, and there's
> always lots of samples in the buffer:
> > >>>>>> >
> > >>>>>> > [
> > >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" :
> 49},
> > >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" :
> 49},
> > >>>>>> > {}]
> > >>>>>> >
> > >>>>>> > This test also ran better than I expected. You can definitely
> see some latency creeping in as I make the system work hard. Each VM showed
> around 2.4 Gbit/s in total performance at the end of the iperf session.
> There's definitely some latency creeping in, which is expected - but I'm
> not sure I expected quite that much.
> > >>>>>> >
> > >>>>>> > WHAT'S NEXT & CONCLUSION
> > >>>>>> >
> > >>>>>> > I noticed that I forgot to turn off efficient power management
> on my VMs and host, and left Hyperthreading on by mistake. So that hurts
> overall performance.
> > >>>>>> >
> > >>>>>> > The base system seems to be working pretty solidly, at least
> for small tests. Next up, I'll be removing extraneous debug reporting code,
> removing some code paths that don't do anything but report, and looking for
> any small optimization opportunities. I'll then re-run these tests. Once
> that's done, I hope to find a maintenance window on my WISP and try it with
> actual traffic.
> > >>>>>> >
> > >>>>>> > I also need to re-run these tests without the pping system to
> provide some before/after analysis.
> > >>>>>> >
> > >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <
> herberticus@gmail.com> wrote:
> > >>>>>> >>
> > >>>>>> >> It's probably not entirely thread-safe right now (ran into
> some issues reading per_cpu maps back from userspace; hopefully, I'll get
> that figured out) - but the commits I just pushed have it basically working
> on single-stream testing. :-)
> > >>>>>> >>
> > >>>>>> >> Setup cpumap as usual, and periodically run xdp-pping. This
> gives you per-connection RTT information in JSON:
> > >>>>>> >>
> > >>>>>> >> [
> > >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
> > >>>>>> >> {}]
> > >>>>>> >>
> > >>>>>> >> (With the extra {} because I'm not tracking the tail and
> haven't done comma removal). The tool also empties the various maps used to
> gather data, acting as a "reset" point. There's a max of 60 samples per
> queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
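> > >>>>>> >>
> > >>>>>> >> (A trivial way to lose the trailing {} - shown purely as a sketch with made-up
> > >>>>>> >> function names, not the actual xdp_pping code - is to track whether anything has
> > >>>>>> >> been printed yet and emit the separator before each entry instead of after it:)
> > >>>>>> >>
> > >>>>>> >>   #include <stdbool.h>
> > >>>>>> >>   #include <stdio.h>
> > >>>>>> >>
> > >>>>>> >>   static bool first = true;
> > >>>>>> >>
> > >>>>>> >>   /* prints "[" before the first record, "," before every later one */
> > >>>>>> >>   static void emit_entry(const char *tc, int avg, int min, int max, int samples)
> > >>>>>> >>   {
> > >>>>>> >>       printf("%s\n{\"tc\":\"%s\", \"avg\" : %d, \"min\" : %d, \"max\" : %d, \"samples\" : %d}",
> > >>>>>> >>              first ? "[" : ",", tc, avg, min, max, samples);
> > >>>>>> >>       first = false;
> > >>>>>> >>   }
> > >>>>>> >>
> > >>>>>> >>   /* after walking the maps:                                        */
> > >>>>>> >>   /*   if (first) printf("[");   handles the no-samples case        */
> > >>>>>> >>   /*   printf("\n]\n");          closes valid JSON, no {} needed    */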
> > >>>>>> >>
> > >>>>>> >> I'll start trying to test on a larger scale now.
> > >>>>>> >>
> > >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
> robert.chacon@jackrabbitwireless.com> wrote:
> > >>>>>> >>>
> > >>>>>> >>> Hey Herbert,
> > >>>>>> >>>
> > >>>>>> >>> Fantastic work! Super exciting to see this coming together,
> especially so quickly.
> > >>>>>> >>> I'll test it soon.
> > >>>>>> >>> I understand and agree with your decision to omit certain
> features (ICMP tracking, DNS tracking, etc.) to optimize performance for our
> use case. Like you said, in order to merge the functionality without a
> performance hit, merging them is sort of the only way right now. Otherwise
> there would be a lot of redundancy and lost throughput for an ISP's use.
> Though hopefully long term there will be a way to keep all projects working
> independently but interoperably with a plugin system of some kind.
> > >>>>>> >>>
> > >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3.
> Focusing on optimizations for high sub counts (8000+ subs) as well as
> stateful changes to the queue structure.
> > >>>>>> >>> I'm working to set up a physical lab to test high throughput
> and high client count scenarios.
> > >>>>>> >>> When testing beyond ~32,000 filters we get "no space left on
> device" from xdp-cpumap-tc, which I think relates to the bpf map size
> limitation you mentioned. Maybe in the coming months we can take a look at
> that.
> > >>>>>> >>>
> > >>>>>> >>> Anyway great work on the cpumap-pping program! Excited to see
> more on this.
> > >>>>>> >>>
> > >>>>>> >>> Thanks,
> > >>>>>> >>> Robert
> > >>>>>> >>>
> > >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via
> LibreQoS <libreqos@lists.bufferbloat.net> wrote:
> > >>>>>> >>>>
> > >>>>>> >>>> Hey,
> > >>>>>> >>>>
> > >>>>>> >>>> My current (unfinished) progress on this is now available
> here: https://github.com/thebracket/cpumap-pping-hackjob
> > >>>>>> >>>>
> > >>>>>> >>>> I mean it about the warnings, this isn't at all stable,
> debugged - and can't promise that it won't unleash the nasal demons
> > >>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
> > >>>>>> >>>>
> > >>>>>> >>>> With that said, I'm pretty happy so far:
> > >>>>>> >>>>
> > >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has
> nicely shunted onto a dedicated CPU. It has to run on both
> > >>>>>> >>>>   the inbound and outbound classifiers, since otherwise it
> would only see half the conversation.
> > >>>>>> >>>> * It does assume that your ingress and egress CPUs are
> mapped to the same interface; I do that anyway in BracketQoS. Not doing
> > >>>>>> >>>>   that opens up a potential world of pain, since writes to
> the shared maps would require a locking scheme. Too much locking, and you
> lose all of the benefit of using multiple CPUs to begin with.
> > >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper
> systems I've worked with have lots of it.
> > >>>>>> >>>> * I've been gradually removing features that I don't want
> for BracketQoS. A hypothetical future "useful to everyone" version wouldn't
> do that.
> > >>>>>> >>>> * Rate limiting is working, but I removed the requirement
> for a shared configuration provided from userland - so right now it's
> always set to report at 1 second intervals per stream.
> > >>>>>> >>>>
> > >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client"
> and "world", and a "shaper" VM in between running a slightly hacked-up
> LibreQoS.
> > >>>>>> >>>> iperf from "client" to "world" (with Libre set to allow
> 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on
> my
> > >>>>>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb
> RAM and fast SSDs)
> > >>>>>> >>>>
> > >>>>>> >>>> Output currently consists of debug messages reading:
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222:
> bpf_trace_printk: (tc) Flow open event
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239:
> bpf_trace_printk: (tc) Send performance event (5,1), 374696
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466:
> bpf_trace_printk: (tc) Flow open event
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475:
> bpf_trace_printk: (tc) Send performance event (5,1), 247069
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151:
> bpf_trace_printk: (tc) Send performance event (5,1), 5217155
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248:
> bpf_trace_printk: (tc) Send performance event (5,1), 4515394
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117:
> bpf_trace_printk: (tc) Send performance event (5,1), 4481289
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255:
> bpf_trace_printk: (tc) Send performance event (5,1), 4255268
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864:
> bpf_trace_printk: (tc) Send performance event (5,1), 5249493
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664:
> bpf_trace_printk: (tc) Send performance event (5,1), 3795993
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469:
> bpf_trace_printk: (tc) Send performance event (5,1), 3949519
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126:
> bpf_trace_printk: (tc) Send performance event (5,1), 4365335
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929:
> bpf_trace_printk: (tc) Send performance event (5,1), 4154910
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048:
> bpf_trace_printk: (tc) Send performance event (5,1), 4405582
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080:
> bpf_trace_printk: (tc) Send flow event
> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714:
> bpf_trace_printk: (tc) Send flow event
> > >>>>>> >>>>
> > >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
> major/minor, allocated by the xdp-cpumap parent.
> > >>>>>> >>>> I get pretty low latency between VMs; I'll set up a test
> with some real-world data very soon.
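> > >>>>>> >>>>
> > >>>>>> >>>> (For anyone following along: those lines come out of the kernel trace pipe.
> > >>>>>> >>>> A sketch of the producing call and the reader, assuming bpf_printk-style
> > >>>>>> >>>> logging as the bpf_trace_printk prefix suggests; the variable names are
> > >>>>>> >>>> made up.)
> > >>>>>> >>>>
> > >>>>>> >>>>   /* in the tc/BPF program */
> > >>>>>> >>>>   bpf_printk("(tc) Send performance event (%d,%d), %u", maj, min, delta);
> > >>>>>> >>>>
> > >>>>>> >>>>   # on the shaper box
> > >>>>>> >>>>   cat /sys/kernel/debug/tracing/trace_pipe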
> > >>>>>> >>>>
> > >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
> > >>>>>> >>>>
> > >>>>>> >>>> Thanks,
> > >>>>>> >>>> Herbert
> > >>>>>> >>>>
> > >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
> Simon.Sundberg@kau.se> wrote:
> > >>>>>> >>>>>
> > >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple
> of quick
> > >>>>>> >>>>> notes.
> > >>>>>> >>>>>
> > >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen
> wrote:
> > >>>>>> >>>>> > [ Adding Simon to Cc ]
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > Herbert Wolverson via LibreQoS <
> libreqos@lists.bufferbloat.net> writes:
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > > Hey,
> > >>>>>> >>>>> > >
> > >>>>>> >>>>> > > I've had some pretty good success with merging
> xdp-pping (
> > >>>>>> >>>>> > >
> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> > >>>>>> >>>>> > > into xdp-cpumap-tc (
> https://github.com/xdp-project/xdp-cpumap-tc ).
> > >>>>>> >>>>> > >
> > >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then
> changed the entry point
> > >>>>>> >>>>> > > and packet parsing code to make use of the work already
> done in
> > >>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the
> packet, no need to do
> > >>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps,
> and had to pin them -
> > >>>>>> >>>>> > > otherwise the two tc instances don't properly share
> data.
> > >>>>>> >>>>> > >
> > >>>>>> >>>>>
> > >>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is
> processed on
> > >>>>>> >>>>> the same CPU core at both ingress and egress. Otherwise, if
> a flow may
> > >>>>>> >>>>> be processed by different cores on ingress and egress the
> per-CPU maps
> > >>>>>> >>>>> will not really work reliably as each core will have a
> different view
> > >>>>>> >>>>> on the state of the flow, if there's been a previous packet
> with a
> > >>>>>> >>>>> certain TSval from that flow etc.
> > >>>>>> >>>>>
> > >>>>>> >>>>> Furthermore, if a flow is always processed on the same core
> (on both
> > >>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit
> wasteful on
> > >>>>>> >>>>> memory. From my understanding the keys for per-CPU maps are
> still
> > >>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its
> own value. So
> > >>>>>> >>>>> all CPUs will then have their own data for each flow, but
> it's only the
> > >>>>>> >>>>> CPU processing the flow that will have any relevant data
> for the flow
> > >>>>>> >>>>> while the remaining CPUs will just have an empty state for
> that flow.
> > >>>>>> >>>>> Under the same assumption that packets within the same flow
> are always
> > >>>>>> >>>>> processed on the same core there should generally not be any
> > >>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map
> either as packets
> > >>>>>> >>>>> from the same flow cannot be processed concurrently then
> (and thus no
> > >>>>>> >>>>> concurrent access to the same value in the map). I am
> however still
> > >>>>>> >>>>> very unclear on if there's any considerable performance
> impact between
> > >>>>>> >>>>> global and per-CPU map versions if the same key is not
> accessed
> > >>>>>> >>>>> concurrently.
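> > >>>>>> >>>>>
> > >>>>>> >>>>> (For reference, the two variants differ only in the map type; the structs
> > >>>>>> >>>>> and sizes below are made up for the example. A PERCPU_HASH allocates one
> > >>>>>> >>>>> value per possible CPU for every key, so its memory footprint is roughly
> > >>>>>> >>>>> max_entries * num_possible_cpus * sizeof(value), while a plain HASH keeps
> > >>>>>> >>>>> a single shared value per key.)
> > >>>>>> >>>>>
> > >>>>>> >>>>>   #include <linux/bpf.h>
> > >>>>>> >>>>>   #include <bpf/bpf_helpers.h>
> > >>>>>> >>>>>
> > >>>>>> >>>>>   struct flow_key   { __u32 saddr, daddr; __u16 sport, dport; }; /* hypothetical */
> > >>>>>> >>>>>   struct flow_state { __u64 last_tsval; __u64 last_seen_ns; };   /* hypothetical */
> > >>>>>> >>>>>
> > >>>>>> >>>>>   struct {
> > >>>>>> >>>>>       __uint(type, BPF_MAP_TYPE_HASH);   /* swap in BPF_MAP_TYPE_PERCPU_HASH
> > >>>>>> >>>>>                                           * for the per-CPU variant */
> > >>>>>> >>>>>       __uint(max_entries, 16384);
> > >>>>>> >>>>>       __type(key, struct flow_key);
> > >>>>>> >>>>>       __type(value, struct flow_state);
> > >>>>>> >>>>>   } flow_state_map SEC(".maps");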
> > >>>>>> >>>>>
> > >>>>>> >>>>> > > Right now, output
> > >>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap
> output code. Instead,
> > >>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug
> pipe, so I can see
> > >>>>>> >>>>> > > roughly what the output would look like.
> > >>>>>> >>>>> > >
> > >>>>>> >>>>> > > With debug enabled and just logging I'm now getting
> about 4.9 Gbits/sec on
> > >>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM
> in the middle). :-)
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest
> source of
> > >>>>>> >>>>> > overhead, then. What Simon found was that sending the
> data from kernel
> > >>>>>> >>>>> > to userspace is one of the most expensive bits of epping,
> at least when
> > >>>>>> >>>>> > the number of data points goes up (which is does as
> additional flows are
> > >>>>>> >>>>> > added).
> > >>>>>> >>>>>
> > >>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them
> (you may get
> > >>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in
> terms of
> > >>>>>> >>>>> direct overhead from the tool itself, but also becomes
> demanding for
> > >>>>>> >>>>> whatever you use all those RTT samples for (i.e. need to
> log, parse,
> > >>>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal
> with that is
> > >>>>>> >>>>> of course to just apply some sort of sampling (the
> -r/--rate-limit and
> > >>>>>> >>>>> -R/--rtt-rate options).
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > > So my question: how would you prefer to receive this
> data? I'll have to
> > >>>>>> >>>>> > > write a daemon that provides userspace control
> (periodic cleanup as well as
> > >>>>>> >>>>> > > reading the performance stream), so the world's kinda
> our oyster. I can
> > >>>>>> >>>>> > > stick to Kathie's original format (and dump it to a
> named pipe, perhaps?),
> > >>>>>> >>>>> > > a condensed format that only shows what you want to
> use, an efficient
> > >>>>>> >>>>> > > binary format if you feel like parsing that...
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > It would be great if we could combine efforts a bit here
> so we don't
> > >>>>>> >>>>> > fork the codebase more than we have to. I.e., if
> "upstream" epping and
> > >>>>>> >>>>> > whatever daemon you end up writing can agree on data
> format etc that
> > >>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this
> :)
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > Briefly what I've discussed before with Simon was to have
> the ability to
> > >>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have
> a userspace
> > >>>>>> >>>>> > utility periodically pull them out. What we discussed was
> doing this
> > >>>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea
> would be that
> > >>>>>> >>>>> > userspace would populate the LPM map with the keys
> (prefixes) they
> > >>>>>> >>>>> > wanted statistics for (in LibreQOS context that could be
> one key per
> > >>>>>> >>>>> > customer, for instance). Epping would then do a map
> lookup into the LPM,
> > >>>>>> >>>>> > and if it gets a match it would update the statistics in
> that map entry
> > >>>>>> >>>>> > (keeping a histogram of latency values seen, basically).
> Simon's PR
> > >>>>>> >>>>> > below uses this technique where userspace will "reset"
> the histogram
> > >>>>>> >>>>> > every time it loads it by swapping out two different map
> entries when it
> > >>>>>> >>>>> > does a read; this allows you to control the sampling rate
> from
> > >>>>>> >>>>> > userspace, and you'll just get the data since the last
> time you polled.
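> > >>>>>> >>>>> >
> > >>>>>> >>>>> > (A sketch of the kind of LPM aggregation being described - the map name,
> > >>>>>> >>>>> > bucket scheme, sizes and the bucketing arithmetic are assumptions for
> > >>>>>> >>>>> > illustration, not Simon's PR:)
> > >>>>>> >>>>> >
> > >>>>>> >>>>> >   #include <linux/bpf.h>
> > >>>>>> >>>>> >   #include <bpf/bpf_helpers.h>
> > >>>>>> >>>>> >
> > >>>>>> >>>>> >   struct lpm_key {
> > >>>>>> >>>>> >       __u32 prefixlen;              /* required first field for LPM tries */
> > >>>>>> >>>>> >       __u32 addr;                   /* IPv4 address, network byte order */
> > >>>>>> >>>>> >   };
> > >>>>>> >>>>> >   struct rtt_hist { __u64 bucket[16]; };
> > >>>>>> >>>>> >
> > >>>>>> >>>>> >   struct {
> > >>>>>> >>>>> >       __uint(type, BPF_MAP_TYPE_LPM_TRIE);
> > >>>>>> >>>>> >       __uint(map_flags, BPF_F_NO_PREALLOC);  /* mandatory for LPM tries */
> > >>>>>> >>>>> >       __uint(max_entries, 16384);            /* one entry per customer prefix */
> > >>>>>> >>>>> >       __type(key, struct lpm_key);
> > >>>>>> >>>>> >       __type(value, struct rtt_hist);
> > >>>>>> >>>>> >   } rtt_by_prefix SEC(".maps");
> > >>>>>> >>>>> >
> > >>>>>> >>>>> >   /* called from the RTT path instead of pushing every sample to userspace */
> > >>>>>> >>>>> >   static __always_inline void record_rtt(__u32 daddr, __u64 rtt_ms)
> > >>>>>> >>>>> >   {
> > >>>>>> >>>>> >       struct lpm_key k = { .prefixlen = 32, .addr = daddr };
> > >>>>>> >>>>> >       struct rtt_hist *h = bpf_map_lookup_elem(&rtt_by_prefix, &k);
> > >>>>>> >>>>> >
> > >>>>>> >>>>> >       if (h) {
> > >>>>>> >>>>> >           __u32 b = rtt_ms < 16 ? rtt_ms : 15;  /* crude 1 ms-per-bucket bins */
> > >>>>>> >>>>> >           __sync_fetch_and_add(&h->bucket[b], 1);
> > >>>>>> >>>>> >       }
> > >>>>>> >>>>> >   }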
> > >>>>>> >>>>>
> > >>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the
> plan going
> > >>>>>> >>>>> forward. I will just note that this PR (and all my other
> work with
> > >>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be
> more or less
> > >>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to
> finish up a
> > >>>>>> >>>>> paper.
> > >>>>>> >>>>>
> > >>>>>> >>>>> > I was thinking that if we all can agree on the map
> format, then your
> > >>>>>> >>>>> > polling daemon could be one userspace "client" for that,
> and the epping
> > >>>>>> >>>>> > binary itself could be another; but we could keep
> compatibility between
> > >>>>>> >>>>> > the two, so we don't duplicate effort.
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it
> can be plugged
> > >>>>>> >>>>> > into the cpumap-tc code would be a good goal...
> > >>>>>> >>>>>
> > >>>>>> >>>>> Should probably do that...at some point. In general I think
> it's a bit
> > >>>>>> >>>>> of an interesting problem to think about how to chain
> multiple XDP/tc
> > >>>>>> >>>>> programs together in an efficient way. Most XDP and tc
> programs will do
> > >>>>>> >>>>> some amount of packet parsing and when you have many
> chained programs
> > >>>>>> >>>>> parsing the same packets this obviously becomes a bit
> wasteful. In the
> > >>>>>> >>>>> same time it would be nice if one didn't need to manually
> merge
> > >>>>>> >>>>> multiple programs together into a single one like this to
> get rid of
> > >>>>>> >>>>> this duplicated parsing, or at least make that process of
> merging those
> > >>>>>> >>>>> programs as simple as possible.
> > >>>>>> >>>>>
> > >>>>>> >>>>>
> > >>>>>> >>>>> > -Toke
> > >>>>>> >>>>> >
> > >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
> > >>>>>> >>>>>
> > >>>>>> >>>>> När du skickar e-post till Karlstads universitet behandlar
> vi dina personuppgifter<https://www.kau.se/gdpr>.
> > >>>>>> >>>>> When you send an e-mail to Karlstad University, we will
> process your personal data<https://www.kau.se/en/gdpr>.
> > >>>>>> >>>>
> > >>>>>> >>>> _______________________________________________
> > >>>>>> >>>> LibreQoS mailing list
> > >>>>>> >>>> LibreQoS@lists.bufferbloat.net
> > >>>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
> > >>>>>> >>>
> > >>>>>> >>>
> > >>>>>> >>>
> > >>>>>> >>> --
> > >>>>>> >>> Robert Chacón
> > >>>>>> >>> CEO | JackRabbit Wireless LLC
> > >>>>>> >
> > >>>>>> > _______________________________________________
> > >>>>>> > LibreQoS mailing list
> > >>>>>> > LibreQoS@lists.bufferbloat.net
> > >>>>>> > https://lists.bufferbloat.net/listinfo/libreqos
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> This song goes out to all the folk that thought Stadia would work:
> > >>>>>>
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> > >>>>>> Dave Täht CEO, TekLibre, LLC
> > >>>>
> > >>>> _______________________________________________
> > >>>> LibreQoS mailing list
> > >>>> LibreQoS@lists.bufferbloat.net
> > >>>> https://lists.bufferbloat.net/listinfo/libreqos
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Robert Chacón
> > >>> CEO | JackRabbit Wireless LLC
> > >>> _______________________________________________
> > >>> LibreQoS mailing list
> > >>> LibreQoS@lists.bufferbloat.net
> > >>> https://lists.bufferbloat.net/listinfo/libreqos
> > >
> > > _______________________________________________
> > > LibreQoS mailing list
> > > LibreQoS@lists.bufferbloat.net
> > > https://lists.bufferbloat.net/listinfo/libreqos
> >
> >
> >
> > --
> > This song goes out to all the folk that thought Stadia would work:
> >
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> > Dave Täht CEO, TekLibre, LLC
>
>
>
> --
> This song goes out to all the folk that thought Stadia would work:
>
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> Dave Täht CEO, TekLibre, LLC
>

[-- Attachment #2: Type: text/html, Size: 47439 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-22 14:32                             ` Herbert Wolverson
@ 2022-10-22 14:44                               ` Dave Taht
  2022-10-22 14:47                               ` Robert Chacón
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Taht @ 2022-10-22 14:44 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: libreqos

On Sat, Oct 22, 2022 at 7:32 AM Herbert Wolverson via LibreQoS
<libreqos@lists.bufferbloat.net> wrote:
>
> This morning I tested cpumap-pping with live customers!
> A little over 1,200 mapped IP addresses, about 600 mbps of real traffic flowing through a big
> hierarchy of 52 sites. (600 is our "quiet time" traffic)
>
> It started very well: the updated xdp-cpumap system dropped in place and the system worked as
> before. xdp_pping started to show data with correct mappings. CPU load from the mapping
> system is within 1% of where it was before.
>
> After about 20 minutes of continuous execution, it started to run into some scaling issues.
> The shaping system continued to run wonderfully, and CPU load was still fine. However,
> it stopped reporting latency data! A bit of debugging showed that once you exceed
> 16,384 in-flight TCP streams it isn't handling the "map full" situation gracefully - and
> clearing the map from userspace isn't working correctly. So I hacked away and hacked
> away.
>
> Anyway, it turns out that it does in fact work fine at that scale. There's just a one-line
> bug in the xdp_pping.c file. I forgot to actually *call* one line of packet cleanup code.
> Adding that, and everything was awesome.
>
> The entire patch that fixed it consists of adding one line:
> cleanup_packet_ts(packet_ts);
>
> Oops.
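>
> (For illustration only: a minimal sketch of what a userspace cleanup pass
> along these lines could look like with libbpf. The pin path, map name and
> the key/value layout are assumptions for the example, not the actual
> cpumap-pping code.)
>
>   #include <bpf/bpf.h>
>   #include <stdint.h>
>   #include <time.h>
>
>   struct ts_key   { uint32_t flow_hash; uint32_t tsval; };  /* hypothetical */
>   struct ts_value { uint64_t time_ns; };                    /* hypothetical */
>
>   /* Walk a pinned hash map and delete entries older than max_age_ns.
>    * Deleting while iterating with bpf_map_get_next_key() can restart the
>    * walk, which is acceptable for a periodic best-effort cleanup. */
>   static void cleanup_packet_ts(int map_fd, uint64_t max_age_ns)
>   {
>       struct ts_key key, next;
>       struct ts_value val;
>       struct timespec ts;
>       void *prev = NULL;
>       uint64_t now;
>
>       clock_gettime(CLOCK_MONOTONIC, &ts);
>       now = (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
>
>       while (bpf_map_get_next_key(map_fd, prev, &next) == 0) {
>           if (bpf_map_lookup_elem(map_fd, &next, &val) == 0 &&
>               now - val.time_ns > max_age_ns)
>               bpf_map_delete_elem(map_fd, &next);
>           key = next;
>           prev = &key;
>       }
>   }
>
>   /* e.g. cleanup_packet_ts(bpf_obj_get("/sys/fs/bpf/tc/globals/packet_ts"),
>    *                        2ULL * 1000000000ULL);   // drop entries older than ~2 s */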

:woot:

>
> Anyway, with that in place it's running superbly. I did identify a couple of places in
> which it's being overly verbose with debug information, so I've patched that also.

> After reducing the overly eager warning about not being able to read a TCP header,
> CPU performance improved by another 2% on average.

I note I am VERY interested, science-wise, in the actual incidence
of malformed TCP headers,
mistaken or incorrect ECN-related congestion responses, etc.

This is stuff endpoints just arbitrarily drop on the floor. Always
interested in the anomalies.

https://www.youtube.com/watch?v=K1jasTyGLr8

>
> Longer-term (i.e. not on a Saturday morning, when I'd rather be playing with my
> daughter!), I think I'll look at raising some of the buffer sizes.
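>
> (Again only a sketch of the kind of change meant here, with assumed names and
> sizes: in a libbpf BTF-style map definition, "raising the buffer sizes" mostly
> means bumping max_entries on the flow/timestamp maps and paying the extra
> memory for it.)
>
>   /* BPF side - hypothetical packet-timestamp map, same assumed key/value
>    * structs as in the sketch above */
>   #include <linux/bpf.h>
>   #include <bpf/bpf_helpers.h>
>
>   struct {
>       __uint(type, BPF_MAP_TYPE_HASH);
>       __uint(max_entries, 65536);            /* e.g. up from 16384 */
>       __type(key, struct ts_key);
>       __type(value, struct ts_value);
>       __uint(pinning, LIBBPF_PIN_BY_NAME);   /* shared by both tc directions */
>   } packet_ts SEC(".maps");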
>
> Thanks,
> Herbert
>
> On Wed, Oct 19, 2022 at 11:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>>
>> PS - today's (free) p99 conference is *REALLY AWESOME*. https://www.p99conf.io/
>>
>> On Wed, Oct 19, 2022 at 9:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>> >
>> > flent outputs a flent.gz file that I can parse and plot 20 different
>> > ways. Also the graphing tools work on osx
>> >
>> > On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS
>> > <libreqos@lists.bufferbloat.net> wrote:
>> > >
>> > > That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform
>> > > (this box is primarily my "write books and articles" machine). I'll be doing a wider test on a more normal
>> > > platform, probably at the weekend (with real traffic, hence the delay - have to find a time in which I
>> > > minimize disruption)
>> > >
>> > > On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:
>> > >>
>> > >> Those 'efficiency' threads in Intel 12th gen should probably be addressed as well.  You can't turn them off in BIOS.
>> > >>
>> > >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>
>> > >>> Awesome work on this!
>> > >>> I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off.
>> > >>> Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I now went ahead and added instructions on disabling hyperthreading on the Wiki for new users.
>> > >>> Super promising results!
>> > >>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
>> > >>>
>> > >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>
>> > >>>> Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound
>> > >>>> and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing
>> > >>>> upload into fewer cores. I'll add that to my list of things to test.
>> > >>>>
>> > >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com> wrote:
>> > >>>>>
>> > >>>>> I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but
>> > >>>>> it looks like it can work without the GUI.
>> > >>>>>
>> > >>>>> Thanks!
>> > >>>>>
>> > >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>> could I coax you to adopt flent?
>> > >>>>>>
>> > >>>>>> apt-get install flent netperf irtt fping
>> > >>>>>>
>> > >>>>>> You sometimes have to compile netperf yourself with --enable-demo on
>> > >>>>>> some systems.
>> > >>>>>> There are a bunch of python libs needed for the gui, but only on the client.
>> > >>>>>>
>> > >>>>>> Then you can run a really gnarly test series and plot the results over time.
>> > >>>>>>
>> > >>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>> > >>>>>> the_server_name rrul # 110 other tests
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>> > >>>>>> <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >
>> > >>>>>> > Hey,
>> > >>>>>> >
>> > >>>>>> > Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
>> > >>>>>> >
>> > >>>>>> > My testing environment has grown a bit:
>> > >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
>> > >>>>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf server.
>> > >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2. Hosts iperf client.
>> > >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3. Hosts iperf client.
>> > >>>>>> >
>> > >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
>> > >>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
>> > >>>>>> >
>> > >>>>>> > These are all on a host machine running Windows 11, a core i7 12th gen, 32 Gb RAM and fast SSD setup.
>> > >>>>>> >
>> > >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5gbit/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> > * Set to use Cake
>> > >>>>>> >
>> > >>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Or when I waited a while to gather/reset:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows no errors, just periodic logging that it is recording data.  CPU is about 2-3% on two CPUs, zero on the others (as expected).
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>> > >>>>>> >
>> > >>>>>> > So for smaller streams, I'd call this a success.
>> > >>>>>> >
>> > >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5gb/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> >
>> > >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>> > >>>>>> >
>> > >>>>>> > xdp_pping shows results, too:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>> > >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>> > >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes), respectively.
>> > >>>>>> >
>> > >>>>>> > Maxing out HyperV like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
>> > >>>>>> >
>> > >>>>>> > So for two large streams, I think we're doing pretty well also!
>> > >>>>>> >
>> > >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>> > >>>>>> >
>> > >>>>>> > This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The pping stats start to show a bit of degradation in performance for pounding it so hard:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > For whatever reason, it smoothed out over time:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s performance, over 129 Gbytes of data.
>> > >>>>>> >
>> > >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>> > >>>>>> >
>> > >>>>>> > This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee)
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there's always lots of samples in the buffer:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>> > >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > This test also ran better than I expected. You can definitely see some latency creeping in as I make the system work hard. Each VM showed around 2.4 Gbit/s in total performance at the end of the iperf session. There's definitely some latency creeping in, which is expected - but I'm not sure I expected quite that much.
>> > >>>>>> >
>> > >>>>>> > WHAT'S NEXT & CONCLUSION
>> > >>>>>> >
>> > >>>>>> > I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
>> > >>>>>> >
>> > >>>>>> > The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
>> > >>>>>> >
>> > >>>>>> > I also need to re-run these tests without the pping system to provide some before/after analysis.
>> > >>>>>> >
>> > >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com> wrote:
>> > >>>>>> >>
>> > >>>>>> >> It's probably not entirely thread-safe right now (ran into some issues reading per_cpu maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working on single-stream testing. :-)
>> > >>>>>> >>
>> > >>>>>> >> Setup cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
>> > >>>>>> >>
>> > >>>>>> >> [
>> > >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>> > >>>>>> >> {}]
>> > >>>>>> >>
>> > >>>>>> >> (With the extra {} because I'm not tracking the tail and haven't done comma removal). The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
>> > >>>>>> >>
>> > >>>>>> >> I'll start trying to test on a larger scale now.
>> > >>>>>> >>
>> > >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <robert.chacon@jackrabbitwireless.com> wrote:
>> > >>>>>> >>>
>> > >>>>>> >>> Hey Herbert,
>> > >>>>>> >>>
>> > >>>>>> >>> Fantastic work! Super exciting to see this coming together, especially so quickly.
>> > >>>>>> >>> I'll test it soon.
>> > >>>>>> >>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. Like you said, in order to merge the functionality without a performance hit, merging them is sort of the only way right now. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all projects working independently but interoperably with a plugin system of some kind.
>> > >>>>>> >>>
>> > >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
>> > >>>>>> >>> I'm working to set up a physical lab to test high throughput and high client count scenarios.
>> > >>>>>> >>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the bpf map size limitation you mentioned. Maybe in the coming months we can take a look at that.
>> > >>>>>> >>>
>> > >>>>>> >>> Anyway great work on the cpumap-pping program! Excited to see more on this.
>> > >>>>>> >>>
>> > >>>>>> >>> Thanks,
>> > >>>>>> >>> Robert
>> > >>>>>> >>>
>> > >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >>>>
>> > >>>>>> >>>> Hey,
>> > >>>>>> >>>>
>> > >>>>>> >>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
>> > >>>>>> >>>>
>> > >>>>>> >>>> I mean it about the warnings, this isn't at all stable, debugged - and can't promise that it won't unleash the nasal demons
>> > >>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>> > >>>>>> >>>>
>> > >>>>>> >>>> With that said, I'm pretty happy so far:
>> > >>>>>> >>>>
>> > >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both
>> > >>>>>> >>>>   the inbound and outbound classifiers, since otherwise it would only see half the conversation.
>> > >>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing
>> > >>>>>> >>>>   that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
>> > >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
>> > >>>>>> >>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
>> > >>>>>> >>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.
>> > >>>>>> >>>>
>> > >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>> > >>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>> > >>>>>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and fast SSDs)
>> > >>>>>> >>>>
>> > >>>>>> >>>> Output currently consists of debug messages reading:
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event
>> > >>>>>> >>>>
>> > >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle major/minor, allocated by the xdp-cpumap parent.
>> > >>>>>> >>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
>> > >>>>>> >>>>
>> > >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>> > >>>>>> >>>>
>> > >>>>>> >>>> Thanks,
>> > >>>>>> >>>> Herbert
>> > >>>>>> >>>>
>> > >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se> wrote:
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>> > >>>>>> >>>>> notes.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>> > >>>>>> >>>>> > [ Adding Simon to Cc ]
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > > Hey,
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>> > >>>>>> >>>>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>> > >>>>>> >>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the entry point
>> > >>>>>> >>>>> > > and packet parsing code to make use of the work already done in
>> > >>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
>> > >>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin them -
>> > >>>>>> >>>>> > > otherwise the two tc instances don't properly share data.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>> > >>>>>> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow may
>> > >>>>>> >>>>> be processed by different cores on ingress and egress the per-CPU maps
>> > >>>>>> >>>>> will not really work reliably as each core will have a different view
>> > >>>>>> >>>>> on the state of the flow, if there's been a previous packet with a
>> > >>>>>> >>>>> certain TSval from that flow etc.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on both
>> > >>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>> > >>>>>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>> > >>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its own value. So
>> > >>>>>> >>>>> all CPUs will then have their own data for each flow, but it's only the
>> > >>>>>> >>>>> CPU processing the flow that will have any relevant data for the flow
>> > >>>>>> >>>>> while the remaining CPUs will just have an empty state for that flow.
>> > >>>>>> >>>>> Under the same assumption that packets within the same flow are always
>> > >>>>>> >>>>> processed on the same core there should generally not be any
>> > >>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map either as packets
>> > >>>>>> >>>>> from the same flow cannot be processed concurrently then (and thus no
>> > >>>>>> >>>>> concurrent access to the same value in the map). I am however still
>> > >>>>>> >>>>> very unclear on if there's any considerable performance impact between
>> > >>>>>> >>>>> global and per-CPU map versions if the same key is not accessed
>> > >>>>>> >>>>> concurrently.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > > Right now, output
>> > >>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap output code. Instead,
>> > >>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
>> > >>>>>> >>>>> > > roughly what the output would look like.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
>> > >>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>> > >>>>>> >>>>> > overhead, then. What Simon found was that sending the data from kernel
>> > >>>>>> >>>>> > to userspace is one of the most expensive bits of epping, at least when
>> > >>>>>> >>>>> > the number of data points goes up (which is does as additional flows are
>> > >>>>>> >>>>> > added).
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may get
>> > >>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>> > >>>>>> >>>>> direct overhead from the tool itself, but also becomes demanding for
>> > >>>>>> >>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
>> > >>>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with that is
>> > >>>>>> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit and
>> > >>>>>> >>>>> -R/--rtt-rate options).
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > > So my question: how would you prefer to receive this data? I'll have to
>> > >>>>>> >>>>> > > write a daemon that provides userspace control (periodic cleanup as well as
>> > >>>>>> >>>>> > > reading the performance stream), so the world's kinda our oyster. I can
>> > >>>>>> >>>>> > > stick to Kathie's original format (and dump it to a named pipe, perhaps?),
>> > >>>>>> >>>>> > > a condensed format that only shows what you want to use, an efficient
>> > >>>>>> >>>>> > > binary format if you feel like parsing that...
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > It would be great if we could combine efforts a bit here so we don't
>> > >>>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>> > >>>>>> >>>>> > whatever daemon you end up writing can agree on data format etc that
>> > >>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Briefly what I've discussed before with Simon was to have the ability to
>> > >>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>> > >>>>>> >>>>> > utility periodically pull them out. What we discussed was doing this
>> > >>>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would be that
>> > >>>>>> >>>>> > userspace would populate the LPM map with the keys (prefixes) they
>> > >>>>>> >>>>> > wanted statistics for (in LibreQOS context that could be one key per
>> > >>>>>> >>>>> > customer, for instance). Epping would then do a map lookup into the LPM,
>> > >>>>>> >>>>> > and if it gets a match it would update the statistics in that map entry
>> > >>>>>> >>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
>> > >>>>>> >>>>> > below uses this technique where userspace will "reset" the histogram
>> > >>>>>> >>>>> > every time it loads it by swapping out two different map entries when it
>> > >>>>>> >>>>> > does a read; this allows you to control the sampling rate from
>> > >>>>>> >>>>> > userspace, and you'll just get the data since the last time you polled.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan going
>> > >>>>>> >>>>> forward. I will just note that this PR (and all my other work with
>> > >>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>> > >>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish up a
>> > >>>>>> >>>>> paper.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > I was thinking that if we all can agree on the map format, then your
>> > >>>>>> >>>>> > polling daemon could be one userspace "client" for that, and the epping
>> > >>>>>> >>>>> > binary itself could be another; but we could keep compatibility between
>> > >>>>>> >>>>> > the two, so we don't duplicate effort.
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be plugged
>> > >>>>>> >>>>> > into the cpumap-tc code would be a good goal...
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Should probably do that...at some point. In general I think it's a bit
>> > >>>>>> >>>>> of an interesting problem to think about how to chain multiple XDP/tc
>> > >>>>>> >>>>> programs together in an efficient way. Most XDP and tc programs will do
>> > >>>>>> >>>>> some amount of packet parsing and when you have many chained programs
>> > >>>>>> >>>>> parsing the same packets this obviously becomes a bit wasteful. At the
>> > >>>>>> >>>>> same time it would be nice if one didn't need to manually merge
>> > >>>>>> >>>>> multiple programs together into a single one like this to get rid of
>> > >>>>>> >>>>> this duplicated parsing, or at least make that process of merging those
>> > >>>>>> >>>>> programs as simple as possible.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > -Toke
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> När du skickar e-post till Karlstads universitet behandlar vi dina personuppgifter<https://www.kau.se/gdpr>.
>> > >>>>>> >>>>> When you send an e-mail to Karlstad University, we will process your personal data<https://www.kau.se/en/gdpr>.
>> > >>>>>> >>>>
>> > >>>>>> >>>> _______________________________________________
>> > >>>>>> >>>> LibreQoS mailing list
>> > >>>>>> >>>> LibreQoS@lists.bufferbloat.net
>> > >>>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>> > >>>>>> >>>
>> > >>>>>> >>>
>> > >>>>>> >>>
>> > >>>>>> >>> --
>> > >>>>>> >>> Robert Chacón
>> > >>>>>> >>> CEO | JackRabbit Wireless LLC
>> > >>>>>> >
>> > >>>>>> > _______________________________________________
>> > >>>>>> > LibreQoS mailing list
>> > >>>>>> > LibreQoS@lists.bufferbloat.net
>> > >>>>>> > https://lists.bufferbloat.net/listinfo/libreqos
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> --
>> > >>>>>> This song goes out to all the folk that thought Stadia would work:
>> > >>>>>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> > >>>>>> Dave Täht CEO, TekLibre, LLC
>> > >>>>
>> > >>>> _______________________________________________
>> > >>>> LibreQoS mailing list
>> > >>>> LibreQoS@lists.bufferbloat.net
>> > >>>> https://lists.bufferbloat.net/listinfo/libreqos
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> Robert Chacón
>> > >>> CEO | JackRabbit Wireless LLC
>> > >>> _______________________________________________
>> > >>> LibreQoS mailing list
>> > >>> LibreQoS@lists.bufferbloat.net
>> > >>> https://lists.bufferbloat.net/listinfo/libreqos
>> > >
>> > > _______________________________________________
>> > > LibreQoS mailing list
>> > > LibreQoS@lists.bufferbloat.net
>> > > https://lists.bufferbloat.net/listinfo/libreqos
>> >
>> >
>> >
>> > --
>> > This song goes out to all the folk that thought Stadia would work:
>> > https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> > Dave Täht CEO, TekLibre, LLC
>>
>>
>>
>> --
>> This song goes out to all the folk that thought Stadia would work:
>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> Dave Täht CEO, TekLibre, LLC
>
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos



-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [LibreQoS] In BPF pping - so far
  2022-10-22 14:32                             ` Herbert Wolverson
  2022-10-22 14:44                               ` Dave Taht
@ 2022-10-22 14:47                               ` Robert Chacón
  1 sibling, 0 replies; 20+ messages in thread
From: Robert Chacón @ 2022-10-22 14:47 UTC (permalink / raw)
  To: Herbert Wolverson; +Cc: libreqos

[-- Attachment #1: Type: text/plain, Size: 31852 bytes --]

Awesome work! It's really amazing how little additional CPU the TCP
tracking adds. Super excited to start testing in production myself soon.
Have a great restful morning with your daughter. 😌

On Sat, Oct 22, 2022, 8:32 AM Herbert Wolverson via LibreQoS <
libreqos@lists.bufferbloat.net> wrote:

> This morning I tested cpumap-pping with live customers!
> A little over 1,200 mapped IP addresses, about 600 mbps of real traffic
> flowing through a big
> hierarchy of 52 sites. (600 is our "quiet time" traffic)
>
> It started very well: the updated xdp-cpumap system dropped in place and
> the system worked as
> before. xdp_pping started to show data with correct mappings. CPU load
> from the mapping
> system is within 1% of where it was before.
>
> After about 20 minutes of continuous execution, it started to run into
> some scaling issues.
> The shaping system continued to run wonderfully, and CPU load was still
> fine. However,
> it stopped reporting latency data! A bit of debugging showed that once you
> exceed
> 16,384 in-flight TCP streams it isn't handling the "map full" situation
> gracefully - and
> clearing the map from userspace isn't working correctly. So I hacked away
> and hacked
> away.
>
> Anyway, it turns out that it does in fact work fine at that scale. There's
> just a one-line
> bug in the xdp_pping.c file. I forgot to actually *call* one line of
> packet cleanup code.
> Adding that, and everything was awesome.
>
> The entire patch that fixed it consists of adding one line:
> cleanup_packet_ts(packet_ts);
>
> Oops.
>
> Anyway, with that in place it's running superbly. I did identify a couple
> of places in
> which it's being overly verbose with debug information, so I've patched
> that also.
>
> After reducing the overly eager warning about not being able to read a TCP
> header,
> CPU performance improved by another 2% on average.
>
> Longer-term (i.e. not on a Saturday morning, when I'd rather be playing
> with my
> daughter!), I think I'll look at raising some of the buffer sizes.
>
> Thanks,
> Herbert
>
> On Wed, Oct 19, 2022 at 11:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>
>> PS - today's (free) p99 conference is *REALLY AWESOME*.
>> https://www.p99conf.io/
>>
>> On Wed, Oct 19, 2022 at 9:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>> >
>> > flent outputs a flent.gz file that I can parse and plot 20 different
>> > ways. Also the graphing tools work on osx
>> >
>> > On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS
>> > <libreqos@lists.bufferbloat.net> wrote:
>> > >
>> > > That's true. The 12th gen does seem to have some "special"
>> features... makes for a nice writing platform
>> > > (this box is primarily my "write books and articles" machine). I'll
>> be doing a wider test on a more normal
>> > > platform, probably at the weekend (with real traffic, hence the delay
>> - have to find a time in which I
>> > > minimize disruption)
>> > >
>> > > On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:
>> > >>
>> > >> Those 'efficiency' threads in Intel 12th gen should probably be
>> addressed as well.  You can't turn them off in BIOS.
>> > >>
>> > >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <
>> libreqos@lists.bufferbloat.net> wrote:
>> > >>>
>> > >>> Awesome work on this!
>> > >>> I suspect there should be a slight performance bump once
>> Hyperthreading is disabled and efficient power management is off.
>> > >>> Hyperthreading/SMT always messes with HTB performance when I leave
>> it on. Thank you for mentioning that - I now went ahead and added
>> instructions on disabling hyperthreading on the Wiki for new users.
>> > >>> Super promising results!
>> > >>> Interested to see what throughput is with xdp-cpumap-tc vs
>> cpumap-pping. So far in your VM setup it seems to be doing very well.
>> > >>>
>> > >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <
>> libreqos@lists.bufferbloat.net> wrote:
>> > >>>>
>> > >>>> Also, I forgot to mention that I *think* the current version has
>> removed the requirement that the inbound
>> > >>>> and outbound classifiers be placed on the same CPU. I know
>> interduo was particularly keen on packing
>> > >>>> upload into fewer cores. I'll add that to my list of things to
>> test.
>> > >>>>
>> > >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <
>> herberticus@gmail.com> wrote:
>> > >>>>>
>> > >>>>> I'll definitely take a look - that does look interesting. I don't
>> have X11 on any of my test VMs, but
>> > >>>>> it looks like it can work without the GUI.
>> > >>>>>
>> > >>>>> Thanks!
>> > >>>>>
>> > >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com>
>> wrote:
>> > >>>>>>
>> > >>>>>> could I coax you to adopt flent?
>> > >>>>>>
>> > >>>>>> apt-get install flent netperf irtt fping
>> > >>>>>>
>> > >>>>>> You sometimes have to compile netperf yourself with
>> --enable-demo on
>> > >>>>>> some systems.
>> > >>>>>> There are a bunch of python libs needed for the gui, but only on
>> the client.
>> > >>>>>>
>> > >>>>>> Then you can run a really gnarly test series and plot the
>> results over time.
>> > >>>>>>
>> > >>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>> > >>>>>> the_server_name rrul # 110 other tests
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>> > >>>>>> <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >
>> > >>>>>> > Hey,
>> > >>>>>> >
>> > >>>>>> > Testing the current version (
>> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
>> than I hoped. This build has shared (not per-cpu) maps, and a userspace
>> daemon (xdp_pping) to extract and reset stats.
>> > >>>>>> >
>> > >>>>>> > My testing environment has grown a bit:
>> > >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
>> cpumap-pping-hackjob version of xdp-cpumap.
>> > >>>>>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an
>> iperf server.
>> > >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as
>> 100.64.1.2. Hosts an iperf client.
>> > >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as
>> 100.64.1.3. Hosts an iperf client.
>> > >>>>>> >
>> > >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of
>> ShaperVM are on a virtual switch.
>> > >>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are
>> on a different virtual switch.
>> > >>>>>> >
>> > >>>>>> > These are all on a host machine running Windows 11, a Core i7
>> 12th gen, 32 GB RAM and a fast SSD setup.
>> > >>>>>> >
>> > >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5 Gbit/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to
>> about 100 Mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> > * Set to use Cake
>> > >>>>>> >
>> > >>>>>> > On each client, roughly simultaneously run: iperf -c
>> 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct
>> results:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Or when I waited a while to gather/reset:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows no errors, just periodic logging that it is
>> recording data.  CPU is about 2-3% on two CPUs, zero on the others (as
>> expected).
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, each client reported
>> a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>> > >>>>>> >
>> > >>>>>> > So for smaller streams, I'd call this a success.
>> > >>>>>> >
>> > >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5 Gbit/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to
>> 5 Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> >
>> > >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same
>> time.
>> > >>>>>> >
>> > >>>>>> > xdp_pping shows results, too:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>> > >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>> > >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, the two clients reported
>> throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226
>> GBytes) respectively.
>> > >>>>>> >
>> > >>>>>> > Maxing out Hyper-V like this is inducing a bit of latency
>> (which is to be expected), but it's not bad. I also forgot to disable
>> hyperthreading, and looking at the host performance it sometimes schedules
>> the second virtual CPU onto an underpowered "fake" (efficiency) core.
>> > >>>>>> >
>> > >>>>>> > So for two large streams, I think we're doing pretty well also!
>> > >>>>>> >
>> > >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>> > >>>>>> >
>> > >>>>>> > This test is designed to try and blow things up. It's the same
>> as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5
>> and 1:6.
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were
>> idle. The pping stats start to show a bit of degradation in performance from
>> pounding it so hard:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" :
>> 24},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" :
>> 24},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > For whatever reason, it smoothed out over time:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" :
>> 50},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" :
>> 50},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client
>> achieved 2.22 Gbit/s, transferring over 129 GBytes of data.
>> > >>>>>> >
>> > >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>> > >>>>>> >
>> > >>>>>> > This test is also designed to break things. Same as test 3,
>> but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and
>> really tax the flow tracking. (Shorter time window because I really wanted
>> to go and find coffee)
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping
>> results show that this torture test is worsening performance, and there are
>> always lots of samples in the buffer:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" :
>> 49},
>> > >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" :
>> 49},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > This test also ran better than I expected. Each VM showed
>> around 2.4 Gbit/s in total performance at the end of the iperf session.
>> You can definitely see some latency creeping in as I make the system work
>> hard - that's expected, but I'm not sure I expected quite that much.
>> > >>>>>> >
>> > >>>>>> > WHAT'S NEXT & CONCLUSION
>> > >>>>>> >
>> > >>>>>> > I noticed that I forgot to turn off efficient power management
>> on my VMs and host, and left Hyperthreading on by mistake. So that hurts
>> overall performance.
>> > >>>>>> >
>> > >>>>>> > The base system seems to be working pretty solidly, at least
>> for small tests. Next up, I'll be removing the extraneous debug reporting
>> code and the code paths that exist only to report, and looking for
>> any small optimization opportunities. I'll then re-run these tests. Once
>> that's done, I hope to find a maintenance window on my WISP and try it with
>> actual traffic.
>> > >>>>>> >
>> > >>>>>> > I also need to re-run these tests without the pping system to
>> provide some before/after analysis.
>> > >>>>>> >
>> > >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <
>> herberticus@gmail.com> wrote:
>> > >>>>>> >>
>> > >>>>>> >> It's probably not entirely thread-safe right now (ran into
>> some issues reading per_cpu maps back from userspace; hopefully, I'll get
>> that figured out) - but the commits I just pushed have it basically working
>> on single-stream testing. :-)
>> > >>>>>> >>
>> > >>>>>> >> Set up cpumap as usual, and periodically run xdp-pping. This
>> gives you per-connection RTT information in JSON:
>> > >>>>>> >>
>> > >>>>>> >> [
>> > >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>> > >>>>>> >> {}]
>> > >>>>>> >>
>> > >>>>>> >> (With the extra {} because I'm not tracking the tail and
>> haven't done comma removal). The tool also empties the various maps used to
>> gather data, acting as a "reset" point. There's a max of 60 samples per
>> queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
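
For anyone who wants to consume that output programmatically, here is a
minimal Python sketch. It assumes the xdp_pping binary is on PATH and prints
exactly the format shown above, trailing empty object included. Since the
tool empties its maps on each run, every poll only returns the samples
gathered since the previous one.

import json
import subprocess
import time

def poll_once(cmd="xdp_pping"):
    # Run the userspace tool and parse its JSON array of per-queue stats.
    out = subprocess.run([cmd], capture_output=True, text=True, check=True).stdout
    entries = json.loads(out)
    # Drop the dangling "{}" placeholder and anything without a tc handle.
    return {e["tc"]: e for e in entries if e.get("tc")}

if __name__ == "__main__":
    while True:
        for handle, stats in sorted(poll_once().items()):
            print(f"{handle}: avg={stats['avg']} min={stats['min']} "
                  f"max={stats['max']} samples={stats['samples']}")
        time.sleep(10)  # each poll also acts as the reset point
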
>> > >>>>>> >>
>> > >>>>>> >> I'll start trying to test on a larger scale now.
>> > >>>>>> >>
>> > >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
>> robert.chacon@jackrabbitwireless.com> wrote:
>> > >>>>>> >>>
>> > >>>>>> >>> Hey Herbert,
>> > >>>>>> >>>
>> > >>>>>> >>> Fantastic work! Super exciting to see this coming together,
>> especially so quickly.
>> > >>>>>> >>> I'll test it soon.
>> > >>>>>> >>> I understand and agree with your decision to omit certain
>> features (ICMP tracking, DNS tracking, etc.) to optimize performance for our
>> use case. Like you said, merging the projects is sort of the only way to get
>> that functionality without a performance hit right now. Otherwise
>> there would be a lot of redundancy and lost throughput for an ISP's use.
>> Though hopefully long term there will be a way to keep all projects working
>> independently but interoperably with a plugin system of some kind.
>> > >>>>>> >>>
>> > >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3.
>> Focusing on optimizations for high sub counts (8000+ subs) as well as
>> stateful changes to the queue structure.
>> > >>>>>> >>> I'm working to set up a physical lab to test high throughput
>> and high client count scenarios.
>> > >>>>>> >>> When testing beyond ~32,000 filters we get "no space left on
>> device" from xdp-cpumap-tc, which I think relates to the bpf map size
>> limitation you mentioned. Maybe in the coming months we can take a look at
>> that.
>> > >>>>>> >>>
>> > >>>>>> >>> Anyway great work on the cpumap-pping program! Excited to
>> see more on this.
>> > >>>>>> >>>
>> > >>>>>> >>> Thanks,
>> > >>>>>> >>> Robert
>> > >>>>>> >>>
>> > >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via
>> LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >>>>
>> > >>>>>> >>>> Hey,
>> > >>>>>> >>>>
>> > >>>>>> >>>> My current (unfinished) progress on this is now available
>> here: https://github.com/thebracket/cpumap-pping-hackjob
>> > >>>>>> >>>>
>> > >>>>>> >>>> I mean it about the warnings: this isn't at all stable or
>> debugged - and I can't promise that it won't unleash the nasal demons
>> > >>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>> > >>>>>> >>>>
>> > >>>>>> >>>> With that said, I'm pretty happy so far:
>> > >>>>>> >>>>
>> > >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has
>> nicely shunted onto a dedicated CPU. It has to run on both
>> > >>>>>> >>>>   the inbound and outbound classifiers, since otherwise it
>> would only see half the conversation.
>> > >>>>>> >>>> * It does assume that your ingress and egress CPUs are
>> mapped to the same interface; I do that anyway in BracketQoS. Not doing
>> > >>>>>> >>>>   that opens up a potential world of pain, since writes to
>> the shared maps would require a locking scheme. Too much locking, and you
>> lose all of the benefit of using multiple CPUs to begin with.
>> > >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper
>> systems I've worked with have lots of it.
>> > >>>>>> >>>> * I've been gradually removing features that I don't want
>> for BracketQoS. A hypothetical future "useful to everyone" version wouldn't
>> do that.
>> > >>>>>> >>>> * Rate limiting is working, but I removed the requirement
>> for a shared configuration provided from userland - so right now it's
>> always set to report at 1 second intervals per stream.
>> > >>>>>> >>>>
>> > >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client"
>> and "world", and a "shaper" VM in between running a slightly hacked-up
>> LibreQoS.
>> > >>>>>> >>>> iperf from "client" to "world" (with Libre set to allow
>> 10 Gbit/s max, via a cake/HTB queue setup) is around 5 Gbit/s at present, on
>> my
>> > >>>>>> >>>> test PC (the host is a Core i7, 12th gen, 12 cores - 64 GB
>> RAM and fast SSDs)
>> > >>>>>> >>>>
>> > >>>>>> >>>> Output currently consists of debug messages reading:
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222:
>> bpf_trace_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239:
>> bpf_trace_printk: (tc) Send performance event (5,1), 374696
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466:
>> bpf_trace_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475:
>> bpf_trace_printk: (tc) Send performance event (5,1), 247069
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151:
>> bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248:
>> bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117:
>> bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255:
>> bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864:
>> bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664:
>> bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469:
>> bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126:
>> bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929:
>> bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048:
>> bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080:
>> bpf_trace_printk: (tc) Send flow event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714:
>> bpf_trace_printk: (tc) Send flow event
>> > >>>>>> >>>>
>> > >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
>> major/minor, allocated by the xdp-cpumap parent.
>> > >>>>>> >>>> I get pretty low latency between VMs; I'll set up a test
>> with some real-world data very soon.
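
Until the perfmap/userspace reporting is ported, those debug lines can be
followed straight off the kernel trace pipe with a rough Python sketch like
the one below. It assumes the trailing number on each "Send performance
event" line is the RTT sample in nanoseconds; that's a guess from the
magnitudes, not something confirmed here.

import re

TRACE_PIPE = "/sys/kernel/debug/tracing/trace_pipe"
EVENT = re.compile(r"\(tc\) Send performance event \((\d+),(\d+)\), (\d+)")

def follow(path=TRACE_PIPE):
    # Requires root and a mounted debugfs; trace_pipe blocks until data arrives.
    with open(path, "r") as pipe:
        for line in pipe:
            m = EVENT.search(line)
            if m:
                major, minor, value = m.groups()
                yield f"{major}:{minor}", int(value)

if __name__ == "__main__":
    for handle, rtt_ns in follow():
        print(f"{handle}  {rtt_ns / 1e6:.3f} ms")
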
>> > >>>>>> >>>>
>> > >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>> > >>>>>> >>>>
>> > >>>>>> >>>> Thanks,
>> > >>>>>> >>>> Herbert
>> > >>>>>> >>>>
>> > >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
>> Simon.Sundberg@kau.se> wrote:
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a
>> couple of quick
>> > >>>>>> >>>>> notes.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen
>> wrote:
>> > >>>>>> >>>>> > [ Adding Simon to Cc ]
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Herbert Wolverson via LibreQoS <
>> libreqos@lists.bufferbloat.net> writes:
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > > Hey,
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I've had some pretty good success with merging
>> xdp-pping (
>> > >>>>>> >>>>> > >
>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>> > >>>>>> >>>>> > > into xdp-cpumap-tc (
>> https://github.com/xdp-project/xdp-cpumap-tc ).
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then
>> changed the entry point
>> > >>>>>> >>>>> > > and packet parsing code to make use of the work
>> already done in
>> > >>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the
>> packet, no need to do
>> > >>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps,
>> and had to pin them -
>> > >>>>>> >>>>> > > otherwise the two tc instances don't properly share
>> data.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> I guess xdp-cpumap-tc ensures that the same flow is
>> processed on
>> > >>>>>> >>>>> the same CPU core on both ingress and egress. Otherwise, if
>> a flow may
>> > >>>>>> >>>>> be processed by different cores on ingress and egress the
>> per-CPU maps
>> > >>>>>> >>>>> will not really work reliably as each core will have a
>> different view
>> > >>>>>> >>>>> on the state of the flow, if there's been a previous
>> packet with a
>> > >>>>>> >>>>> certain TSval from that flow etc.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Furthermore, if a flow is always processed on the same
>> core (on both
>> > >>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit
>> wasteful of
>> > >>>>>> >>>>> memory. From my understanding the keys for per-CPU maps
>> are still
>> > >>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its
>> own value. So
>> > >>>>>> >>>>> all CPUs will then have their own data for each flow, but
>> it's only the
>> > >>>>>> >>>>> CPU processing the flow that will have any relevant data
>> for the flow
>> > >>>>>> >>>>> while the remaining CPUs will just have an empty state for
>> that flow.
>> > >>>>>> >>>>> Under the same assumption that packets within the same
>> flow are always
>> > >>>>>> >>>>> processed on the same core there should generally not be
>> any
>> > >>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map
>> either, as packets
>> > >>>>>> >>>>> from the same flow cannot be processed concurrently then
>> (and thus no
>> > >>>>>> >>>>> concurrent access to the same value in the map). I am
>> however still
>> > >>>>>> >>>>> very unclear on whether there's any considerable performance
>> impact between
>> > >>>>>> >>>>> global and per-CPU map versions if the same key is not
>> accessed
>> > >>>>>> >>>>> concurrently.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > > Right now, output
>> > >>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap
>> output code. Instead,
>> > >>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug
>> pipe, so I can see
>> > >>>>>> >>>>> > > roughly what the output would look like.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > With debug enabled and just logging I'm now getting
>> about 4.9 Gbits/sec on
>> > >>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM
>> in the middle). :-)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest
>> source of
>> > >>>>>> >>>>> > overhead, then. What Simon found was that sending the
>> data from kernel
>> > >>>>>> >>>>> > to userspace is one of the most expensive bits of
>> epping, at least when
>> > >>>>>> >>>>> > the number of data points goes up (which it does as
>> additional flows are
>> > >>>>>> >>>>> > added).
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Yeah, reporting individual RTTs when there are lots of them
>> (you may get
>> > >>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic
>> in terms of
>> > >>>>>> >>>>> direct overhead from the tool itself, but also becomes
>> demanding for
>> > >>>>>> >>>>> whatever you use all those RTT samples for (i.e. you need to
>> log, parse,
>> > >>>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal
>> with that is
>> > >>>>>> >>>>> of course to just apply some sort of sampling (the
>> -r/--rate-limit and
>> > >>>>>> >>>>> -R/--rtt-rate options).
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > > So my question: how would you prefer to receive this
>> data? I'll have to
>> > >>>>>> >>>>> > > write a daemon that provides userspace control
>> (periodic cleanup as well as
>> > >>>>>> >>>>> > > reading the performance stream), so the world's kinda
>> our oyster. I can
>> > >>>>>> >>>>> > > stick to Kathie's original format (and dump it to a
>> named pipe, perhaps?),
>> > >>>>>> >>>>> > > a condensed format that only shows what you want to
>> use, an efficient
>> > >>>>>> >>>>> > > binary format if you feel like parsing that...
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > It would be great if we could combine efforts a bit here
>> so we don't
>> > >>>>>> >>>>> > fork the codebase more than we have to. I.e., if
>> "upstream" epping and
>> > >>>>>> >>>>> > whatever daemon you end up writing can agree on data
>> format etc that
>> > >>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this
>> :)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Briefly what I've discussed before with Simon was to
>> have the ability to
>> > >>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and
>> have a userspace
>> > >>>>>> >>>>> > utility periodically pull them out. What we discussed
>> was doing this
>> > >>>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea
>> would be that
>> > >>>>>> >>>>> > userspace would populate the LPM map with the keys
>> (prefixes) they
>> > >>>>>> >>>>> > wanted statistics for (in LibreQOS context that could be
>> one key per
>> > >>>>>> >>>>> > customer, for instance). Epping would then do a map
>> lookup into the LPM,
>> > >>>>>> >>>>> > and if it gets a match it would update the statistics in
>> that map entry
>> > >>>>>> >>>>> > (keeping a histogram of latency values seen, basically).
>> Simon's PR
>> > >>>>>> >>>>> > below uses this technique where userspace will "reset"
>> the histogram
>> > >>>>>> >>>>> > every time it loads it by swapping out two different map
>> entries when it
>> > >>>>>> >>>>> > does a read; this allows you to control the sampling
>> rate from
>> > >>>>>> >>>>> > userspace, and you'll just get the data since the last
>> time you polled.
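
As a userspace-only model of that idea - one histogram per configured prefix
("customer"), with a read-and-reset step standing in for the map swap - here
is a small Python sketch. The prefix, bucket boundaries and class name are
made up for illustration; the real lookup and aggregation would of course
live in the BPF maps.

import ipaddress
from collections import defaultdict

BUCKET_MS = [1, 2, 5, 10, 20, 50, 100, 200, float("inf")]

class RttAggregator:
    def __init__(self, prefixes):
        # Longest prefix wins, so try the most specific networks first.
        self.prefixes = sorted((ipaddress.ip_network(p) for p in prefixes),
                               key=lambda n: n.prefixlen, reverse=True)
        self.hist = defaultdict(lambda: [0] * len(BUCKET_MS))

    def record(self, ip, rtt_ms):
        addr = ipaddress.ip_address(ip)
        for net in self.prefixes:          # stand-in for the LPM map lookup
            if addr in net:
                for i, limit in enumerate(BUCKET_MS):
                    if rtt_ms <= limit:
                        self.hist[net][i] += 1
                        break
                return

    def read_and_reset(self):
        # Stand-in for swapping the "active" map entry out for a fresh one.
        snapshot = dict(self.hist)
        self.hist = defaultdict(lambda: [0] * len(BUCKET_MS))
        return snapshot

if __name__ == "__main__":
    agg = RttAggregator(["100.64.1.0/24"])
    for sample in (3.1, 4.7, 12.0):
        agg.record("100.64.1.2", sample)
    print(agg.read_and_reset())

Polling read_and_reset() on a fixed interval then gives per-customer latency
distributions without shipping every individual RTT sample to userspace,
which is the cost Simon describes above.
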
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the
>> plan going
>> > >>>>>> >>>>> forward. I will just note that this PR (and all my other
>> work with
>> > >>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be
>> more or less
>> > >>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to
>> finish up a
>> > >>>>>> >>>>> paper.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > I was thinking that if we all can agree on the map
>> format, then your
>> > >>>>>> >>>>> > polling daemon could be one userspace "client" for that,
>> and the epping
>> > >>>>>> >>>>> > binary itself could be another; but we could keep
>> compatibility between
>> > >>>>>> >>>>> > the two, so we don't duplicate effort.
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it
>> can be plugged
>> > >>>>>> >>>>> > into the cpumap-tc code would be a good goal...
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Should probably do that...at some point. In general I
>> think it's a bit
>> > >>>>>> >>>>> of an interesting problem to think about how to chain
>> multiple XDP/tc
>> > >>>>>> >>>>> programs together in an efficent way. Most XDP and tc
>> programs will do
>> > >>>>>> >>>>> some amount of packet parsing and when you have many
>> chained programs
>> > >>>>>> >>>>> parsing the same packets this obviously becomes a bit
>> wasteful. At the
>> > >>>>>> >>>>> same time it would be nice if one didn't need to manually
>> merge
>> > >>>>>> >>>>> multiple programs together into a single one like this to
>> get rid of
>> > >>>>>> >>>>> this duplicated parsing, or at least make that process of
>> merging those
>> > >>>>>> >>>>> programs as simple as possible.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > -Toke
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> När du skickar e-post till Karlstads universitet behandlar
>> vi dina personuppgifter<https://www.kau.se/gdpr>.
>> > >>>>>> >>>>> When you send an e-mail to Karlstad University, we will
>> process your personal data<https://www.kau.se/en/gdpr>.
>> > >>>>>> >>>>
>> > >>>>>> >>>> _______________________________________________
>> > >>>>>> >>>> LibreQoS mailing list
>> > >>>>>> >>>> LibreQoS@lists.bufferbloat.net
>> > >>>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>> > >>>>>> >>>
>> > >>>>>> >>>
>> > >>>>>> >>>
>> > >>>>>> >>> --
>> > >>>>>> >>> Robert Chacón
>> > >>>>>> >>> CEO | JackRabbit Wireless LLC
>> > >>>>>> >
>> > >>>>>> > _______________________________________________
>> > >>>>>> > LibreQoS mailing list
>> > >>>>>> > LibreQoS@lists.bufferbloat.net
>> > >>>>>> > https://lists.bufferbloat.net/listinfo/libreqos
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> --
>> > >>>>>> This song goes out to all the folk that thought Stadia would
>> work:
>> > >>>>>>
>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> > >>>>>> Dave Täht CEO, TekLibre, LLC
>> > >>>>
>> > >>>> _______________________________________________
>> > >>>> LibreQoS mailing list
>> > >>>> LibreQoS@lists.bufferbloat.net
>> > >>>> https://lists.bufferbloat.net/listinfo/libreqos
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> Robert Chacón
>> > >>> CEO | JackRabbit Wireless LLC
>> > >>> _______________________________________________
>> > >>> LibreQoS mailing list
>> > >>> LibreQoS@lists.bufferbloat.net
>> > >>> https://lists.bufferbloat.net/listinfo/libreqos
>> > >
>> > > _______________________________________________
>> > > LibreQoS mailing list
>> > > LibreQoS@lists.bufferbloat.net
>> > > https://lists.bufferbloat.net/listinfo/libreqos
>> >
>> >
>> >
>> > --
>> > This song goes out to all the folk that thought Stadia would work:
>> >
>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> > Dave Täht CEO, TekLibre, LLC
>>
>>
>>
>> --
>> This song goes out to all the folk that thought Stadia would work:
>>
>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> Dave Täht CEO, TekLibre, LLC
>>
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos
>

[-- Attachment #2: Type: text/html, Size: 48903 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-10-22 14:47 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-16  1:59 [LibreQoS] In BPF pping - so far Herbert Wolverson
2022-10-16  2:26 ` Robert Chacón
2022-10-17 14:13 ` Toke Høiland-Jørgensen
2022-10-17 14:59   ` Herbert Wolverson
2022-10-17 15:14   ` Simon Sundberg
2022-10-17 18:45     ` Herbert Wolverson
2022-10-17 20:33       ` Robert Chacón
2022-10-18 18:01         ` Herbert Wolverson
2022-10-19 13:44           ` Herbert Wolverson
2022-10-19 13:58             ` Dave Taht
2022-10-19 14:01               ` Herbert Wolverson
2022-10-19 14:05                 ` Herbert Wolverson
2022-10-19 14:48                   ` Robert Chacón
2022-10-19 15:49                     ` dan
2022-10-19 16:10                       ` Herbert Wolverson
2022-10-19 16:13                         ` Dave Taht
2022-10-19 16:13                           ` Dave Taht
2022-10-22 14:32                             ` Herbert Wolverson
2022-10-22 14:44                               ` Dave Taht
2022-10-22 14:47                               ` Robert Chacón

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox