[LibreQoS] In BPF pping - so far

Herbert Wolverson herberticus at gmail.com
Wed Oct 19 10:01:49 EDT 2022


I'll definitely take a look - that does look interesting. I don't have X11
on any of my test VMs, but
it looks like it can work without the GUI.

Thanks!

On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht at gmail.com> wrote:

> could I coax you to adopt flent?
>
> apt-get install flent netperf irtt fping
>
> You sometimes have to compile netperf yourself with --enable-demo on
> some systems.
> There are a bunch of Python libs needed for the GUI, but only on the client.
>
> Then you can run a really gnarly test series and plot the results over
> time.
>
> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
> the_server_name rrul # 110 other tests
>
>
> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
> <libreqos at lists.bufferbloat.net> wrote:
> >
> > Hey,
> >
> > Testing the current version (
> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
> than I hoped. This build has shared (not per-cpu) maps, and a userspace
> daemon (xdp_pping) to extract and reset stats.
> >
> > My testing environment has grown a bit:
> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
> cpumap-pping-hackjob version of xdp-cpumap.
> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts the iperf
> server.
> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hosts an
> iperf client.
> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hosts an
> iperf client.
> >
> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on
> a virtual switch.
> > ExtTest and the other interface (WAN facing) of ShaperVM are on a
> different virtual switch.
> >
> > These are all on a host machine running Windows 11, with a 12th-gen Core i7,
> 32 GB RAM, and a fast SSD setup.
> >
> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
> >
> > For this test, LibreQoS is configured:
> > * Two APs, each with a 5 Gbit/s max.
> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about
> 100 Mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> > * Set to use Cake
> >
> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500
> (for a long run). Running xdp_pping yields correct results:
> >
> > [
> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> > {}]
> >
> > Or when I waited a while to gather/reset:
> >
> > [
> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> > {}]
> >
> > The ShaperVM shows no errors, just periodic logging that it is recording
> data.  CPU is about 2-3% on two CPUs, zero on the others (as expected).
> >
> > After 500 seconds of continual iperfing, each client reported a
> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
> >
> > So for smaller streams, I'd call this a success.
> >
> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
> >
> > For this test, LibreQoS is configured:
> > * Two APs, each with a 5 Gbit/s max.
> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5 Gbit/s!
> Mapped to 1:5 and 2:5 respectively (separate CPUs).
> >
> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
> >
> > xdp_pping shows results, too:
> >
> > [
> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> > {}]
> >
> > [
> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> > {}]
> >
> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
> >
> > After 500 seconds of continual iperfing, the clients reported throughputs
> of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes).
> >
> > Maxing out Hyper-V like this induces a bit of latency (which is to be
> expected), but it's not bad. I also forgot to disable hyperthreading, and
> looking at the host performance, it sometimes schedules the second virtual
> CPU on an underpowered "fake" (hyperthreaded) core.
> >
> > So for two large streams, I think we're doing pretty well also!
> >
> > TEST 3: DUAL STREAMS, SINGLE CPU
> >
> > This test is designed to try and blow things up. It's the same as test
> 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
> >
> > ShaperVM CPU1 maxed out in the high 90s; the other CPUs were idle. The
> pping stats start to show a bit of performance degradation from pounding
> it so hard:
> >
> > [
> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> > {}]
> >
> > For whatever reason, it smoothed out over time:
> >
> > [
> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> > {}]
> >
> > Surprisingly (to me), I didn't encounter errors. Each client achieved
> 2.22 Gbit/s, transferring over 129 GBytes of data.
> >
> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
> >
> > This test is also designed to break things. Same as test 3, but using
> iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the
> flow tracking. (Shorter time window because I really wanted to go and find
> coffee)
> >
> > ShaperVM CPU sat at around 80-97%, tending towards 97%. The pping results
> show that this torture test is worsening performance, and there are always
> lots of samples in the buffer:
> >
> > [
> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
> > {}]
> >
> > This test also ran better than I expected. Each VM showed around
> 2.4 Gbit/s of total throughput at the end of the iperf session. You can
> definitely see latency creeping in as the system is worked hard - some of
> that is expected, but I'm not sure I expected quite that much.
> >
> > WHAT'S NEXT & CONCLUSION
> >
> > I noticed that I forgot to turn off efficient power management on my VMs
> and host, and left Hyperthreading on by mistake. So that hurts overall
> performance.
> >
> > The base system seems to be working pretty solidly, at least for small
> tests. Next up, I'll be removing extraneous debug reporting code, removing
> some code paths that don't do anything but report, and looking for any
> small optimization opportunities. I'll then re-run these tests. Once that's
> done, I hope to find a maintenance window on my WISP and try it with actual
> traffic.
> >
> > I also need to re-run these tests without the pping system to provide
> some before/after analysis.
> >
> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus at gmail.com>
> wrote:
> >>
> >> It's probably not entirely thread-safe right now (ran into some issues
> reading per_cpu maps back from userspace; hopefully, I'll get that figured
> out) - but the commits I just pushed have it basically working on
> single-stream testing. :-)
> >>
> >> Setup cpumap as usual, and periodically run xdp-pping. This gives you
> per-connection RTT information in JSON:
> >>
> >> [
> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
> >> {}]
> >>
> >> (With the extra {} because I'm not tracking the tail and haven't done
> comma removal). The tool also empties the various maps used to gather data,
> acting as a "reset" point. There's a max of 60 samples per queue, in a
> ringbuffer setup (so newest will start to overwrite the oldest).
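> >>
> >> To make that 60-sample ring buffer concrete, here is a minimal sketch of
> >> how such a per-queue map could look on the BPF side. This is illustrative
> >> only - the names, sizes and layout are mine, not the actual
> >> cpumap-pping-hackjob code:
> >>
> >> #include <linux/bpf.h>
> >> #include <bpf/bpf_helpers.h>
> >>
> >> #define MAX_SAMPLES 60
> >>
> >> struct rtt_ring {
> >>     __u32 next;              /* next slot to write; wraps at MAX_SAMPLES */
> >>     __u32 rtt[MAX_SAMPLES];  /* RTT samples in microseconds */
> >> };
> >>
> >> struct {
> >>     __uint(type, BPF_MAP_TYPE_HASH);
> >>     __uint(max_entries, 16384);
> >>     __type(key, __u32);                   /* tc handle, e.g. 1:5 packed */
> >>     __type(value, struct rtt_ring);
> >>     __uint(pinning, LIBBPF_PIN_BY_NAME);  /* shared by both tc programs */
> >> } rtt_tracker SEC(".maps");
> >>
> >> static __always_inline void record_rtt(__u32 tc_handle, __u32 rtt_us)
> >> {
> >>     struct rtt_ring *ring = bpf_map_lookup_elem(&rtt_tracker, &tc_handle);
> >>
> >>     if (!ring) {
> >>         struct rtt_ring fresh = { 0 };
> >>         fresh.rtt[0] = rtt_us;
> >>         fresh.next = 1;
> >>         bpf_map_update_elem(&rtt_tracker, &tc_handle, &fresh, BPF_ANY);
> >>         return;
> >>     }
> >>
> >>     __u32 idx = ring->next;
> >>     if (idx >= MAX_SAMPLES)       /* keeps the verifier happy about bounds */
> >>         idx = 0;
> >>     ring->rtt[idx] = rtt_us;      /* the oldest sample gets overwritten */
> >>     ring->next = (idx + 1) % MAX_SAMPLES;
> >> }
> >>
> >> The userspace "reset" then just amounts to reading these entries and
> >> clearing them, which is roughly what the xdp_pping gather step does.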
> >>
> >> I'll start trying to test on a larger scale now.
> >>
> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <
> robert.chacon at jackrabbitwireless.com> wrote:
> >>>
> >>> Hey Herbert,
> >>>
> >>> Fantastic work! Super exciting to see this coming together, especially
> so quickly.
> >>> I'll test it soon.
> >>> I understand and agree with your decision to omit certain features
> (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case.
> Like you said, merging the functionality is sort of the only way to avoid a
> performance hit right now. Otherwise there would be a lot of redundancy and
> lost throughput for an ISP's use. Though hopefully, long term, there will be
> a way to keep all the projects working independently but interoperably with
> a plugin system of some kind.
> >>>
> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
> optimizations for high sub counts (8000+ subs) as well as stateful changes
> to the queue structure.
> >>> I'm working to set up a physical lab to test high throughput and high
> client count scenarios.
> >>> When testing beyond ~32,000 filters we get "no space left on device"
> from xdp-cpumap-tc, which I think relates to the bpf map size limitation
> you mentioned. Maybe in the coming months we can take a look at that.
> >>>
> >>> Anyway great work on the cpumap-pping program! Excited to see more on
> this.
> >>>
> >>> Thanks,
> >>> Robert
> >>>
> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
> libreqos at lists.bufferbloat.net> wrote:
> >>>>
> >>>> Hey,
> >>>>
> >>>> My current (unfinished) progress on this is now available here:
> https://github.com/thebracket/cpumap-pping-hackjob
> >>>>
> >>>> I mean it about the warnings: this isn't at all stable or debugged -
> and I can't promise that it won't unleash the nasal demons
> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
> >>>>
> >>>> With that said, I'm pretty happy so far:
> >>>>
> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
> shunted onto a dedicated CPU. It has to run on both
> >>>>   the inbound and outbound classifiers, since otherwise it would only
> see half the conversation.
> >>>> * It does assume that your ingress and egress CPUs are mapped to the
> same interface; I do that anyway in BracketQoS. Not doing
> >>>>   that opens up a potential world of pain, since writes to the shared
> maps would require a locking scheme. Too much locking, and you lose all of
> the benefit of using multiple CPUs to begin with.
> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've
> worked with have lots of it.
> >>>> * I've been gradually removing features that I don't want for
> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do
> that.
> >>>> * Rate limiting is working, but I removed the requirement for a
> shared configuration provided from userland - so right now it's always set
> to report at 1 second intervals per stream.
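> >>>>
> >>>> To illustrate those last two points (a shared, pinned map and the fixed
> >>>> 1-second reporting interval), here is a rough sketch of how that kind of
> >>>> per-stream throttle can be expressed in BPF. The names and the flow key
> >>>> are made up for illustration; the real code differs:
> >>>>
> >>>> #include <linux/bpf.h>
> >>>> #include <bpf/bpf_helpers.h>
> >>>>
> >>>> #define REPORT_INTERVAL_NS 1000000000ULL  /* report at most once per second */
> >>>>
> >>>> struct flow_key {                         /* hypothetical flow identifier */
> >>>>     __u32 saddr;
> >>>>     __u32 daddr;
> >>>>     __u16 sport;
> >>>>     __u16 dport;
> >>>> };
> >>>>
> >>>> struct {
> >>>>     __uint(type, BPF_MAP_TYPE_HASH);      /* shared, not per-CPU */
> >>>>     __uint(max_entries, 65536);
> >>>>     __type(key, struct flow_key);
> >>>>     __type(value, __u64);                 /* last report time in ns */
> >>>>     __uint(pinning, LIBBPF_PIN_BY_NAME);  /* visible to both tc programs */
> >>>> } last_report SEC(".maps");
> >>>>
> >>>> static __always_inline int should_report(struct flow_key *key)
> >>>> {
> >>>>     __u64 now = bpf_ktime_get_ns();
> >>>>     __u64 *last = bpf_map_lookup_elem(&last_report, key);
> >>>>
> >>>>     if (!last || now - *last >= REPORT_INTERVAL_NS) {
> >>>>         bpf_map_update_elem(&last_report, key, &now, BPF_ANY);
> >>>>         return 1;     /* emit a performance event for this stream */
> >>>>     }
> >>>>     return 0;         /* already reported within the last second */
> >>>> }
> >>>>
> >>>> Because both directions of a flow are handled on the same CPU (per the
> >>>> assumption above), a plain hash map like this gets by without a locking
> >>>> scheme.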
> >>>>
> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and
> "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
> >>>> iperf from "client" to "world" (with Libre set to allow 10 Gbit/s max,
> via a cake/HTB queue setup) is around 5 Gbit/s at present, on my
> >>>> test PC (the host is a 12th-gen Core i7 with 12 cores - 64 GB RAM and
> fast SSDs).
> >>>>
> >>>> Output currently consists of debug messages reading:
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk:
> (tc) Flow open event
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk:
> (tc) Send performance event (5,1), 374696
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk:
> (tc) Flow open event
> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk:
> (tc) Send performance event (5,1), 247069
> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk:
> (tc) Send performance event (5,1), 5217155
> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk:
> (tc) Send performance event (5,1), 4515394
> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk:
> (tc) Send performance event (5,1), 4481289
> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk:
> (tc) Send performance event (5,1), 4255268
> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk:
> (tc) Send performance event (5,1), 5249493
> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk:
> (tc) Send performance event (5,1), 3795993
> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk:
> (tc) Send performance event (5,1), 3949519
> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk:
> (tc) Send performance event (5,1), 4365335
> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk:
> (tc) Send performance event (5,1), 4154910
> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk:
> (tc) Send performance event (5,1), 4405582
> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk:
> (tc) Send flow event
> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk:
> (tc) Send flow event
> >>>>
> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
> major/minor, allocated by the xdp-cpumap parent.
> >>>> I get pretty low latency between VMs; I'll set up a test with some
> real-world data very soon.
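> >>>>
> >>>> (Aside, for anyone following along: a tc handle is just a 32-bit value
> >>>> with the major number in the upper 16 bits and the minor in the lower 16,
> >>>> so the (5,1) above is handle 5:1. A tiny standalone illustration, using
> >>>> the same masks as TC_H_MAJ()/TC_H_MIN() in linux/pkt_sched.h:)
> >>>>
> >>>> #include <stdint.h>
> >>>> #include <stdio.h>
> >>>>
> >>>> int main(void)
> >>>> {
> >>>>     uint32_t handle = 0x00050001;  /* tc class 5:1 */
> >>>>     printf("major %u, minor %u\n",
> >>>>            (handle & 0xFFFF0000U) >> 16, handle & 0x0000FFFFU);
> >>>>     return 0;
> >>>> }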
> >>>>
> >>>> I plan to keep hacking away, but feel free to take a peek.
> >>>>
> >>>> Thanks,
> >>>> Herbert
> >>>>
> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
> Simon.Sundberg at kau.se> wrote:
> >>>>>
> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
> >>>>> notes.
> >>>>>
> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> >>>>> > [ Adding Simon to Cc ]
> >>>>> >
> >>>>> > Herbert Wolverson via LibreQoS <libreqos at lists.bufferbloat.net>
> writes:
> >>>>> >
> >>>>> > > Hey,
> >>>>> > >
> >>>>> > > I've had some pretty good success with merging xdp-pping (
> >>>>> > >
> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> >>>>> > > into xdp-cpumap-tc (
> https://github.com/xdp-project/xdp-cpumap-tc ).
> >>>>> > >
> >>>>> > > I ported over most of the xdp-pping code, and then changed the
> entry point
> >>>>> > > and packet parsing code to make use of the work already done in
> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no
> need to do
> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to
> pin them -
> >>>>> > > otherwise the two tc instances don't properly share data.
> >>>>> > >
> >>>>>
> >>>>> I guess xdp-cpumap-tc ensures that the same flow is processed on
> >>>>> the same CPU core on both ingress and egress. Otherwise, if a flow may
> >>>>> be processed by different cores on ingress and egress, the per-CPU
> maps
> >>>>> will not really work reliably, as each core will have a different view
> >>>>> of the state of the flow (whether there's been a previous packet with a
> >>>>> certain TSval from that flow, etc.).
> >>>>>
> >>>>> Furthermore, if a flow is always processed on the same core (on both
> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
> >>>>> memory. From my understanding the keys for per-CPU maps are still
> >>>>> shared across all CPUs, it's just that each CPU gets its own value.
> So
> >>>>> all CPUs will then have their own data for each flow, but it's only
> the
> >>>>> CPU processing the flow that will have any relevant data for the flow
> >>>>> while the remaining CPUs will just have an empty state for that flow.
> >>>>> Under the same assumption that packets within the same flow are
> always
> >>>>> processed on the same core there should generally not be any
> >>>>> concurrency issues with having a global (non-per-CPU) either as
> packets
> >>>>> from the same flow cannot be processed concurrently then (and thus no
> >>>>> concurrent access to the same value in the map). I am however still
> >>>>> very unclear on if there's any considerable performance impact
> between
> >>>>> global and per-CPU map versions if the same key is not accessed
> >>>>> concurrently.
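> >>>>>
> >>>>> (To make that comparison concrete, here is a minimal sketch of the two
> >>>>> map flavours side by side - the field names are placeholders, not the
> >>>>> actual ePPing flow state:)
> >>>>>
> >>>>> #include <linux/bpf.h>
> >>>>> #include <bpf/bpf_helpers.h>
> >>>>>
> >>>>> struct flow_state {
> >>>>>     __u32 last_tsval;    /* placeholder fields */
> >>>>>     __u64 last_seen_ns;
> >>>>> };
> >>>>>
> >>>>> /* One value per key, visible to every CPU. Safe without locking as long
> >>>>>  * as all packets of a flow stay on one CPU for both ingress and egress. */
> >>>>> struct {
> >>>>>     __uint(type, BPF_MAP_TYPE_HASH);
> >>>>>     __uint(max_entries, 65536);
> >>>>>     __type(key, __u64);              /* flow identifier */
> >>>>>     __type(value, struct flow_state);
> >>>>> } flow_state_global SEC(".maps");
> >>>>>
> >>>>> /* One value per key *per CPU*: with N CPUs, N copies of flow_state per
> >>>>>  * flow, even though only the CPU handling that flow ever fills its copy
> >>>>>  * in. */
> >>>>> struct {
> >>>>>     __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
> >>>>>     __uint(max_entries, 65536);
> >>>>>     __type(key, __u64);
> >>>>>     __type(value, struct flow_state);
> >>>>> } flow_state_percpu SEC(".maps");
> >>>>>
> >>>>> Both definitions look the same from the program's point of view; the
> >>>>> memory cost difference only shows up in how the values are allocated.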
> >>>>>
> >>>>> > > Right now, output
> >>>>> > > is just stubbed - I've still got to port the perfmap output
> code. Instead,
> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I
> can see
> >>>>> > > roughly what the output would look like.
> >>>>> > >
> >>>>> > > With debug enabled and just logging I'm now getting about 4.9
> Gbits/sec on
> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the
> middle). :-)
> >>>>> >
> >>>>> > Just FYI, that "just logging" is probably the biggest source of
> >>>>> > overhead, then. What Simon found was that sending the data from
> kernel
> >>>>> > to userspace is one of the most expensive bits of epping, at least
> when
> >>>>> > the number of data points goes up (which it does as additional
> flows are
> >>>>> > added).
> >>>>>
> >>>>> Yeah, reporting individual RTTs when there are lots of them (you may
> get
> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
> >>>>> direct overhead from the tool itself, but also becomes demanding for
> >>>>> whatever you use all those RTT samples for (i.e. you need to log, parse,
> >>>>> analyze, etc. a very large amount of RTTs). One way to deal with that
> is
> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit
> and
> >>>>> -R/--rtt-rate options).
> >>>>> >
> >>>>> > > So my question: how would you prefer to receive this data? I'll
> have to
> >>>>> > > write a daemon that provides userspace control (periodic cleanup
> as well as
> >>>>> > > reading the performance stream), so the world's kinda our
> oyster. I can
> >>>>> > > stick to Kathie's original format (and dump it to a named pipe,
> perhaps?),
> >>>>> > > a condensed format that only shows what you want to use, an
> efficient
> >>>>> > > binary format if you feel like parsing that...
> >>>>> >
> >>>>> > It would be great if we could combine efforts a bit here so we
> don't
> >>>>> > fork the codebase more than we have to. I.e., if "upstream" epping
> and
> >>>>> > whatever daemon you end up writing can agree on data format etc
> that
> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
> >>>>> >
> >>>>> > Briefly what I've discussed before with Simon was to have the
> ability to
> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a
> userspace
> >>>>> > utility periodically pull them out. What we discussed was doing
> this
> >>>>> > using an LPM map (which is not in that PR yet). The idea would be
> that
> >>>>> > userspace would populate the LPM map with the keys (prefixes) they
> >>>>> > wanted statistics for (in LibreQOS context that could be one key
> per
> >>>>> > customer, for instance). Epping would then do a map lookup into
> the LPM,
> >>>>> > and if it gets a match it would update the statistics in that map
> entry
> >>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
> >>>>> > below uses this technique where userspace will "reset" the
> histogram
> >>>>> > every time it loads it by swapping out two different map entries
> when it
> >>>>> > does a read; this allows you to control the sampling rate from
> >>>>> > userspace, and you'll just get the data since the last time you
> polled.
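> >>>>> >
> >>>>> > (A rough sketch of that aggregation idea, to make it concrete - this
> >>>>> > is illustrative only, not the code in the PR; the key/value layout and
> >>>>> > the bucketing are placeholders:)
> >>>>> >
> >>>>> > #include <linux/bpf.h>
> >>>>> > #include <bpf/bpf_helpers.h>
> >>>>> >
> >>>>> > #define RTT_BUCKETS 32
> >>>>> >
> >>>>> > struct prefix_key {
> >>>>> >     __u32 prefixlen;       /* LPM trie keys must start with prefixlen */
> >>>>> >     __u32 addr;            /* IPv4 address, network byte order */
> >>>>> > };
> >>>>> >
> >>>>> > struct rtt_hist {
> >>>>> >     __u64 bucket[RTT_BUCKETS];   /* log2-spaced RTT bins */
> >>>>> > };
> >>>>> >
> >>>>> > struct {
> >>>>> >     __uint(type, BPF_MAP_TYPE_LPM_TRIE);
> >>>>> >     __uint(map_flags, BPF_F_NO_PREALLOC);  /* required for LPM tries */
> >>>>> >     __uint(max_entries, 16384);            /* ~one entry per customer */
> >>>>> >     __type(key, struct prefix_key);
> >>>>> >     __type(value, struct rtt_hist);
> >>>>> > } rtt_by_prefix SEC(".maps");
> >>>>> >
> >>>>> > static __always_inline void aggregate_rtt(__u32 daddr, __u32 rtt_us)
> >>>>> > {
> >>>>> >     struct prefix_key key = { .prefixlen = 32, .addr = daddr };
> >>>>> >     struct rtt_hist *hist = bpf_map_lookup_elem(&rtt_by_prefix, &key);
> >>>>> >     __u32 bucket = 0;
> >>>>> >
> >>>>> >     if (!hist)
> >>>>> >         return;            /* nobody asked for stats on this prefix */
> >>>>> >
> >>>>> >     /* crude log2 bucketing of the RTT in microseconds */
> >>>>> >     while (rtt_us > 1 && bucket < RTT_BUCKETS - 1) {
> >>>>> >         rtt_us >>= 1;
> >>>>> >         bucket++;
> >>>>> >     }
> >>>>> >     if (bucket >= RTT_BUCKETS)             /* defensive bound check */
> >>>>> >         bucket = RTT_BUCKETS - 1;
> >>>>> >     __sync_fetch_and_add(&hist->bucket[bucket], 1);
> >>>>> > }
> >>>>> >
> >>>>> > Userspace then only has to walk the handful of prefixes it installed,
> >>>>> > rather than receive every individual RTT sample.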
> >>>>>
> >>>>> Thanks, Toke, for summarizing both the current state and the plan going
> >>>>> forward. I will just note that this PR (and all my other work with
> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or
> less
> >>>>> on hold for a couple of weeks right now as I'm trying to finish up a
> >>>>> paper.
> >>>>>
> >>>>> > I was thinking that if we all can agree on the map format, then
> your
> >>>>> > polling daemon could be one userspace "client" for that, and the
> epping
> >>>>> > binary itself could be another; but we could keep compatibility
> between
> >>>>> > the two, so we don't duplicate effort.
> >>>>> >
> >>>>> > Similarly, refactoring of the epping code itself so it can be
> plugged
> >>>>> > into the cpumap-tc code would be a good goal...
> >>>>>
> >>>>> Should probably do that...at some point. In general I think it's a
> bit
> >>>>> of an interesting problem to think about how to chain multiple XDP/tc
> >>>>> programs together in an efficient way. Most XDP and tc programs will
> do
> >>>>> some amount of packet parsing and when you have many chained programs
> >>>>> parsing the same packets, this obviously becomes a bit wasteful. At
> the
> >>>>> same time, it would be nice if one didn't need to manually merge
> >>>>> multiple programs together into a single one like this to get rid of
> >>>>> this duplicated parsing, or at least make that process of merging
> those
> >>>>> programs as simple as possible.
> >>>>>
> >>>>>
> >>>>> > -Toke
> >>>>> >
> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Robert Chacón
> >>> CEO | JackRabbit Wireless LLC
> >
>
>
>
> --
> This song goes out to all the folk that thought Stadia would work:
>
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> Dave Täht CEO, TekLibre, LLC
>

