That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform (this box is primarily my "write books and articles" machine). I'll be doing a wider test on a more normal platform, probably at the weekend (with real traffic, hence the delay - have to find a time in which I minimize disruption). On Wed, Oct 19, 2022 at 10:49 AM dan wrote: > Those 'efficiency' threads in Intel 12th gen should probably be addressed > as well. You can't turn them off in BIOS. > > On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS < > libreqos@lists.bufferbloat.net> wrote: > >> Awesome work on this! >> I suspect there should be a slight performance bump once Hyperthreading >> is disabled and efficient power management is off. >> Hyperthreading/SMT always messes with HTB performance when I leave it on. >> Thank you for mentioning that - I now went ahead and added instructions on >> disabling hyperthreading on the Wiki for new users. >> Super promising results! >> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. >> So far in your VM setup it seems to be doing very well. >> >> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS < >> libreqos@lists.bufferbloat.net> wrote: >> >>> Also, I forgot to mention that I *think* the current version has removed >>> the requirement that the inbound >>> and outbound classifiers be placed on the same CPU. I know interduo was >>> particularly keen on packing >>> upload into fewer cores. I'll add that to my list of things to test. >>> >>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson >>> wrote: >>> >>>> I'll definitely take a look - that does look interesting. I don't have >>>> X11 on any of my test VMs, but >>>> it looks like it can work without the GUI. >>>> >>>> Thanks! >>>> >>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht wrote: >>>> >>>>> could I coax you to adopt flent? >>>>> >>>>> apt-get install flent netperf irtt fping >>>>> >>>>> You sometimes have to compile netperf yourself with --enable-demo on >>>>> some systems. >>>>> There are a bunch of python libs needed for the gui, but only on the >>>>> client. >>>>> >>>>> Then you can run a really gnarly test series and plot the results over >>>>> time. >>>>> >>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H >>>>> the_server_name rrul # 110 other tests >>>>> >>>>> >>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS >>>>> wrote: >>>>> > >>>>> > Hey, >>>>> > >>>>> > Testing the current version ( >>>>> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing >>>>> better than I hoped. This build has shared (not per-cpu) maps, and a >>>>> userspace daemon (xdp_pping) to extract and reset stats. >>>>> > >>>>> > My testing environment has grown a bit: >>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new >>>>> cpumap-pping-hackjob version of xdp-cpumap. >>>>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf >>>>> server. >>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. >>>>> Hosts iperf client. >>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. >>>>> Hosts iperf client. >>>>> > >>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM >>>>> are on a virtual switch. >>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a >>>>> different virtual switch. >>>>> > >>>>> > These are all on a host machine running Windows 11, a core i7 12th >>>>> gen, 32 GB RAM and fast SSD setup.
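A side note on those 12th-gen 'efficiency' cores: since they can't be switched off in BIOS, one option is to restrict the cpumap CPU list to the performance cores only. A minimal sketch of finding them, assuming a kernel that exposes the hybrid-CPU sysfs nodes (/sys/devices/cpu_core/cpus for P-cores, /sys/devices/cpu_atom/cpus for E-cores) - this is not something LibreQoS or xdp-cpumap-tc reads today:

#include <stdio.h>

/* Print the P-core list on a hybrid Intel CPU so it can be fed to the
 * cpumap CPU assignment. The sysfs path is an assumption and only
 * exists on hybrid parts with a reasonably recent kernel. */
int main(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/cpu_core/cpus", "r");

    if (!f) {
        perror("cpu_core sysfs node (not hybrid, or older kernel?)");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("P-cores: %s", buf);   /* e.g. "0-11", cpulist format */
    fclose(f);
    return 0;
}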
>>>>> > >>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT >>>>> > >>>>> > For this test, LibreQoS is configured: >>>>> > * Two APs, each with 5gbit/s max. >>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about >>>>> 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs). >>>>> > * Set to use Cake >>>>> > >>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t >>>>> 500 (for a long run). Running xdp_pping yields correct results: >>>>> > >>>>> > [ >>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11}, >>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11}, >>>>> > {}] >>>>> > >>>>> > Or when I waited a while to gather/reset: >>>>> > >>>>> > [ >>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60}, >>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60}, >>>>> > {}] >>>>> > >>>>> > The ShaperVM shows no errors, just periodic logging that it is >>>>> recording data. CPU is about 2-3% on two CPUs, zero on the others (as >>>>> expected). >>>>> > >>>>> > After 500 seconds of continual iperfing, each client reported a >>>>> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted. >>>>> > >>>>> > So for smaller streams, I'd call this a success. >>>>> > >>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT >>>>> > >>>>> > For this test, LibreQoS is configured: >>>>> > * Two APs, each with 5gbit/s max. >>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5Gbit/s! >>>>> Mapped to 1:5 and 2:5 respectively (separate CPUs). >>>>> > >>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time. >>>>> > >>>>> > xdp_pping shows results, too: >>>>> > >>>>> > [ >>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58}, >>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58}, >>>>> > {}] >>>>> > >>>>> > [ >>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13}, >>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13}, >>>>> > {}] >>>>> > >>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent. >>>>> > >>>>> > After 500 seconds of continual iperfing, the clients reported >>>>> throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes) respectively. >>>>> > >>>>> > Maxing out HyperV like this is inducing a bit of latency (which is >>>>> to be expected), but it's not bad. I also forgot to disable hyperthreading, >>>>> and looking at the host performance it is sometimes running the second >>>>> virtual CPU on an underpowered "fake" CPU. >>>>> > >>>>> > So for two large streams, I think we're doing pretty well also! >>>>> > >>>>> > TEST 3: DUAL STREAMS, SINGLE CPU >>>>> > >>>>> > This test is designed to try and blow things up. It's the same as >>>>> test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and >>>>> 1:6. >>>>> > >>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. >>>>> The pping stats start to show a bit of degradation in performance from >>>>> pounding it so hard: >>>>> > >>>>> > [ >>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24}, >>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24}, >>>>> > {}] >>>>> > >>>>> > For whatever reason, it smoothed out over time: >>>>> > >>>>> > [ >>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50}, >>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50}, >>>>> > {}] >>>>> > >>>>> > Surprisingly (to me), I didn't encounter errors.
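(The "1:5", "2:5" and "1:6" strings in the xdp_pping output are just the textual form of the 32-bit tc classid - major number in the upper 16 bits, minor in the lower 16 - which is presumably what the map key carries internally. A small sketch of the encoding using the standard kernel macros; the printf is purely illustrative:)

#include <stdio.h>
#include <linux/types.h>
#include <linux/pkt_sched.h>

/* Show how a classid like "1:5" packs into one __u32 tc handle. */
int main(void)
{
    __u32 handle = TC_H_MAKE(1 << 16, 5);                 /* "1:5" */

    printf("handle 0x%x major %u minor %u\n",
           handle, TC_H_MAJ(handle) >> 16, TC_H_MIN(handle));
    /* prints: handle 0x10005 major 1 minor 5 */
    return 0;
}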
Each client >>>>> achieved 2.22 Gbit/s, over 129 GBytes of data. >>>>> > >>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS >>>>> > >>>>> > This test is also designed to break things. Same as test 3, but >>>>> using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really >>>>> tax the flow tracking. (Shorter time window because I really wanted to go >>>>> and find coffee) >>>>> > >>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping >>>>> results show that this torture test is worsening performance, and there's >>>>> always lots of samples in the buffer: >>>>> > >>>>> > [ >>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49}, >>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49}, >>>>> > {}] >>>>> > >>>>> > This test also ran better than I expected. You can definitely see >>>>> some latency creeping in as I make the system work hard. Each VM showed >>>>> around 2.4 Gbit/s in total performance at the end of the iperf session. >>>>> The added latency is expected - but I'm >>>>> not sure I expected quite that much. >>>>> > >>>>> > WHAT'S NEXT & CONCLUSION >>>>> > >>>>> > I noticed that I forgot to turn off efficient power management on my >>>>> VMs and host, and left Hyperthreading on by mistake. So that hurts overall >>>>> performance. >>>>> > >>>>> > The base system seems to be working pretty solidly, at least for >>>>> small tests. Next up, I'll be removing extraneous debug reporting code, >>>>> removing some code paths that don't do anything but report, and looking for >>>>> any small optimization opportunities. I'll then re-run these tests. Once >>>>> that's done, I hope to find a maintenance window on my WISP and try it with >>>>> actual traffic. >>>>> > >>>>> > I also need to re-run these tests without the pping system to >>>>> provide some before/after analysis. >>>>> > >>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson < >>>>> herberticus@gmail.com> wrote: >>>>> >> >>>>> >> It's probably not entirely thread-safe right now (ran into some >>>>> issues reading per_cpu maps back from userspace; hopefully, I'll get that >>>>> figured out) - but the commits I just pushed have it basically working on >>>>> single-stream testing. :-) >>>>> >> >>>>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives >>>>> you per-connection RTT information in JSON: >>>>> >> >>>>> >> [ >>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1}, >>>>> >> {}] >>>>> >> >>>>> >> (With the extra {} because I'm not tracking the tail and haven't >>>>> done comma removal). The tool also empties the various maps used to gather >>>>> data, acting as a "reset" point. There's a max of 60 samples per queue, in >>>>> a ringbuffer setup (so newest will start to overwrite the oldest). >>>>> >> >>>>> >> I'll start trying to test on a larger scale now. >>>>> >> >>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón < >>>>> robert.chacon@jackrabbitwireless.com> wrote: >>>>> >>> >>>>> >>> Hey Herbert, >>>>> >>> >>>>> >>> Fantastic work! Super exciting to see this coming together, >>>>> especially so quickly. >>>>> >>> I'll test it soon. >>>>> >>> I understand and agree with your decision to omit certain features >>>>> (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. >>>>> Like you said, in order to merge the functionality without a performance >>>>> hit, merging them is sort of the only way right now.
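(For reference, the 60-samples-per-queue ringbuffer described above amounts to something like the following; the struct and field names are illustrative, not the ones used in cpumap-pping-hackjob:)

#include <linux/types.h>

#define MAX_SAMPLES 60

/* Fixed-size per-queue RTT ring: once full, the newest sample
 * overwrites the oldest, so a reader always sees the latest <= 60
 * samples since the last reset. */
struct rtt_ring {
    __u32 rtt[MAX_SAMPLES];   /* e.g. RTTs in microseconds */
    __u32 next;               /* slot that will be written next */
};

static inline void ring_push(struct rtt_ring *r, __u32 rtt)
{
    r->rtt[r->next] = rtt;
    r->next = (r->next + 1) % MAX_SAMPLES;
}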
Otherwise there would >>>>> be a lot of redundancy and lost throughput for an ISP's use. Though >>>>> hopefully long term there will be a way to keep all projects working >>>>> independently but interoperably with a plugin system of some kind. >>>>> >>> >>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on >>>>> optimizations for high sub counts (8000+ subs) as well as stateful changes >>>>> to the queue structure. >>>>> >>> I'm working to set up a physical lab to test high throughput and >>>>> high client count scenarios. >>>>> >>> When testing beyond ~32,000 filters we get "no space left on >>>>> device" from xdp-cpumap-tc, which I think relates to the bpf map size >>>>> limitation you mentioned. Maybe in the coming months we can take a look at >>>>> that. >>>>> >>> >>>>> >>> Anyway, great work on the cpumap-pping program! Excited to see more >>>>> on this. >>>>> >>> >>>>> >>> Thanks, >>>>> >>> Robert >>>>> >>> >>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS < >>>>> libreqos@lists.bufferbloat.net> wrote: >>>>> >>>> >>>>> >>>> Hey, >>>>> >>>> >>>>> >>>> My current (unfinished) progress on this is now available here: >>>>> https://github.com/thebracket/cpumap-pping-hackjob >>>>> >>>> >>>>> >>>> I mean it about the warnings: this isn't at all stable or debugged >>>>> - and I can't promise that it won't unleash the nasal demons >>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-) >>>>> >>>> >>>>> >>>> With that said, I'm pretty happy so far: >>>>> >>>> >>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely >>>>> shunted onto a dedicated CPU. It has to run on both >>>>> >>>> the inbound and outbound classifiers, since otherwise it would >>>>> only see half the conversation. >>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to >>>>> the same interface; I do that anyway in BracketQoS. Not doing >>>>> >>>> that opens up a potential world of pain, since writes to the >>>>> shared maps would require a locking scheme. Too much locking, and you lose >>>>> all of the benefit of using multiple CPUs to begin with. >>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems >>>>> I've worked with have lots of it. >>>>> >>>> * I've been gradually removing features that I don't want for >>>>> BracketQoS. A hypothetical future "useful to everyone" version wouldn't do >>>>> that. >>>>> >>>> * Rate limiting is working, but I removed the requirement for a >>>>> shared configuration provided from userland - so right now it's always set >>>>> to report at 1 second intervals per stream. >>>>> >>>> >>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and >>>>> "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS. >>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s >>>>> max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my >>>>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64 GB RAM and >>>>> fast SSDs) >>>>> >>>> >>>>> >>>> Output currently consists of debug messages reading: >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399222: >>>>> bpf_trace_printk: (tc) Flow open event >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399239: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 374696 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399466: >>>>> bpf_trace_printk: (tc) Flow open event >>>>> >>>> cpumap/0/map:4-1371 [000] D..2.
515.399475: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 247069 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 516.405151: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 5217155 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 517.405248: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4515394 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 518.406117: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4481289 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 519.406255: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4255268 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 520.407864: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 5249493 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 521.406664: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 3795993 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 522.407469: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 3949519 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 523.408126: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4365335 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 524.408929: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4154910 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.410048: >>>>> bpf_trace_printk: (tc) Send performance event (5,1), 4405582 >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.434080: >>>>> bpf_trace_printk: (tc) Send flow event >>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.482714: >>>>> bpf_trace_printk: (tc) Send flow event >>>>> >>>> >>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle >>>>> major/minor, allocated by the xdp-cpumap parent. >>>>> >>>> I get pretty low latency between VMs; I'll set up a test with >>>>> some real-world data very soon. >>>>> >>>> >>>>> >>>> I plan to keep hacking away, but feel free to take a peek. >>>>> >>>> >>>>> >>>> Thanks, >>>>> >>>> Herbert >>>>> >>>> >>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg < >>>>> Simon.Sundberg@kau.se> wrote: >>>>> >>>>> >>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of >>>>> quick >>>>> >>>>> notes. >>>>> >>>>> >>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote: >>>>> >>>>> > [ Adding Simon to Cc ] >>>>> >>>>> > >>>>> >>>>> > Herbert Wolverson via LibreQoS >>>>> writes: >>>>> >>>>> > >>>>> >>>>> > > Hey, >>>>> >>>>> > > >>>>> >>>>> > > I've had some pretty good success with merging xdp-pping ( >>>>> >>>>> > > >>>>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h ) >>>>> >>>>> > > into xdp-cpumap-tc ( >>>>> https://github.com/xdp-project/xdp-cpumap-tc ). >>>>> >>>>> > > >>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed >>>>> the entry point >>>>> >>>>> > > and packet parsing code to make use of the work already done >>>>> in >>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the >>>>> packet, no need to do >>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had >>>>> to pin them - >>>>> >>>>> > > otherwise the two tc instances don't properly share data. >>>>> >>>>> > > >>>>> >>>>> >>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is >>>>> processed on >>>>> >>>>> the same CPU core at both ingress and egress.
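(For context, the two map flavours being weighed here look roughly like this in libbpf's BTF map syntax - a sketch only: the key/value layouts are placeholders, not the real cpumap-pping definitions. Pinning by name is what lets the separate ingress and egress tc attachments open the same map instance:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key   { __u32 saddr, daddr; __u16 sport, dport; };  /* placeholder */
struct flow_state { __u64 last_tsval_ns; __u32 last_rtt_us; };  /* placeholder */

/* Shared hash map: one copy of the state, fine as long as a given flow
 * is always handled on the same CPU (so no two CPUs race on one entry). */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
    __uint(pinning, LIBBPF_PIN_BY_NAME);   /* pinned so both tc programs share it */
} flow_state_shared SEC(".maps");

/* Per-CPU variant: every CPU gets its own value slot, so memory use is
 * roughly n_cpus times higher, and a flow that hops CPUs sees stale state. */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, struct flow_state);
    __uint(pinning, LIBBPF_PIN_BY_NAME);
} flow_state_percpu SEC(".maps");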
Otherwise, if a >>>>> flow may >>>>> >>>>> be processed by different cores on ingress and egress, the >>>>> per-CPU maps >>>>> >>>>> will not really work reliably as each core will have a different >>>>> view >>>>> >>>>> on the state of the flow, if there's been a previous packet with >>>>> a >>>>> >>>>> certain TSval from that flow etc. >>>>> >>>>> >>>>> >>>>> Furthermore, if a flow is always processed on the same core (on >>>>> both >>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on >>>>> >>>>> memory. From my understanding the keys for per-CPU maps are still >>>>> >>>>> shared across all CPUs, it's just that each CPU gets its own >>>>> value. So >>>>> >>>>> all CPUs will then have their own data for each flow, but it's >>>>> only the >>>>> >>>>> CPU processing the flow that will have any relevant data for the >>>>> flow >>>>> >>>>> while the remaining CPUs will just have an empty state for that >>>>> flow. >>>>> >>>>> Under the same assumption that packets within the same flow are >>>>> always >>>>> >>>>> processed on the same core there should generally not be any >>>>> >>>>> concurrency issues with having a global (non-per-CPU) map either as >>>>> packets >>>>> >>>>> from the same flow cannot be processed concurrently then (and >>>>> thus no >>>>> >>>>> concurrent access to the same value in the map). I am however >>>>> still >>>>> >>>>> very unclear on whether there's any considerable performance impact >>>>> between >>>>> >>>>> global and per-CPU map versions if the same key is not accessed >>>>> >>>>> concurrently. >>>>> >>>>> >>>>> >>>>> > > Right now, output >>>>> >>>>> > > is just stubbed - I've still got to port the perfmap output >>>>> code. Instead, >>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, >>>>> so I can see >>>>> >>>>> > > roughly what the output would look like. >>>>> >>>>> > > >>>>> >>>>> > > With debug enabled and just logging I'm now getting about >>>>> 4.9 Gbits/sec on >>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the >>>>> middle). :-) >>>>> >>>>> > >>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of >>>>> >>>>> > overhead, then. What Simon found was that sending the data >>>>> from kernel >>>>> >>>>> > to userspace is one of the most expensive bits of epping, at >>>>> least when >>>>> >>>>> > the number of data points goes up (which it does as additional >>>>> flows are >>>>> >>>>> > added). >>>>> >>>>> >>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you >>>>> may get >>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in >>>>> terms of >>>>> >>>>> direct overhead from the tool itself, but also becomes demanding >>>>> for >>>>> >>>>> whatever you use all those RTT samples for (i.e. need to log, >>>>> parse, >>>>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with >>>>> that is >>>>> >>>>> of course to just apply some sort of sampling (the >>>>> -r/--rate-limit and >>>>> >>>>> -R/--rtt-rate options). >>>>> >>>>> > >>>>> >>>>> > > So my question: how would you prefer to receive this data? >>>>> I'll have to >>>>> >>>>> > > write a daemon that provides userspace control (periodic >>>>> cleanup as well as >>>>> >>>>> > > reading the performance stream), so the world's kinda our >>>>> oyster.
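(The "report at 1 second intervals per stream" behaviour mentioned earlier comes down to a check like the one below on the BPF side - a sketch with made-up names, not the actual cpumap-pping or epping code:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define REPORT_INTERVAL_NS 1000000000ULL   /* one report per flow per second */

struct flow_report_state {
    __u64 last_report_ns;
};

/* Return 1 if this RTT sample should be pushed to userspace, 0 if it
 * should just be folded into whatever per-queue aggregate the map keeps. */
static __always_inline int should_report(struct flow_report_state *st)
{
    __u64 now = bpf_ktime_get_ns();

    if (now - st->last_report_ns < REPORT_INTERVAL_NS)
        return 0;
    st->last_report_ns = now;
    return 1;
}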
I can >>>>> >>>>> > > stick to Kathie's original format (and dump it to a named >>>>> pipe, perhaps?), >>>>> >>>>> > > a condensed format that only shows what you want to use, an >>>>> efficient >>>>> >>>>> > > binary format if you feel like parsing that... >>>>> >>>>> > >>>>> >>>>> > It would be great if we could combine efforts a bit here so we >>>>> don't >>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream" >>>>> epping and >>>>> >>>>> > whatever daemon you end up writing can agree on data format >>>>> etc that >>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :) >>>>> >>>>> > >>>>> >>>>> > Briefly what I've discussed before with Simon was to have the >>>>> ability to >>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a >>>>> userspace >>>>> >>>>> > utility periodically pull them out. What we discussed was >>>>> doing this >>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would >>>>> be that >>>>> >>>>> > userspace would populate the LPM map with the keys (prefixes) >>>>> they >>>>> >>>>> > wanted statistics for (in LibreQOS context that could be one >>>>> key per >>>>> >>>>> > customer, for instance). Epping would then do a map lookup >>>>> into the LPM, >>>>> >>>>> > and if it gets a match it would update the statistics in that >>>>> map entry >>>>> >>>>> > (keeping a histogram of latency values seen, basically). >>>>> Simon's PR >>>>> >>>>> > below uses this technique where userspace will "reset" the >>>>> histogram >>>>> >>>>> > every time it loads it by swapping out two different map >>>>> entries when it >>>>> >>>>> > does a read; this allows you to control the sampling rate from >>>>> >>>>> > userspace, and you'll just get the data since the last time >>>>> you polled. >>>>> >>>>> >>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan >>>>> going >>>>> >>>>> forward. I will just note that this PR (and all my other work >>>>> with >>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more >>>>> or less >>>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish >>>>> up a >>>>> >>>>> paper. >>>>> >>>>> >>>>> >>>>> > I was thinking that if we all can agree on the map format, >>>>> then your >>>>> >>>>> > polling daemon could be one userspace "client" for that, and >>>>> the epping >>>>> >>>>> > binary itself could be another; but we could keep >>>>> compatibility between >>>>> >>>>> > the two, so we don't duplicate effort. >>>>> >>>>> > >>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be >>>>> plugged >>>>> >>>>> > into the cpumap-tc code would be a good goal... >>>>> >>>>> >>>>> >>>>> Should probably do that...at some point. In general I think it's >>>>> a bit >>>>> >>>>> of an interesting problem to think about how to chain multiple >>>>> XDP/tc >>>>> >>>>> programs together in an efficient way. Most XDP and tc programs >>>>> will do >>>>> >>>>> some amount of packet parsing and when you have many chained >>>>> programs >>>>> >>>>> parsing the same packets this obviously becomes a bit wasteful. >>>>> At the >>>>> >>>>> same time, it would be nice if one didn't need to manually merge >>>>> >>>>> multiple programs together into a single one like this to get >>>>> rid of >>>>> >>>>> this duplicated parsing, or at least make that process of >>>>> merging those >>>>> >>>>> programs as simple as possible.
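(The LPM-map aggregation described above could look roughly like the sketch below: userspace inserts one prefix per customer, and the BPF side looks up the packet's address and bumps a latency-histogram bucket in the matched entry. Struct layout, bucket count and names are assumptions here - the WiP PR [0] is the authoritative version:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define RTT_BUCKETS 32

struct ipv4_lpm_key {
    __u32 prefixlen;              /* LPM trie keys must start with the prefix length */
    __u32 addr;                   /* IPv4 address, network byte order */
};

struct rtt_hist {
    __u64 bucket[RTT_BUCKETS];    /* e.g. roughly log2-spaced RTT buckets */
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 16384);   /* one entry per customer prefix */
    __type(key, struct ipv4_lpm_key);
    __type(value, struct rtt_hist);
    __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
    __uint(pinning, LIBBPF_PIN_BY_NAME);    /* so a poller can read/reset it */
} customer_rtt SEC(".maps");

static __always_inline void record_rtt(__u32 addr, __u32 rtt_us)
{
    struct ipv4_lpm_key key = { .prefixlen = 32, .addr = addr };
    struct rtt_hist *hist = bpf_map_lookup_elem(&customer_rtt, &key);
    __u32 b = 0;

    if (!hist)
        return;                   /* address not covered by any customer prefix */
    while (rtt_us > 1 && b < RTT_BUCKETS - 1) {   /* crude log2 bucketing */
        rtt_us >>= 1;
        b++;
    }
    __sync_fetch_and_add(&hist->bucket[b], 1);
}

A userspace poller would then iterate the pinned map (or swap in a fresh one, as in Simon's PR) and compute whatever per-customer summaries it wants since the last poll.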
>>>>> >>>>> >>>>> >>>>> > -Toke >>>>> >>>>> > >>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59 >>>>> >>>>> >>>>> -- >>>>> Dave Täht CEO, TekLibre, LLC >> >> -- >> Robert Chacón >> CEO | JackRabbit Wireless LLC >> _______________________________________________ >> LibreQoS mailing list >> LibreQoS@lists.bufferbloat.net >> https://lists.bufferbloat.net/listinfo/libreqos