From mboxrd@z Thu Jan 1 00:00:00 1970
From: Robert Chacón
Date: Wed, 19 Oct 2022 08:48:21 -0600
To: Herbert Wolverson
Cc: "libreqos@lists.bufferbloat.net"
Subject: Re: [LibreQoS] In BPF pping - so far
References: <87bkqatu61.fsf@toke.dk> <759c25c6fd54dceccc00eada5ccf5358d2d1c20c.camel@kau.se>
List-Id: Many ISPs need the kinds of quality shaping cake can do
X-BeenThere: libreqos@lists.bufferbloat.net
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

Awesome work on this!
I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off.
Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I went ahead and added instructions on disabling hyperthreading on the Wiki for new users.
Super promising results!
Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
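
(For anyone following along: besides the BIOS toggle, recent kernels also expose a runtime switch for SMT. This is the generic Linux interface, not anything LibreQoS-specific:

  echo off > /sys/devices/system/cpu/smt/control   # disable SMT until reboot
  cat /sys/devices/system/cpu/smt/active           # verify: should print 0

Adding "nosmt" to the kernel command line makes it stick across reboots.)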

On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound
and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing
upload into fewer cores. I'll add that to my list of things to test.

On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson wrote:
I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but it looks like it can work without the GUI.

Thanks!

On Wed, Oct 19, 2022 at 8:58 AM Dave Taht wrote:
could I coax you to adopt flent?

apt-get install flent netperf irtt fping

You sometimes have to compile netperf yourself with --enable-demo on
some systems.
There are a bunch of python libs needed for the gui, but only on the client.

Then you can run a really gnarly test series and plot the results over time.

flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
the_server_name rrul # 110 other tests
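
(To make that concrete for anyone who hasn't tried flent: a single rrul run through the shaper, plus rendering a plot from the result file afterwards, looks roughly like this - the hostname, title and plot name are placeholders/examples, not the only options:

  flent rrul -H 100.64.1.1 -l 60 --socket-stats --step-size=.05 -t 'cake-baseline'
  flent -i rrul-*.flent.gz -p ping_cdf -o ping_cdf.png
  flent --gui rrul-*.flent.gz

The .flent.gz files keep the raw samples, so runs can be re-plotted or overlaid later.)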


On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
<libreqos@lists.bufferbloat.net> wrote:
>
> Hey,
>
> Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
>
> My testing environment has grown a bit:
> * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
> * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf server.
> * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hosts an iperf client.
> * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hosts an iperf client.
>
> ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
> ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
>
> These are all on a host machine running Windows 11, a core i7 12th gen, 32 GB RAM and a fast SSD setup.
>
> TEST 1: DUAL STREAMS, LOW THROUGHPUT
>
> For this test, LibreQoS is configured:
> * Two APs, each with 5 Gbit/s max.
> * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about 100 Mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> * Set to use Cake
>
> On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
>
> [
> {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> {}]
>
> Or when I waited a while to gather/reset:
>
> [
> {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> {}]
>
> The ShaperVM shows no errors, just periodic logging that it is recording data. CPU is about 2-3% on two CPUs, zero on the others (as expected).
>
> After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>
> So for smaller streams, I'd call this a success.
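
(Side note: to keep a record during a long run, something as simple as this is enough - assuming xdp_pping sits in the current directory and emits the JSON shown above:

  while true; do date -Is; ./xdp_pping; sleep 10; done >> rtt_log.txt

Each poll also resets the stats, so the sample counts are per-interval.)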
>
> TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>
> For this test, LibreQoS is configured:
> * Two APs, each with 5 Gbit/s max.
> * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5 Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>
> Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>
> xdp_pping shows results, too:
>
> [
> {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> {}]
>
> [
> {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> {}]
>
> The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>
> After 500 seconds of continual iperfing, the two clients reported throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes).
>
> Maxing out Hyper-V like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
>
> So for two large streams, I think we're doing pretty well also!
>
> TEST 3: DUAL STREAMS, SINGLE CPU
>
> This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>
> ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The pping stats start to show a bit of degradation in performance for pounding it so hard:
>
> [
> {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> {}]
>
> For whatever reason, it smoothed out over time:
>
> [
> {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> {}]
>
> Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s performance, over 129 GBytes of data.
>
> TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>
> This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee)
>
> ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there's always lots of samples in the buffer:
>
> [
> {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
> {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
> {}]
>
> This test also ran better than I expected. Each VM showed around 2.4 Gbit/s in total performance at the end of the iperf session. There's definitely some latency creeping in as I make the system work hard, which is expected - but I'm not sure I expected quite that much.
>
> WHAT'S NEXT & CONCLUSION
>
> I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
>
> The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
>
> I also need to re-run these tests without the pping system to provide some before/after analysis.
>
> On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com> wrote:
>>
>> It's probably not entirely thread-safe right now (ran into some issues reading per_cpu maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working on single-stream testing. :-)
>>
>> Set up cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
>>
>> [
>> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>> {}]
>>
>> (With the extra {} because I'm not tracking the tail and haven't done comma removal). The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
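
(For readers who want to picture it: a "60 samples per queue, newest overwrites oldest" scheme maps naturally onto a value struct along these lines. This is my own illustration of the idea, not the actual cpumap-pping layout:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define MAX_SAMPLES 60

  /* Hypothetical per-queue ring of RTT samples, stored as the value of a
   * hash map keyed by tc handle (e.g. 0x00010005 for 1:5). */
  struct rtt_ring {
          __u32 rtt_us[MAX_SAMPLES];
          __u32 next;                      /* next slot to overwrite */
  };

  static __always_inline void push_rtt(struct rtt_ring *r, __u32 rtt_us)
  {
          __u32 idx = r->next;

          if (idx >= MAX_SAMPLES)          /* bounds check keeps the verifier happy */
                  idx = 0;
          r->rtt_us[idx] = rtt_us;         /* newest overwrites oldest */
          r->next = (idx + 1) % MAX_SAMPLES;
  }

The userspace "reset" is then just zeroing or deleting the entries after reading them.)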
>>
>> I'll start trying to test on a larger scale now.
>>
>> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <robert.chacon@jackrabbitwireless.com> wrote:
>>>
>>> Hey Herbert,
>>>
>>> Fantastic work! Super exciting to see this coming together, especially so quickly.
>>> I'll test it soon.
>>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc) to optimize performance for our use case. Like you said, merging them is sort of the only way right now to get the combined functionality without a performance hit. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all projects working independently but interoperably with a plugin system of some kind.
>>>
>>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
>>> I'm working to set up a physical lab to test high throughput and high client count scenarios.
>>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the bpf map size limitation you mentioned. Maybe in the coming months we can take a look at that.
>>>
>>> Anyway, great work on the cpumap-pping program! Excited to see more on this.
>>>
>>> Thanks,
>>> Robert
>>>
>>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>>>>
>>>> Hey,
>>>>
>>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
>>>>
>>>> I mean it about the warnings, this isn't at all stable or debugged - and I can't promise that it won't unleash the nasal demons
>>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>>>
>>>> With that said, I'm pretty happy so far:
>>>>
>>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both
>>>>   the inbound and outbound classifiers, since otherwise it would only see half the conversation.
>>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing
>>>>   that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
>>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
>>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
>>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.
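
(A quick illustration of the "shared maps" point above, for anyone who hasn't done this with libbpf: pinning a map by name is what lets the ingress and egress tc programs open the same map instead of each getting a private copy. The struct names below are made up for the example, not taken from cpumap-pping:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct flow_key   { __u32 saddr, daddr; __u16 sport, dport; __u8 proto; };
  struct flow_state { __u64 last_seen_ns; __u32 last_rtt_us; __u32 samples; };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 65536);
          __type(key, struct flow_key);
          __type(value, struct flow_state);
          __uint(pinning, LIBBPF_PIN_BY_NAME);   /* both tc attachments share this */
  } flow_state_map SEC(".maps");

Because a single shared (non-per-CPU) map is written from both directions, keeping a flow's packets on one CPU - which xdp-cpumap-tc already arranges - is what makes this safe without extra locking.)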
>>>>
>>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>>> test PC (the host is a core i7, 12th gen, 12 cores - 64 GB RAM and fast SSDs)
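
(For context, a "cake/HTB queue setup" of this kind boils down to per-interface tc rules of roughly this shape - a hand-written illustration, not the exact commands LibreQoS generates:

  tc qdisc replace dev eth1 root handle 1: htb default 5
  tc class add dev eth1 parent 1: classid 1:5 htb rate 10gbit ceil 10gbit
  tc qdisc add dev eth1 parent 1:5 cake bandwidth 10gbit besteffort

with one HTB class and a cake leaf per customer/CPE in the real thing.)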
>>>>
>>>> Output currently consists of debug messages reading:
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
>>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
>>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event
>>>>
>>>> The times haven't been tweaked yet. The (5,1) is tc handle major/minor, allocated by the xdp-cpumap parent.
>>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
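
(If anyone wants to watch those messages live while reproducing this: bpf_trace_printk output just lands in the kernel trace buffer, so

  cat /sys/kernel/debug/tracing/trace_pipe

is all it takes, no extra tooling.)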
>>>>
>>>> I plan to keep hacking away, but feel free to take a peek.
>>>>
>>>> Thanks,
>>>> Herbert
>>>>
>>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se> wrote:
>>>>>
>>>>> Hi, thanks for adding me to the conversation. Just a couple of quick notes.
>>>>>
>>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>>>> > [ Adding Simon to Cc ]
>>>>> >
>>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>>>>> >
>>>>> > > Hey,
>>>>> > >
>>>>> > > I've had some pretty good success with merging xdp-pping (
>>>>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>>>>> > >
>>>>> > > I ported over most of the xdp-pping code, and then changed the entry point
>>>>> > > and packet parsing code to make use of the work already done in
>>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
>>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin them -
>>>>> > > otherwise the two tc instances don't properly share data.
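
(Handy while poking at the pinned-maps approach: bpftool can list and dump whatever ends up under the bpf filesystem, e.g.

  bpftool map list
  bpftool map dump pinned /sys/fs/bpf/tc/globals/<map_name>

where the exact pin path depends on how the maps are declared - check under /sys/fs/bpf on your box.)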
>>>>> > >
>>>>>
>>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>>>>> the same CPU core at both ingress and egress. Otherwise, if a flow may
>>>>> be processed by different cores on ingress and egress the per-CPU maps
>>>>> will not really work reliably as each core will have a different view
>>>>> on the state of the flow, if there's been a previous packet with a
>>>>> certain TSval from that flow etc.
>>>>>
>>>>> Furthermore, if a flow is always processed on the same core (on both
>>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>>>>> memory. From my understanding the keys for per-CPU maps are still
>>>>> shared across all CPUs, it's just that each CPU gets its own value. So
>>>>> all CPUs will then have their own data for each flow, but it's only the
>>>>> CPU processing the flow that will have any relevant data for the flow
>>>>> while the remaining CPUs will just have an empty state for that flow.
>>>>> Under the same assumption that packets within the same flow are always
>>>>> processed on the same core there should generally not be any
>>>>> concurrency issues with having a global (non-per-CPU) map either, as packets
>>>>> from the same flow cannot be processed concurrently then (and thus no
>>>>> concurrent access to the same value in the map). I am however still
>>>>> very unclear on if there's any considerable performance impact between
>>>>> global and per-CPU map versions if the same key is not accessed
>>>>> concurrently.
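
(To make the "keys shared, one value slot per CPU" point concrete from the userspace side - an illustrative snippet, not code from epping or cpumap-pping:

  #include <stdlib.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>

  /* Sum a u64 counter across all CPUs for one key of a BPF_MAP_TYPE_PERCPU_HASH.
   * The lookup fills one value per possible CPU; with flow-to-CPU pinning only
   * the CPU that handled the flow will have a non-zero entry. */
  static unsigned long long percpu_sum(int map_fd, const void *key)
  {
          int ncpus = libbpf_num_possible_cpus();
          unsigned long long *vals;
          unsigned long long total = 0;

          if (ncpus < 1)
                  return 0;
          vals = calloc(ncpus, sizeof(*vals));
          if (!vals)
                  return 0;
          if (bpf_map_lookup_elem(map_fd, key, vals) == 0)
                  for (int i = 0; i < ncpus; i++)
                          total += vals[i];
          free(vals);
          return total;
  }

So every flow key costs num_possible_cpus() value slots even though only one of them is ever touched - which is the memory-overhead point above.)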
>>>>>
>>>>> > > Right now, output
>>>>> > > is just stubbed - I've still got to port the perfmap output code. Instead,
>>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
>>>>> > > roughly what the output would look like.
>>>>> > >
>>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
>>>>> > > single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>>>>> >
>>>>> > Just FYI, that "just logging" is probably the biggest source of
>>>>> > overhead, then. What Simon found was that sending the data from kernel
>>>>> > to userspace is one of the most expensive bits of epping, at least when
>>>>> > the number of data points goes up (which it does as additional flows are
>>>>> > added).
>>>>>
>>>>> Yeah, reporting individual RTTs when there's lots of them (you may get
>>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>>>>> direct overhead from the tool itself, but also becomes demanding for
>>>>> whatever you use all those RTT samples for (i.e. need to log, parse,
>>>>> analyze etc. a very large amount of RTTs). One way to deal with that is
>>>>> of course to just apply some sort of sampling (the -r/--rate-limit and
>>>>> -R/--rtt-rate options).
>>>>> >
>>>>> > > So my question: how would you prefer to receive this data? I'll have to
>>>>> > > write a daemon that provides userspace control (periodic cleanup as well as
>>>>> > > reading the performance stream), so the world's kinda our oyster. I can
>>>>> > > stick to Kathie's original format (and dump it to a named pipe, perhaps?),
>>>>> > > a condensed format that only shows what you want to use, an efficient
>>>>> > > binary format if you feel like parsing that...
>>>>> >
>>>>> > It would be great if we could combine efforts a bit here so we don't
>>>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>>>>> > whatever daemon you end up writing can agree on data format etc that
>>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>>>> >
>>>>> > Briefly what I've discussed before with Simon was to have the ability to
>>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>>>>> > utility periodically pull them out. What we discussed was doing this
>>>>> > using an LPM map (which is not in that PR yet). The idea would be that
>>>>> > userspace would populate the LPM map with the keys (prefixes) they
>>>>> > wanted statistics for (in LibreQOS context that could be one key per
>>>>> > customer, for instance). Epping would then do a map lookup into the LPM,
>>>>> > and if it gets a match it would update the statistics in that map entry
>>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
>>>>> > below uses this technique where userspace will "reset" the histogram
>>>>> > every time it loads it by swapping out two different map entries when it
>>>>> > does a read; this allows you to control the sampling rate from
>>>>> > userspace, and you'll just get the data since the last time you polled.
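
(A sketch of what that LPM-keyed aggregation could look like on the BPF side, to make the idea concrete - the key/value layouts here are my guesses for illustration, not what the WiP PR actually implements:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct ip4_lpm_key {
          __u32 prefixlen;              /* LPM trie keys must start with this */
          __u32 addr;                   /* IPv4 address, network byte order */
  };

  struct rtt_hist {
          __u64 bucket[32];             /* e.g. one bucket per power-of-two of latency */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_LPM_TRIE);
          __uint(max_entries, 16384);
          __type(key, struct ip4_lpm_key);
          __type(value, struct rtt_hist);
          __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
          __uint(pinning, LIBBPF_PIN_BY_NAME);
  } customer_rtt SEC(".maps");

  /* Called from the tc/XDP program once an RTT sample has been computed. */
  static __always_inline void record_rtt(__u32 daddr, __u64 rtt_ns)
  {
          struct ip4_lpm_key key = { .prefixlen = 32, .addr = daddr };
          struct rtt_hist *h = bpf_map_lookup_elem(&customer_rtt, &key);
          __u32 b = 0;

          if (!h)
                  return;                 /* no userspace-supplied prefix matched */
          while (b < 31 && (rtt_ns >> (b + 1)))
                  b++;                    /* crude log2 bucketing */
          __sync_fetch_and_add(&h->bucket[b], 1);
  }

Userspace would bpf_map_update_elem() the customer prefixes in, then periodically walk the entries - or swap two maps, as in the PR mentioned above - to harvest and reset the histograms.)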
>>>>>
>>>>> Thanks Toke for summarizing both the current state and the plan going
>>>>> forward. I will just note that this PR (and all my other work with
>>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>>>>> on hold for a couple of weeks right now as I'm trying to finish up a
>>>>> paper.
>>>>>
>>>>> > I was thinking that if we all can agree on the map format, then your
>>>>> > polling daemon could be one userspace "client" for that, and the epping
>>>>> > binary itself could be another; but we could keep compatibility between
>>>>> > the two, so we don't duplicate effort.
>>>>> >
>>>>> > Similarly, refactoring of the epping code itself so it can be plugged
>>>>> > into the cpumap-tc code would be a good goal...
>>>>>
>>>>> Should probably do that...at some point. In general I think it's a bit
>>>>> of an interesting problem to think about how to chain multiple XDP/tc
>>>>> programs together in an efficient way. Most XDP and tc programs will do
>>>>> some amount of packet parsing and when you have many chained programs
>>>>> parsing the same packets this obviously becomes a bit wasteful. At the
>>>>> same time it would be nice if one didn't need to manually merge
>>>>> multiple programs together into a single one like this to get rid of
>>>>> this duplicated parsing, or at least make that process of merging those
>>>>> programs as simple as possible.
>>>>>
>>>>>
>>>>> > -Toke
>>>>> >
>>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>>>
>>>>> When you send an e-mail to Karlstad University, we will process your personal data <https://www.kau.se/en/gdpr>.
>>>>
>>>> _______________________________________________
>>>> LibreQoS mailing list
>>>> LibreQoS@lists.bufferbloat.net
>>>> https://lists.bufferbloat.net/listinfo/libreqos
>>>
>>>
>>>
>>> --
>>> Robert Chacón
>>> CEO | JackRabbit Wireless LLC
>
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos



--
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC
_______________________________________________
LibreQoS mailing list
LibreQoS@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/libreqos


--
Robert Chacón
CEO | JackRabbit Wireless LLC