From: Dave Taht
Date: Wed, 19 Oct 2022 09:13:38 -0700
To: Herbert Wolverson
Cc: "libreqos@lists.bufferbloat.net"
Subject: Re: [LibreQoS] In BPF pping - so far

PS - today's (free) p99 conference is *REALLY AWESOME*. https://www.p99conf.io/

On Wed, Oct 19, 2022 at 9:13 AM Dave Taht wrote:
>
> flent outputs a flent.gz file that I can parse and plot 20 different ways. Also the graphing tools work on OS X.
>
> On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS wrote:
> >
> > That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform (this box is primarily my "write books and articles" machine).
> > I'll be doing a wider test on a more normal platform, probably at the weekend (with real traffic, hence the delay - I have to find a time in which I can minimize disruption).
> >
> > On Wed, Oct 19, 2022 at 10:49 AM dan wrote:
> >>
> >> Those 'efficiency' threads in Intel 12th gen should probably be addressed as well. You can't turn them off in BIOS.
> >>
> >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS wrote:
> >>>
> >>> Awesome work on this!
> >>> I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off.
> >>> Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I went ahead and added instructions on disabling hyperthreading to the Wiki for new users.
> >>> Super promising results!
> >>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
> >>>
> >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS wrote:
> >>>>
> >>>> Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing upload into fewer cores. I'll add that to my list of things to test.
> >>>>
> >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson wrote:
> >>>>>
> >>>>> I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but it looks like it can work without the GUI.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht wrote:
> >>>>>>
> >>>>>> could I coax you to adopt flent?
> >>>>>>
> >>>>>> apt-get install flent netperf irtt fping
> >>>>>>
> >>>>>> You sometimes have to compile netperf yourself with --enable-demo on some systems.
> >>>>>> There are a bunch of Python libs needed for the GUI, but only on the client.
> >>>>>>
> >>>>>> Then you can run a really gnarly test series and plot the results over time.
> >>>>>>
> >>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H the_server_name rrul # 110 other tests
> >>>>>>
> >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS wrote:
> >>>>>> >
> >>>>>> > Hey,
> >>>>>> >
> >>>>>> > Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-CPU) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
> >>>>>> >
> >>>>>> > My testing environment has grown a bit:
> >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
> >>>>>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf server.
> >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hosts an iperf client.
> >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hosts an iperf client.
> >>>>>> >
> >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
> >>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
> >>>>>> >
> >>>>>> > These are all on a host machine running Windows 11, with a 12th-gen Core i7, 32 GB RAM and a fast SSD setup.
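
For reference, a minimal sketch of what a shared, pinned stats map along the lines Herbert describes might look like, assuming libbpf-style BTF map definitions. The struct layout and all names here are hypothetical illustrations, not the actual cpumap-pping-hackjob code; the point is simply that pinning a non-per-CPU map lets both tc programs and the xdp_pping daemon work against the same data:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical per-tc-handle RTT summary; userspace derives avg = sum / samples. */
    struct rtt_stats {
        __u32 samples;
        __u32 min_ms;
        __u32 max_ms;
        __u64 sum_ms;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);         /* shared, NOT per-CPU */
        __uint(max_entries, 65536);
        __type(key, __u32);                      /* tc handle (major:minor packed) */
        __type(value, struct rtt_stats);
        __uint(pinning, LIBBPF_PIN_BY_NAME);     /* visible under /sys/fs/bpf to both
                                                    tc instances and the daemon */
    } rtt_tracker SEC(".maps");

A userspace daemon could then open the pinned path with bpf_obj_get(), walk the entries, and zero them to provide the "extract and reset" behaviour described above.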
> >>>>>> >
> >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
> >>>>>> >
> >>>>>> > For this test, LibreQoS is configured:
> >>>>>> > * Two APs, each with 5 Gbit/s max.
> >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about 100 Mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> >>>>>> > * Set to use Cake.
> >>>>>> >
> >>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > Or when I waited a while to gather/reset:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > The ShaperVM shows no errors, just periodic logging that it is recording data. CPU is about 2-3% on two CPUs, zero on the others (as expected).
> >>>>>> >
> >>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/s and 6.06 GBytes of data transmitted.
> >>>>>> >
> >>>>>> > So for smaller streams, I'd call this a success.
> >>>>>> >
> >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
> >>>>>> >
> >>>>>> > For this test, LibreQoS is configured:
> >>>>>> > * Two APs, each with 5 Gbit/s max.
> >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5 Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
> >>>>>> >
> >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
> >>>>>> >
> >>>>>> > xdp_pping shows results, too:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
> >>>>>> >
> >>>>>> > After 500 seconds of continual iperfing, the clients reported throughputs of 2.72 Gbit/s (158 GBytes) and 3.89 Gbit/s (226 GBytes) respectively.
> >>>>>> >
> >>>>>> > Maxing out Hyper-V like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
> >>>>>> >
> >>>>>> > So for two large streams, I think we're doing pretty well also!
> >>>>>> >
> >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
> >>>>>> >
> >>>>>> > This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
> >>>>>> >
> >>>>>> > ShaperVM CPU 1 maxed out in the high 90s; the other CPUs were idle.
> >>>>>> > The pping stats start to show a bit of performance degradation from pounding it so hard:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > For whatever reason, it smoothed out over time:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s of throughput, over 129 GBytes of data.
> >>>>>> >
> >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
> >>>>>> >
> >>>>>> > This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 sub-streams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee.)
> >>>>>> >
> >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there are always lots of samples in the buffer:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
> >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > This test also ran better than I expected. Each VM showed around 2.4 Gbit/s in total throughput at the end of the iperf session. You can definitely see some latency creeping in as I make the system work hard - which is expected, but I'm not sure I expected quite that much.
> >>>>>> >
> >>>>>> > WHAT'S NEXT & CONCLUSION
> >>>>>> >
> >>>>>> > I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
> >>>>>> >
> >>>>>> > The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
> >>>>>> >
> >>>>>> > I also need to re-run these tests without the pping system to provide some before/after analysis.
> >>>>>> >
> >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson wrote:
> >>>>>> >>
> >>>>>> >> It's probably not entirely thread-safe right now (I ran into some issues reading per-CPU maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working in single-stream testing. :-)
> >>>>>> >>
> >>>>>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
> >>>>>> >>
> >>>>>> >> [
> >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
> >>>>>> >> {}]
> >>>>>> >>
> >>>>>> >> (With the extra {} because I'm not tracking the tail and haven't done comma removal.) The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ring-buffer setup (so the newest will start to overwrite the oldest).
> >>>>>> >>
> >>>>>> >> I'll start trying to test on a larger scale now.
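
As a rough illustration of the per-queue ring buffer Herbert describes (a fixed 60 samples, with the newest overwriting the oldest), the structure might look something like the sketch below. All names are hypothetical, not the actual cpumap-pping code:

    #include <linux/types.h>
    #include <bpf/bpf_helpers.h>

    #define MAX_SAMPLES 60   /* "max of 60 samples per queue" as described above */

    /* Hypothetical per-queue sample ring: writes wrap around, so once the
       buffer is full the newest sample overwrites the oldest. */
    struct rtt_ring {
        __u32 rtt_ms[MAX_SAMPLES];
        __u32 next_entry;    /* next slot to write */
        __u32 count;         /* saturates at MAX_SAMPLES */
    };

    static __always_inline void ring_push(struct rtt_ring *ring, __u32 rtt_ms)
    {
        __u32 slot = ring->next_entry % MAX_SAMPLES;   /* keeps the index bounded */
        ring->rtt_ms[slot] = rtt_ms;
        ring->next_entry = (ring->next_entry + 1) % MAX_SAMPLES;
        if (ring->count < MAX_SAMPLES)
            ring->count++;
    }

On this model, the xdp_pping "reset" step just amounts to zeroing next_entry and count (or the whole map value) after reading the entries out.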
> >>>>>> >>
> >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón wrote:
> >>>>>> >>>
> >>>>>> >>> Hey Herbert,
> >>>>>> >>>
> >>>>>> >>> Fantastic work! Super exciting to see this coming together, especially so quickly.
> >>>>>> >>> I'll test it soon.
> >>>>>> >>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. Like you said, in order to merge the functionality without a performance hit, merging them is sort of the only way right now. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all the projects working independently but interoperably with a plugin system of some kind.
> >>>>>> >>>
> >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
> >>>>>> >>> I'm working to set up a physical lab to test high-throughput and high-client-count scenarios.
> >>>>>> >>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the BPF map size limitation you mentioned. Maybe in the coming months we can take a look at that.
> >>>>>> >>>
> >>>>>> >>> Anyway, great work on the cpumap-pping program! Excited to see more on this.
> >>>>>> >>>
> >>>>>> >>> Thanks,
> >>>>>> >>> Robert
> >>>>>> >>>
> >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS wrote:
> >>>>>> >>>>
> >>>>>> >>>> Hey,
> >>>>>> >>>>
> >>>>>> >>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
> >>>>>> >>>>
> >>>>>> >>>> I mean it about the warnings; this isn't at all stable or debugged - and I can't promise that it won't unleash the nasal demons (to use a popular C++ phrase). The name is descriptive! ;-)
> >>>>>> >>>>
> >>>>>> >>>> With that said, I'm pretty happy so far:
> >>>>>> >>>>
> >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both the inbound and outbound classifiers, since otherwise it would only see half the conversation.
> >>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
> >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
> >>>>>> >>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
> >>>>>> >>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1-second intervals per stream.
> >>>>>> >>>>
> >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
> >>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10 Gbit/s max, via a cake/HTB queue setup) is around 5 Gbit/s at present, on my test PC (the host is a 12th-gen Core i7, 12 cores, 64 GB RAM and fast SSDs).
> >>>>>> >>>>
> >>>>>> >>>> Output currently consists of debug messages reading:
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399222: bpf_trace_printk: (tc) Flow open event
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399466: bpf_trace_printk: (tc) Flow open event
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.434080: bpf_trace_printk: (tc) Send flow event
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.482714: bpf_trace_printk: (tc) Send flow event
> >>>>>> >>>>
> >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is the tc handle major/minor, allocated by the xdp-cpumap parent.
> >>>>>> >>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
> >>>>>> >>>>
> >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
> >>>>>> >>>>
> >>>>>> >>>> Thanks,
> >>>>>> >>>> Herbert
> >>>>>> >>>>
> >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg wrote:
> >>>>>> >>>>>
> >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick notes.
> >>>>>> >>>>>
> >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> >>>>>> >>>>> > [ Adding Simon to Cc ]
> >>>>>> >>>>> >
> >>>>>> >>>>> > Herbert Wolverson via LibreQoS writes:
> >>>>>> >>>>> >
> >>>>>> >>>>> > > Hey,
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping ( https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h ) into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
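
To make the merge idea concrete (the next paragraph explains the approach: reuse the parsing xdp-cpumap-tc has already done instead of parsing the packet twice), here is a very rough sketch of what the glue might look like. Every name below is a hypothetical placeholder rather than the actual cpumap-pping-hackjob code, and the handle ordering in the trace line is purely illustrative:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical "already parsed" context handed over by the existing
       xdp-cpumap-tc classifier, so the pping-style logic does not re-parse. */
    struct parse_ctx {
        __u32 l4_offset;   /* offset of the TCP header found by the existing parser */
        __u32 tc_major;    /* tc handle the packet was classified into */
        __u32 tc_minor;
    };

    static __always_inline void pping_tc_hook(struct __sk_buff *skb,
                                              const struct parse_ctx *ctx)
    {
        /* ...match TSval/TSecr against the timestamp map and compute the RTT... */
        __u64 rtt_ns = 0;  /* placeholder for the computed value */

        /* Debug output of the kind shown in the trace above; bpf_printk
           writes to /sys/kernel/debug/tracing/trace_pipe. */
        bpf_printk("(tc) Send performance event (%u,%u), %llu",
                   ctx->tc_minor, ctx->tc_major, rtt_ns);
    }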
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the entry point and packet parsing code to make use of the work already done in xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do it twice). Then I switched the maps to per-CPU maps, and had to pin them - otherwise the two tc instances don't properly share data.
> >>>>>> >>>>> > >
> >>>>>> >>>>> I guess xdp-cpumap-tc ensures that the same flow is processed on the same CPU core on both ingress and egress. Otherwise, if a flow may be processed by different cores on ingress and egress, the per-CPU maps will not really work reliably, as each core will have a different view of the state of the flow, of whether there's been a previous packet with a certain TSval from that flow, etc.
> >>>>>> >>>>>
> >>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on both ingress and egress) I think per-CPU maps may be a bit wasteful of memory. From my understanding the keys for per-CPU maps are still shared across all CPUs, it's just that each CPU gets its own value. So all CPUs will then have their own data for each flow, but it's only the CPU processing the flow that will have any relevant data for it, while the remaining CPUs will just have an empty state for that flow. Under the same assumption that packets within the same flow are always processed on the same core, there should generally not be any concurrency issues with having a global (non-per-CPU) map either, as packets from the same flow cannot be processed concurrently then (and thus no concurrent access to the same value in the map). I am, however, still very unclear on whether there's any considerable performance difference between the global and per-CPU map versions if the same key is not accessed concurrently.
> >>>>>> >>>>>
> >>>>>> >>>>> > > Right now, output is just stubbed - I've still got to port the perfmap output code. Instead, I'm dumping a bunch of extra data to the kernel debug pipe, so I can see roughly what the output would look like.
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on single-stream iperf between two VMs (with a shaper VM in the middle). :-)
> >>>>>> >>>>> >
> >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of overhead, then. What Simon found was that sending the data from kernel to userspace is one of the most expensive bits of epping, at least when the number of data points goes up (which it does as additional flows are added).
> >>>>>> >>>>>
> >>>>>> >>>>> Yeah, reporting individual RTTs when there are lots of them (you may get upwards of 1000 RTTs/s per flow) is not only problematic in terms of direct overhead from the tool itself, but also becomes demanding for whatever you use all those RTT samples for (i.e. you need to log, parse, analyze etc. a very large amount of RTTs).
> >>>>>> >>>>> One way to deal with that is of course to just apply some sort of sampling (the -r/--rate-limit and -R/--rtt-rate
> >>>>>> >>>>> >
> >>>>>> >>>>> > > So my question: how would you prefer to receive this data? I'll have to write a daemon that provides userspace control (periodic cleanup as well as reading the performance stream), so the world's kinda our oyster. I can stick to Kathie's original format (and dump it to a named pipe, perhaps?), a condensed format that only shows what you want to use, an efficient binary format if you feel like parsing that...
> >>>>>> >>>>> >
> >>>>>> >>>>> > It would be great if we could combine efforts a bit here so we don't fork the codebase more than we have to. I.e., if "upstream" epping and whatever daemon you end up writing can agree on data format etc., that would be fantastic! Added Simon to Cc to facilitate this :)
> >>>>>> >>>>> >
> >>>>>> >>>>> > Briefly, what I've discussed before with Simon was to have the ability to aggregate the metrics in the kernel (WiP PR [0]) and have a userspace utility periodically pull them out. What we discussed was doing this using an LPM map (which is not in that PR yet). The idea would be that userspace would populate the LPM map with the keys (prefixes) it wanted statistics for (in the LibreQoS context that could be one key per customer, for instance). Epping would then do a map lookup into the LPM, and if it gets a match it would update the statistics in that map entry (keeping a histogram of latency values seen, basically). Simon's PR below uses this technique, where userspace will "reset" the histogram every time it loads it by swapping out two different map entries when it does a read; this allows you to control the sampling rate from userspace, and you'll just get the data since the last time you polled.
> >>>>>> >>>>>
> >>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan going forward. I will just note that this PR (and all my other work with ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less on hold for a couple of weeks right now as I'm trying to finish up a paper.
> >>>>>> >>>>>
> >>>>>> >>>>> > I was thinking that if we all can agree on the map format, then your polling daemon could be one userspace "client" for that, and the epping binary itself could be another; but we could keep compatibility between the two, so we don't duplicate effort.
> >>>>>> >>>>> >
> >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be plugged into the cpumap-tc code would be a good goal...
> >>>>>> >>>>>
> >>>>>> >>>>> Should probably do that... at some point. In general I think it's a bit of an interesting problem to think about how to chain multiple XDP/tc programs together in an efficient way.
> >>>>>> >>>>> Most XDP and tc programs will do some amount of packet parsing, and when you have many chained programs parsing the same packets this obviously becomes a bit wasteful. At the same time, it would be nice if one didn't need to manually merge multiple programs together into a single one like this to get rid of the duplicated parsing, or at least make that process of merging those programs as simple as possible.
> >>>>>> >>>>>
> >>>>>> >>>>> > -Toke
> >>>>>> >>>>> >
> >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
> >>>>>> >>>>>
> >>>>>> >>>>> When you send an e-mail to Karlstad University, we will process your personal data.
> >>>>>> >>>
> >>>>>> >>> --
> >>>>>> >>> Robert Chacón
> >>>>>> >>> CEO | JackRabbit Wireless LLC
>
> --
> This song goes out to all the folk that thought Stadia would work:
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> Dave Täht CEO, TekLibre, LLC

-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC