From: Dave Taht
Date: Wed, 19 Oct 2022 09:13:02 -0700
To: Herbert Wolverson
Cc: libreqos@lists.bufferbloat.net
Subject: Re: [LibreQoS] In BPF pping - so far

flent outputs a flent.gz file that I can parse and plot 20 different
ways. Also the graphing tools work on OS X.

On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS wrote:
>
> That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform
> (this box is primarily my "write books and articles" machine). I'll be doing a wider test on a more normal
> platform, probably at the weekend (with real traffic, hence the delay - I have to find a time in which I
> minimize disruption).
>
> On Wed, Oct 19, 2022 at 10:49 AM dan wrote:
>>
>> Those 'efficiency' threads in Intel 12th gen should probably be addressed as well. You can't turn them off in BIOS.
>>
>> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS wrote:
>>>
>>> Awesome work on this!
>>> I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off.
>>> Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I went ahead and added instructions on disabling hyperthreading to the Wiki for new users.
>>> Super promising results!
>>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
>>>
>>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS wrote:
>>>>
>>>> Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound
>>>> and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing
>>>> upload into fewer cores. I'll add that to my list of things to test.
>>>>
>>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson wrote:
>>>>>
>>>>> I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but
>>>>> it looks like it can work without the GUI.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht wrote:
>>>>>>
>>>>>> could I coax you to adopt flent?
>>>>>>
>>>>>> apt-get install flent netperf irtt fping
>>>>>>
>>>>>> You sometimes have to compile netperf yourself with --enable-demo on
>>>>>> some systems.
>>>>>> There are a bunch of python libs needed for the GUI, but only on the client.
>>>>>>
>>>>>> Then you can run a really gnarly test series and plot the results over time.
>>>>>>
>>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H
>>>>>> the_server_name rrul # 110 other tests
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>>>>>> wrote:
>>>>>> >
>>>>>> > Hey,
>>>>>> >
>>>>>> > Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
>>>>>> >
>>>>>> > My testing environment has grown a bit:
>>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
>>>>>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf server.
>>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hosts an iperf client.
>>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hosts an iperf client.
>>>>>> >
>>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
>>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
>>>>>> >
>>>>>> > These are all on a host machine running Windows 11, with a Core i7 12th gen, 32 GB RAM and a fast SSD setup.
>>>>>> >
>>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>>>>>> >
>>>>>> > For this test, LibreQoS is configured:
>>>>>> > * Two APs, each with 5 gbit/s max.
>>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about 100 mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>>>>>> > * Set to use Cake
>>>>>> >
>>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run).
>>>>>> > Running xdp_pping yields correct results:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>>>>>> > {}]
>>>>>> >
>>>>>> > Or when I waited a while to gather/reset:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>>>>>> > {}]
>>>>>> >
>>>>>> > The ShaperVM shows no errors, just periodic logging that it is recording data. CPU is about 2-3% on two CPUs, zero on the others (as expected).
>>>>>> >
>>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>>>>>> >
>>>>>> > So for smaller streams, I'd call this a success.
>>>>>> >
>>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>>>>>> >
>>>>>> > For this test, LibreQoS is configured:
>>>>>> > * Two APs, each with 5 gbit/s max.
>>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5 Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>>>>>> >
>>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>>>>>> >
>>>>>> > xdp_pping shows results, too:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>>>>>> > {}]
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>>>>>> > {}]
>>>>>> >
>>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>>>>>> >
>>>>>> > After 500 seconds of continual iperfing, the two clients reported throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes) respectively.
>>>>>> >
>>>>>> > Maxing out HyperV like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
>>>>>> >
>>>>>> > So for two large streams, I think we're doing pretty well also!
>>>>>> >
>>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>>>>>> >
>>>>>> > This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>>>>>> >
>>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The pping stats start to show a bit of degradation in performance from pounding it so hard:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>>>>>> > {}]
>>>>>> >
>>>>>> > For whatever reason, it smoothed out over time:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>>>>>> > {}]
>>>>>> >
>>>>>> > Surprisingly (to me), I didn't encounter errors. Each client achieved 2.22 Gbit/s of throughput, transferring over 129 GBytes of data.
>>>>>> >
>>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>>>>>> >
>>>>>> > This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the flow tracking.
>>>>>> > (Shorter time window because I really wanted to go and find coffee.)
>>>>>> >
>>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there are always lots of samples in the buffer:
>>>>>> >
>>>>>> > [
>>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>>>>>> > {}]
>>>>>> >
>>>>>> > This test also ran better than I expected. Each VM showed around 2.4 Gbit/s in total throughput at the end of the iperf session. You can definitely see some latency creeping in as I make the system work hard - which is expected, but I'm not sure I expected quite that much.
>>>>>> >
>>>>>> > WHAT'S NEXT & CONCLUSION
>>>>>> >
>>>>>> > I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
>>>>>> >
>>>>>> > The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
>>>>>> >
>>>>>> > I also need to re-run these tests without the pping system to provide some before/after analysis.
>>>>>> >
>>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson wrote:
>>>>>> >>
>>>>>> >> It's probably not entirely thread-safe right now (I ran into some issues reading per_cpu maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working on single-stream testing. :-)
>>>>>> >>
>>>>>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
>>>>>> >>
>>>>>> >> [
>>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>>>>>> >> {}]
>>>>>> >>
>>>>>> >> (With the extra {} because I'm not tracking the tail and haven't done comma removal.) The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ringbuffer setup (so the newest will start to overwrite the oldest).
>>>>>> >>
>>>>>> >> I'll start trying to test on a larger scale now.
>>>>>> >>
>>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón wrote:
>>>>>> >>>
>>>>>> >>> Hey Herbert,
>>>>>> >>>
>>>>>> >>> Fantastic work! Super exciting to see this coming together, especially so quickly.
>>>>>> >>> I'll test it soon.
>>>>>> >>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. Like you said, merging them is sort of the only way right now to combine the functionality without a performance hit. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all the projects working independently but interoperably with a plugin system of some kind.
>>>>>> >>>
>>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
>>>>>> >>> I'm working to set up a physical lab to test high throughput and high client count scenarios.
>>>>>> >>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the bpf map size limitation you mentioned. Maybe in the coming months we can take a look at that.
>>>>>> >>>
>>>>>> >>> Anyway, great work on the cpumap-pping program! Excited to see more on this.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> Robert
>>>>>> >>>
>>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS wrote:
>>>>>> >>>>
>>>>>> >>>> Hey,
>>>>>> >>>>
>>>>>> >>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
>>>>>> >>>>
>>>>>> >>>> I mean it about the warnings; this isn't at all stable or debugged - and I can't promise that it won't unleash the nasal demons
>>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>>>>>> >>>>
>>>>>> >>>> With that said, I'm pretty happy so far:
>>>>>> >>>>
>>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both
>>>>>> >>>>   the inbound and outbound classifiers, since otherwise it would only see half the conversation.
>>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing
>>>>>> >>>>   that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
>>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
>>>>>> >>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
>>>>>> >>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.
>>>>>> >>>>
>>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
>>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10 gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>>>>>> >>>> test PC (the host is a Core i7, 12th gen, 12 cores - 64 GB RAM and fast SSDs).
>>>>>> >>>>
>>>>>> >>>> Output currently consists of debug messages reading:
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399222: bpf_trace_printk: (tc) Flow open event
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399466: bpf_trace_printk: (tc) Flow open event
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.434080: bpf_trace_printk: (tc) Send flow event
>>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.482714: bpf_trace_printk: (tc) Send flow event
>>>>>> >>>>
>>>>>> >>>> The times haven't been tweaked yet. The (5,1) is the tc handle major/minor, allocated by the xdp-cpumap parent.
>>>>>> >>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
>>>>>> >>>>
>>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>> Herbert
>>>>>> >>>>
>>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg wrote:
>>>>>> >>>>>
>>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick
>>>>>> >>>>> notes.
>>>>>> >>>>>
>>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>>>>>> >>>>> > [ Adding Simon to Cc ]
>>>>>> >>>>> >
>>>>>> >>>>> > Herbert Wolverson via LibreQoS writes:
>>>>>> >>>>> >
>>>>>> >>>>> > > Hey,
>>>>>> >>>>> > >
>>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>>>>>> >>>>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>>>>>> >>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>>>>>> >>>>> > >
>>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the entry point
>>>>>> >>>>> > > and packet parsing code to make use of the work already done in
>>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do
>>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had to pin them -
>>>>>> >>>>> > > otherwise the two tc instances don't properly share data.
>>>>>> >>>>> > >
>>>>>> >>>>>
>>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>>>>>> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow may
>>>>>> >>>>> be processed by different cores on ingress and egress, the per-CPU maps
>>>>>> >>>>> will not really work reliably, as each core will have a different view
>>>>>> >>>>> of the state of the flow, whether there's been a previous packet with a
>>>>>> >>>>> certain TSval from that flow, etc.
>>>>>> >>>>>
>>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on both
>>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful of
>>>>>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its own value. So
>>>>>> >>>>> all CPUs will then have their own data for each flow, but it's only the
>>>>>> >>>>> CPU processing the flow that will have any relevant data for the flow,
>>>>>> >>>>> while the remaining CPUs will just have an empty state for that flow.
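>>>>>> >>>>>
>>>>>> >>>>> To make the contrast concrete, here's a rough sketch of the two
>>>>>> >>>>> map flavors in libbpf's BTF map syntax (the names and field
>>>>>> >>>>> layout here are made up for illustration, not the actual ePPing
>>>>>> >>>>> or cpumap-pping structures):
>>>>>> >>>>>
>>>>>> >>>>> #include <linux/bpf.h>
>>>>>> >>>>> #include <bpf/bpf_helpers.h>
>>>>>> >>>>>
>>>>>> >>>>> struct flow_key {            /* illustrative flow identifier */
>>>>>> >>>>>         __u32 saddr, daddr;
>>>>>> >>>>>         __u16 sport, dport;
>>>>>> >>>>> };
>>>>>> >>>>>
>>>>>> >>>>> struct flow_state {          /* e.g. last-seen TSval + timestamp */
>>>>>> >>>>>         __u32 last_tsval;
>>>>>> >>>>>         __u64 last_seen_ns;
>>>>>> >>>>> };
>>>>>> >>>>>
>>>>>> >>>>> /* Per-CPU flavor: every CPU gets its own struct flow_state for
>>>>>> >>>>>  * each key, so with N CPUs you pay N times the value memory per
>>>>>> >>>>>  * flow, even though only one CPU ever touches a given flow. */
>>>>>> >>>>> struct {
>>>>>> >>>>>         __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
>>>>>> >>>>>         __uint(max_entries, 65536);
>>>>>> >>>>>         __type(key, struct flow_key);
>>>>>> >>>>>         __type(value, struct flow_state);
>>>>>> >>>>> } flow_state_percpu SEC(".maps");
>>>>>> >>>>>
>>>>>> >>>>> /* Global flavor: one struct flow_state per key, shared by all
>>>>>> >>>>>  * CPUs. Safe without locking only if each flow is always handled
>>>>>> >>>>>  * on one core; pinning lets both tc instances share the map. */
>>>>>> >>>>> struct {
>>>>>> >>>>>         __uint(type, BPF_MAP_TYPE_HASH);
>>>>>> >>>>>         __uint(max_entries, 65536);
>>>>>> >>>>>         __type(key, struct flow_key);
>>>>>> >>>>>         __type(value, struct flow_state);
>>>>>> >>>>>         __uint(pinning, LIBBPF_PIN_BY_NAME);
>>>>>> >>>>> } flow_state_global SEC(".maps");
>>>>>> >>>>>
>>>>>> >>>>> Note also that with the per-CPU flavor, userspace has to read and
>>>>>> >>>>> merge one value per possible CPU on every lookup, which adds to
>>>>>> >>>>> the polling cost.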
>>>>>> >>>>> Under the same assumption that packets within the same flow are always
>>>>>> >>>>> processed on the same core, there should generally not be any
>>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map either, as packets
>>>>>> >>>>> from the same flow cannot be processed concurrently then (and thus no
>>>>>> >>>>> concurrent access to the same value in the map). I am however still
>>>>>> >>>>> very unclear on whether there's any considerable performance difference between
>>>>>> >>>>> the global and per-CPU map versions if the same key is not accessed
>>>>>> >>>>> concurrently.
>>>>>> >>>>>
>>>>>> >>>>> > > Right now, output
>>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap output code. Instead,
>>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so I can see
>>>>>> >>>>> > > roughly what the output would look like.
>>>>>> >>>>> > >
>>>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on
>>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>>>>>> >>>>> >
>>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>>>>>> >>>>> > overhead, then. What Simon found was that sending the data from kernel
>>>>>> >>>>> > to userspace is one of the most expensive bits of epping, at least when
>>>>>> >>>>> > the number of data points goes up (which it does as additional flows are
>>>>>> >>>>> > added).
>>>>>> >>>>>
>>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may get
>>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>>>>>> >>>>> direct overhead from the tool itself, but also becomes demanding for
>>>>>> >>>>> whatever you use all those RTT samples for (i.e. the need to log, parse,
>>>>>> >>>>> analyze etc. a very large number of RTTs). One way to deal with that is
>>>>>> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit and
>>>>>> >>>>> -R/--rtt-rate options).
>>>>>> >>>>>
>>>>>> >>>>> > > So my question: how would you prefer to receive this data? I'll have to
>>>>>> >>>>> > > write a daemon that provides userspace control (periodic cleanup as well as
>>>>>> >>>>> > > reading the performance stream), so the world's kinda our oyster. I can
>>>>>> >>>>> > > stick to Kathie's original format (and dump it to a named pipe, perhaps?),
>>>>>> >>>>> > > a condensed format that only shows what you want to use, an efficient
>>>>>> >>>>> > > binary format if you feel like parsing that...
>>>>>> >>>>> >
>>>>>> >>>>> > It would be great if we could combine efforts a bit here so we don't
>>>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>>>>>> >>>>> > whatever daemon you end up writing can agree on data format etc. that
>>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>>>>>> >>>>> >
>>>>>> >>>>> > Briefly, what I've discussed before with Simon was to have the ability to
>>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>>>>>> >>>>> > utility periodically pull them out. What we discussed was doing this
>>>>>> >>>>> > using an LPM map (which is not in that PR yet). The idea would be that
>>>>>> >>>>> > userspace would populate the LPM map with the keys (prefixes) they
>>>>>> >>>>> > wanted statistics for (in LibreQOS context that could be one key per
>>>>>> >>>>> > customer, for instance).
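>>>>>> >>>>> > Roughly, the map could look something like this (just a sketch
>>>>>> >>>>> > of the idea, not code from the PR - the names and histogram
>>>>>> >>>>> > layout are invented; same includes as any libbpf program):
>>>>>> >>>>> >
>>>>>> >>>>> > struct ipv4_lpm_key {
>>>>>> >>>>> >         __u32 prefixlen;     /* LPM keys must start with the prefix length */
>>>>>> >>>>> >         __u32 addr;          /* customer prefix, network byte order */
>>>>>> >>>>> > };
>>>>>> >>>>> >
>>>>>> >>>>> > struct rtt_stats {
>>>>>> >>>>> >         __u64 count;         /* samples seen since last poll */
>>>>>> >>>>> >         __u64 total_rtt_ns;  /* for computing the mean in userspace */
>>>>>> >>>>> >         __u64 hist[32];      /* e.g. log2 buckets of RTT in usec */
>>>>>> >>>>> > };
>>>>>> >>>>> >
>>>>>> >>>>> > struct {
>>>>>> >>>>> >         __uint(type, BPF_MAP_TYPE_LPM_TRIE);
>>>>>> >>>>> >         __uint(map_flags, BPF_F_NO_PREALLOC); /* required for LPM tries */
>>>>>> >>>>> >         __uint(max_entries, 16384);           /* roughly one per customer */
>>>>>> >>>>> >         __type(key, struct ipv4_lpm_key);
>>>>>> >>>>> >         __type(value, struct rtt_stats);
>>>>>> >>>>> > } rtt_by_prefix SEC(".maps");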
>>>>>> >>>>> > Epping would then do a map lookup into the LPM,
>>>>>> >>>>> > and if it gets a match it would update the statistics in that map entry
>>>>>> >>>>> > (keeping a histogram of latency values seen, basically). Simon's PR
>>>>>> >>>>> > below uses this technique, where userspace will "reset" the histogram
>>>>>> >>>>> > every time it loads it by swapping out two different map entries when it
>>>>>> >>>>> > does a read; this allows you to control the sampling rate from
>>>>>> >>>>> > userspace, and you'll just get the data since the last time you polled.
>>>>>> >>>>>
>>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan going
>>>>>> >>>>> forward. I will just note that this PR (and all my other work with
>>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to finish up a
>>>>>> >>>>> paper.
>>>>>> >>>>>
>>>>>> >>>>> > I was thinking that if we all can agree on the map format, then your
>>>>>> >>>>> > polling daemon could be one userspace "client" for that, and the epping
>>>>>> >>>>> > binary itself could be another; but we could keep compatibility between
>>>>>> >>>>> > the two, so we don't duplicate effort.
>>>>>> >>>>> >
>>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be plugged
>>>>>> >>>>> > into the cpumap-tc code would be a good goal...
>>>>>> >>>>>
>>>>>> >>>>> Should probably do that... at some point. In general I think it's a bit
>>>>>> >>>>> of an interesting problem to think about how to chain multiple XDP/tc
>>>>>> >>>>> programs together in an efficient way. Most XDP and tc programs will do
>>>>>> >>>>> some amount of packet parsing, and when you have many chained programs
>>>>>> >>>>> parsing the same packets this obviously becomes a bit wasteful. At the
>>>>>> >>>>> same time it would be nice if one didn't need to manually merge
>>>>>> >>>>> multiple programs together into a single one like this to get rid of
>>>>>> >>>>> the duplicated parsing, or at least to make the process of merging those
>>>>>> >>>>> programs as simple as possible.
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> > -Toke
>>>>>> >>>>> >
>>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>>>>> >>>>>
>>>>>> >>>>> When you send an e-mail to Karlstad University, we will process your personal data.
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> Robert Chacón
>>>>>> >>> CEO | JackRabbit Wireless LLC
>>>
>>> --
>>> Robert Chacón
>>> CEO | JackRabbit Wireless LLC

--
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC