From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dave.taht@gmail.com>
Received: from mail-wr1-x430.google.com (mail-wr1-x430.google.com
 [IPv6:2a00:1450:4864:20::430])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by lists.bufferbloat.net (Postfix) with ESMTPS id 5CF693B29D
 for <libreqos@lists.bufferbloat.net>; Sat, 22 Oct 2022 10:44:32 -0400 (EDT)
Received: by mail-wr1-x430.google.com with SMTP id a10so8574703wrm.12
 for <libreqos@lists.bufferbloat.net>; Sat, 22 Oct 2022 07:44:32 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
 h=content-transfer-encoding:cc:to:subject:message-id:date:from
 :in-reply-to:references:mime-version:from:to:cc:subject:date
 :message-id:reply-to;
 bh=m7NiiXQXMAN/mRtHG8l5pKl5hRh2Azi+R+sKc0yi5QU=;
 b=qbxb4gReWHCOTql7owdkfziwLHuczsTLDx5L9UTWfG42qvIsMZBILAni0ZBirz7EFo
 Uod4znVF3K5T05gRugcVfvTMfz6Qg0prOLKzcbynbBH0oDoR2Tvn4+mYP/2k8ZW9LFLL
 GtIL703wbm/BvAq9Fba+87BgqgW/aPuM8kSc3AGW6W17AhOCdYwFA/MEot9KxBiZB5bc
 DlrN9Iun+zxqwY7rGXYNVfQPvnL6rHZNXQ7y/0r5duxEbCG6ykWlnwKjhKPINGidBBEP
 AnmkD9e9hENxTSCd+bINvSQ5PLbH9sF53etD109tm7HFXk7bHKkUiDNre2KR054PKWIT
 vHPg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=content-transfer-encoding:cc:to:subject:message-id:date:from
 :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=m7NiiXQXMAN/mRtHG8l5pKl5hRh2Azi+R+sKc0yi5QU=;
 b=O6XL/tRHpxAeyOBdmBMR4hnqLuY5Y9ji5YAXcA9YIK42weGM/n3n+c9FHkzT7j9pWj
 NJDt/bZ7HQ6twSUqKooWbW82biTnCWPryT2gpy8QOyWWCFDx5Gos/gfka4QBRlO4Ul8u
 x7HI66XuVucVD5KN8LyX924HqfD58oylp88+uKVBNThv5PTjov2GBDD26ZNT5IWGJVt6
 2ZhTqlInLT5ogkH4aZVIYw4M8LCZ2cmG25uMJVTaSfc4n3kIHvcmKBknJNuwrjJOtQIN
 BdJkfuj3hzyNwKLjao+rlQRFmd4fQ7FqolDuk73KZjYscomqeluIRoVMAgBF2K/rMK+x
 s01Q==
X-Gm-Message-State: ACrzQf18DTKKY+Nk/syDH7esJnHrM+Mps/hv9V7NVMVANM7SgTzROC0o
 SQZ/DyFhTAOBN1cp2IIX7bk+6a1bMT5uo/FmQ4FHSETc7AE=
X-Google-Smtp-Source: AMsMyM7CBhJMB35wW10GO+2UykylLiS5HY7pFkY/4Wash6Ed8FiIw7GelvQ6kItGyL8CWHPTndxy2AaVjqqZcOJWJ2k=
X-Received: by 2002:a5d:5109:0:b0:22f:ed4:65da with SMTP id
 s9-20020a5d5109000000b0022f0ed465damr14951250wrt.688.1666449870536; Sat, 22
 Oct 2022 07:44:30 -0700 (PDT)
MIME-Version: 1.0
References: <CA+erpM5CNocpTnxNpTyEifaLv2P-ZbRXASUxS7iYr8LgCRgRNA@mail.gmail.com>
 <87bkqatu61.fsf@toke.dk>
 <759c25c6fd54dceccc00eada5ccf5358d2d1c20c.camel@kau.se>
 <CA+erpM4RN0QKfq4E2PXfPjz9P3cq_6fMJL1_GKLt_SQ3GGXPFw@mail.gmail.com>
 <CAOZyJosbcuHO0SYqraSAddJw_9FYtp_qUCh65Y9Vo6MHOeG5_g@mail.gmail.com>
 <CA+erpM4DUNpjVawVAdNqyLjDJwkLMc8jqrqng4nzSytai-1ngA@mail.gmail.com>
 <CA+erpM7hSYN2fY_Hqjgj5GJ18o3344j-Kz-aBtMdtSjRYfA2Yg@mail.gmail.com>
 <CAA93jw6EkMjMjGmHwiBYxG8jqJJhnwb7cUguJ5N4vntpMDnVbw@mail.gmail.com>
 <CA+erpM5UDLr-VrwYJ3h2fciuc1CQsWs2WZOjD2QzFWNN7xcvbQ@mail.gmail.com>
 <CA+erpM4mA6jJmVjnPD0DNQ7QrZRQOUzJ0zcN4_BWsoF4ZZwVxQ@mail.gmail.com>
 <CAOZyJoudaVbBp_yEPFrwqXvfo02=rArHr4zmPu5S8DeF-AgChw@mail.gmail.com>
 <CAA_JP8Wiu3qFALNQNTc-WAwfCcJxZ-9C6vn1CXzTbVqE-seD1Q@mail.gmail.com>
 <CA+erpM4ZWJ9ss-5gy1edc0TCROV_Qr3e4MZ-qcxAT2tYpXRzLQ@mail.gmail.com>
 <CAA93jw7_t8+-WUU1rs-s7yXD4x=qZBbJ=ROjS9NRJ_Yc9eZZ8w@mail.gmail.com>
 <CAA93jw7QaYag4YvK=cUei5xPpwxzfOZ1vR93QW1D8Z3w53v40A@mail.gmail.com>
 <CA+erpM7629_y8tD-0shzNjtiN-m8_W0RiOF3yba000Y0NCu8iw@mail.gmail.com>
In-Reply-To: <CA+erpM7629_y8tD-0shzNjtiN-m8_W0RiOF3yba000Y0NCu8iw@mail.gmail.com>
From: Dave Taht <dave.taht@gmail.com>
Date: Sat, 22 Oct 2022 07:44:16 -0700
Message-ID: <CAA93jw6-RqzjFq5xi0KnxDwxte-f91_KTRjiWQsoCKOFDtuBhw@mail.gmail.com>
To: Herbert Wolverson <herberticus@gmail.com>
Cc: "libreqos@lists.bufferbloat.net" <libreqos@lists.bufferbloat.net>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [LibreQoS] In BPF pping - so far
X-BeenThere: libreqos@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Many ISPs need the kinds of quality shaping cake can do
 <libreqos.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/libreqos>,
 <mailto:libreqos-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/libreqos>
List-Post: <mailto:libreqos@lists.bufferbloat.net>
List-Help: <mailto:libreqos-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/libreqos>,
 <mailto:libreqos-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Sat, 22 Oct 2022 14:44:32 -0000

On Sat, Oct 22, 2022 at 7:32 AM Herbert Wolverson via LibreQoS
<libreqos@lists.bufferbloat.net> wrote:
>
> This morning I tested cpu-pping with live customers!
> A little over 1,200 mapped IP addresses, about 600 mbps of real traffic f=
lowing through a big
> hierarchy of 52 sites. (600 is our "quiet time" traffic)
>
> It started very well: the updated xdp-cpumap system dropped in place and =
the system worked as
> before. xdp_pping started to show data with correct mappings. CPU load fr=
om the mapping
> system is within 1% of where it was before.
>
> After about 20 minutes of continuous execution, it started to run into so=
me scaling issues.
> The shaping system continued to run wonderfully, and CPU load was still f=
ine. However,
> it stopped reporting latency data! A bit of debugging showed that once yo=
u exceed
> 16,384 in-flight TCP streams it isn't handling the "map full" situation g=
racefully - and
> clearing the map from userspace isn't working correctly. So I hacked away=
 and hacked
> away.
>
> Anyway, it turns out that it does in fact work fine at that scale. There'=
s just a one-line
> bug in the xdp_pping.c file. I forgot to actually *call* one line of pack=
et cleanup code.
> Adding that, and everything was awesome.
>
> The entire patch that fixed it consists of adding one line:
> cleanup_packet_ts(packet_ts);
>
> Oops.

:woot:
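
For anyone reading the archive later, here is a toy userspace model of
that failure mode. It is a sketch, not the xdp_pping.c code: TsTable,
simulate and the dict layout are made up for illustration, and only the
16,384 limit comes from Herbert's note. It assumes a full map simply
refuses new entries.

```python
CAPACITY = 16384  # the in-flight limit Herbert mentions


class TsTable:
    """Toy model of a fixed-size packet timestamp map."""

    def __init__(self, capacity, cleanup):
        self.capacity = capacity
        self.cleanup = cleanup  # False models the forgotten call
        self.entries = {}       # (flow, tsval) -> send time in ns
        self.refused = 0        # inserts rejected because map is full

    def on_send(self, flow, tsval, now_ns):
        key = (flow, tsval)
        if key in self.entries:
            return True
        if len(self.entries) >= self.capacity:
            self.refused += 1
            return False
        self.entries[key] = now_ns
        return True

    def on_echo(self, flow, tsecr, now_ns):
        sent = self.entries.get((flow, tsecr))
        if sent is None:
            return None  # nothing to match: no RTT sample
        if self.cleanup:
            del self.entries[(flow, tsecr)]
        return now_ns - sent


def simulate(cleanup, packets):
    """Send then echo `packets` timestamps; return (samples, refused)."""
    table = TsTable(CAPACITY, cleanup)
    samples = 0
    for i in range(packets):
        table.on_send(i % 64, i, i * 1000)
        if table.on_echo(i % 64, i, i * 1000 + 4000) is not None:
            samples += 1
    return samples, table.refused
```

With cleanup enabled, simulate(True, 20000) yields a sample for every
packet. With it disabled, the table fills at 16,384 entries and the
samples stop, while nothing else in the model notices.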

>
> Anyway, with that in place it's running superbly. I did identify a couple=
 of places in
> which it's being overly verbose with debug information, so I've patched t=
hat also.

> After reducing the overly eager warning about not being able to read a TC=
P header,
> CPU performance improved by another 2% on average.

I note I am VERY interested, science-wise, in the actual incidence
of malformed TCP headers, mistaken or incorrect ECN-related congestion
responses, etc.

This is stuff endpoints just arbitrarily drop on the floor. Always
interested in the anomalies.

https://www.youtube.com/watch?v=3DK1jasTyGLr8
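
To make "anomalies" a bit more concrete, something like the following
could tally them from raw TCP segments. It is only a sketch: the labels,
classify_tcp, tally and build_header are hypothetical names, the checks
are a small hand-picked set, and "odd-ecn-syn" just flags a SYN with
exactly one of ECE/CWR set, which is unusual rather than provably wrong.

```python
import struct
from collections import Counter

FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10
ECE, CWR = 0x40, 0x80


def build_header(flags, offset_words):
    """Pack a bare 20-byte TCP header (test helper, made-up values)."""
    return struct.pack("!HHIIBBHHH", 40000, 443, 1, 0,
                       offset_words << 4, flags, 65535, 0, 0)


def classify_tcp(segment):
    """Return a list of anomaly labels for one raw TCP segment."""
    if len(segment) < 20:
        return ["truncated-header"]
    fields = struct.unpack("!HHIIBBHHH", segment[:20])
    off_byte, flags = fields[4], fields[5]
    issues = []
    header_len = (off_byte >> 4) * 4
    if header_len < 20:
        issues.append("data-offset-too-small")
    elif header_len > len(segment):
        issues.append("data-offset-past-end")
    if flags & SYN and flags & FIN:
        issues.append("syn-fin")
    if flags & SYN and flags & RST:
        issues.append("syn-rst")
    # RFC 3168: an ECN-setup SYN sets both ECE and CWR.
    if flags & SYN and not flags & ACK:
        ecn_bits = flags & (ECE | CWR)
        if ecn_bits in (ECE, CWR):
            issues.append("odd-ecn-syn")
    return issues


def tally(segments):
    """Count labels over many segments; clean ones count as "ok"."""
    counts = Counter()
    for seg in segments:
        for label in classify_tcp(seg) or ["ok"]:
            counts[label] += 1
    return counts
```

Feeding tally() a capture's worth of segments gives a per-label count
that can be compared between sites or over time.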

>
> Longer-term (i.e. not on a Saturday morning, when I'd rather be playing w=
ith my
> daughter!), I think I'll look at raising some of the buffer sizes.
>
> Thanks,
> Herbert
>
> On Wed, Oct 19, 2022 at 11:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>>
>> PS - today's (free) p99 conference is *REALLY AWESOME*. https://www.p99c=
onf.io/
>>
>> On Wed, Oct 19, 2022 at 9:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>> >
>> > flent outputs a flent.gz file that I can parse and plot 20 different
>> > ways. Also the graphing tools work on osx
>> >
>> > On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS
>> > <libreqos@lists.bufferbloat.net> wrote:
>> > >
>> > > That's true. The 12th gen does seem to have some "special" features.=
.. makes for a nice writing platform
>> > > (this box is primarily my "write books and articles" machine). I'll =
be doing a wider test on a more normal
>> > > platform, probably at the weekend (with real traffic, hence the dela=
y - have to find a time in which I
>> > > minimize disruption)
>> > >
>> > > On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:
>> > >>
>> > >> Those 'efficiency' threads in Intel 12th gen should probably be add=
ressed as well.  You can't turn them off in BIOS.
>> > >>
>> > >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chac=C3=B3n via LibreQoS <li=
breqos@lists.bufferbloat.net> wrote:
>> > >>>
>> > >>> Awesome work on this!
>> > >>> I suspect there should be a slight performance bump once Hyperthre=
ading is disabled and efficient power management is off.
>> > >>> Hyperthreading/SMT always messes with HTB performance when I leave=
 it on. Thank you for mentioning that - I now went ahead and added instruct=
ions on disabling hyperthreading on the Wiki for new users.
>> > >>> Super promising results!
>> > >>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-=
pping. So far in your VM setup it seems to be doing very well.
>> > >>>
>> > >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <li=
breqos@lists.bufferbloat.net> wrote:
>> > >>>>
>> > >>>> Also, I forgot to mention that I *think* the current version has =
removed the requirement that the inbound
>> > >>>> and outbound classifiers be placed on the same CPU. I know interd=
uo was particularly keen on packing
>> > >>>> upload into fewer cores. I'll add that to my list of things to te=
st.
>> > >>>>
>> > >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gm=
ail.com> wrote:
>> > >>>>>
>> > >>>>> I'll definitely take a look - that does look interesting. I don'=
t have X11 on any of my test VMs, but
>> > >>>>> it looks like it can work without the GUI.
>> > >>>>>
>> > >>>>> Thanks!
>> > >>>>>
>> > >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> =
wrote:
>> > >>>>>>
>> > >>>>>> could I coax you to adopt flent?
>> > >>>>>>
>> > >>>>>> apt-get install flent netperf irtt fping
>> > >>>>>>
>> > >>>>>> You sometimes have to compile netperf yourself with --enable-de=
mo on
>> > >>>>>> some systems.
>> > >>>>>> There are a bunch of python libs needed for the gui, but only =
on the client.
>> > >>>>>>
>> > >>>>>> Then you can run a really gnarly test series and plot the resul=
ts over time.
>> > >>>>>>
>> > >>>>>> flent --socket-stats --step-size=3D.05 -t 'the-test-conditions'=
 -H
>> > >>>>>> the_server_name rrul # 110 other tests
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>> > >>>>>> <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >
>> > >>>>>> > Hey,
>> > >>>>>> >
>> > >>>>>> > Testing the current version ( https://github.com/thebracket/c=
pumap-pping-hackjob ), it's doing better than I hoped. This build has share=
d (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and res=
et stats.
>> > >>>>>> >
>> > >>>>>> > My testing environment has grown a bit:
>> > >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new=
 cpumap-pping-hackjob version of xdp-cpumap.
>> > >>>>>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts =
an iperf server.
>> > >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64=
.1.2. Hosts iperf client.
>> > >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64=
.1.3. Hosts iperf client.
>> > >>>>>> >
>> > >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of Shap=
erVM are on a virtual switch.
>> > >>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are =
on a different virtual switch.
>> > >>>>>> >
>> > >>>>>> > These are all on a host machine running Windows 11, a core i7=
 12th gen, 32 Gb RAM and fast SSD setup.
>> > >>>>>> >
>> > >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5gbit/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to ab=
out 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> > * Set to use Cake
>> > >>>>>> >
>> > >>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1=
.1 -t 500 (for a long run). Running xdp_pping yields correct results:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11}=
,
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11}=
,
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Or when I waited a while to gather/reset:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60}=
,
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60}=
,
>> > >>>>>> > {}]
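
That output is valid JSON once you skip the empty trailing object, so
consuming it is short. A minimal sketch, assuming the format stays as
pasted here; parse_pping and worst_max are names made up for this
example, and no unit is assumed for the numbers:

```python
import json

SAMPLE = """[
{"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
{"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
{}]"""


def parse_pping(text):
    """Map tc handle -> stats, skipping the empty trailing object."""
    return {row["tc"]: row for row in json.loads(text) if row}


def worst_max(stats):
    """Return the tc handle with the highest "max" value."""
    return max(stats, key=lambda tc: stats[tc]["max"])


stats = parse_pping(SAMPLE)
print(worst_max(stats), stats[worst_max(stats)]["max"])
```

worst_max() raises ValueError on an empty result, so a caller polling
an idle box would want to check for that first.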
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows no errors, just periodic logging that it i=
s recording data.  CPU is about 2-3% on two CPUs, zero on the others (as ex=
pected).
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, each client reported=
 a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>> > >>>>>> >
>> > >>>>>> > So for smaller streams, I'd call this a success.
>> > >>>>>> >
>> > >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5gbit/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5G=
bit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> >
>> > >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same ti=
me.
>> > >>>>>> >
>> > >>>>>> > xdp_pping shows results, too:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58}=
,
>> > >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58=
},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13}=
,
>> > >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13=
},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent=
.
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, the clients reported=
 throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes)=
 respectively.
>> > >>>>>> >
>> > >>>>>> > Maxing out HyperV like this is inducing a bit of latency (whi=
ch is to be expected), but it's not bad. I also forgot to disable hyperthre=
ading, and looking at the host performance it is sometimes running the seco=
nd virtual CPU on an underpowered "fake" CPU.
>> > >>>>>> >
>> > >>>>>> > So for two large streams, I think we're doing pretty well als=
o!
>> > >>>>>> >
>> > >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>> > >>>>>> >
>> > >>>>>> > This test is designed to try and blow things up. It's the sam=
e as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:=
5 and 1:6.
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were =
idle. The pping stats start to show a bit of degradation in performance for=
 pounding it so hard:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 2=
4},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 2=
4},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > For whatever reason, it smoothed out over time:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 5=
0},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 5=
0},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client =
received 2.22 Gbit/s performance, over 129 Gbytes of data.
>> > >>>>>> >
>> > >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>> > >>>>>> >
>> > >>>>>> > This test is also designed to break things. Same as test 3, b=
ut using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and reall=
y tax the flow tracking. (Shorter time window because I really wanted to go=
 and find coffee)
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping=
 results show that this torture test is worsening performance, and there's =
always lots of samples in the buffer:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : =
49},
>> > >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : =
49},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > This test also ran better than I expected. You can definitely=
 see some latency creeping in as I make the system work hard. Each VM showe=
d around 2.4 Gbit/s in total performance at the end of the iperf session. T=
here's definitely some latency creeping in, which is expected - but I'm not=
 sure I expected quite that much.
>> > >>>>>> >
>> > >>>>>> > WHAT'S NEXT & CONCLUSION
>> > >>>>>> >
>> > >>>>>> > I noticed that I forgot to turn off efficient power managemen=
t on my VMs and host, and left Hyperthreading on by mistake. So that hurts =
overall performance.
>> > >>>>>> >
>> > >>>>>> > The base system seems to be working pretty solidly, at least =
for small tests. Next up, I'll be removing extraneous debug reporting code,=
 removing some code paths that don't do anything but report, and looking =
for any small optimization opportunities. I'll then re-run these tests. =
Once that's done, I hope to find a maintenance window on my WISP and try =
it with actual traffic.
>> > >>>>>> >
>> > >>>>>> > I also need to re-run these tests without the pping system to=
 provide some before/after analysis.
>> > >>>>>> >
>> > >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticu=
s@gmail.com> wrote:
>> > >>>>>> >>
>> > >>>>>> >> It's probably not entirely thread-safe right now (ran into s=
ome issues reading per_cpu maps back from userspace; hopefully, I'll get th=
at figured out) - but the commits I just pushed have it basically working o=
n single-stream testing. :-)
>> > >>>>>> >>
>> > >>>>>> >> Setup cpumap as usual, and periodically run xdp-pping. This =
gives you per-connection RTT information in JSON:
>> > >>>>>> >>
>> > >>>>>> >> [
>> > >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1}=
,
>> > >>>>>> >> {}]
>> > >>>>>> >>
>> > >>>>>> >> (With the extra {} because I'm not tracking the tail and hav=
en't done comma removal). The tool also empties the various maps used to ga=
ther data, acting as a "reset" point. There's a max of 60 samples per queue=
, in a ringbuffer setup (so newest will start to overwrite the oldest).
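
The "max of 60 samples, newest overwrites oldest, reset on read"
behaviour is easy to model in userspace. This is a sketch of the idea
only, not how the BPF side stores anything: QueueRtt is a made-up name,
and the integer average is an assumption based on the whole numbers in
the JSON.

```python
from collections import deque

MAX_SAMPLES = 60  # per-queue limit from Herbert's description


class QueueRtt:
    """Newest-wins sample buffer for one tc handle."""

    def __init__(self, tc):
        self.tc = tc
        # deque(iterable, maxlen): old samples fall off the front
        self.samples = deque([], MAX_SAMPLES)

    def add(self, rtt):
        self.samples.append(rtt)

    def report_and_reset(self):
        """Summarise like the JSON rows, then empty the buffer."""
        if not self.samples:
            return None
        row = {
            "tc": self.tc,
            "avg": sum(self.samples) // len(self.samples),
            "min": min(self.samples),
            "max": max(self.samples),
            "samples": len(self.samples),
        }
        self.samples.clear()
        return row
```

Adding 100 samples and then reading reports only the newest 60, and a
second read right afterwards returns None because the buffer was reset.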
>> > >>>>>> >>
>> > >>>>>> >> I'll start trying to test on a larger scale now.
>> > >>>>>> >>
>> > >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chac=C3=B3n <robert.c=
hacon@jackrabbitwireless.com> wrote:
>> > >>>>>> >>>
>> > >>>>>> >>> Hey Herbert,
>> > >>>>>> >>>
>> > >>>>>> >>> Fantastic work! Super exciting to see this coming together,=
 especially so quickly.
>> > >>>>>> >>> I'll test it soon.
>> > >>>>>> >>> I understand and agree with your decision to omit certain f=
eatures (ICMP tracking,DNS tracking, etc) to optimize performance for our u=
se case. Like you said, in order to merge the functionality without a perfo=
rmance hit, merging them is sort of the only way right now. Otherwise there=
 would be a lot of redundancy and lost throughput for an ISP's use. Though =
hopefully long term there will be a way to keep all projects working indepe=
ndently but interoperably with a plugin system of some kind.
>> > >>>>>> >>>
>> > >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focus=
ing on optimizations for high sub counts (8000+ subs) as well as stateful c=
hanges to the queue structure.
>> > >>>>>> >>> I'm working to set up a physical lab to test high throughpu=
t and high client count scenarios.
>> > >>>>>> >>> When testing beyond ~32,000 filters we get "no space left o=
n device" from xdp-cpumap-tc, which I think relates to the bpf map size lim=
itation you mentioned. Maybe in the coming months we can take a look at tha=
t.
>> > >>>>>> >>>
>> > >>>>>> >>> Anyway great work on the cpumap-pping program! Excited to s=
ee more on this.
>> > >>>>>> >>>
>> > >>>>>> >>> Thanks,
>> > >>>>>> >>> Robert
>> > >>>>>> >>>
>> > >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via Libr=
eQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >>>>
>> > >>>>>> >>>> Hey,
>> > >>>>>> >>>>
>> > >>>>>> >>>> My current (unfinished) progress on this is now available =
here: https://github.com/thebracket/cpumap-pping-hackjob
>> > >>>>>> >>>>
>> > >>>>>> >>>> I mean it about the warnings, this isn't at all stable, de=
bugged - and can't promise that it won't unleash the nasal demons
>> > >>>>>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-=
)
>> > >>>>>> >>>>
>> > >>>>>> >>>> With that said, I'm pretty happy so far:
>> > >>>>>> >>>>
>> > >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has=
 nicely shunted onto a dedicated CPU. It has to run on both
>> > >>>>>> >>>>   the inbound and outbound classifiers, since otherwise it=
 would only see half the conversation.
>> > >>>>>> >>>> * It does assume that your ingress and egress CPUs are map=
ped to the same interface; I do that anyway in BracketQoS. Not doing
>> > >>>>>> >>>>   that opens up a potential world of pain, since writes to=
 the shared maps would require a locking scheme. Too much locking, and you =
lose all of the benefit of using multiple CPUs to begin with.
>> > >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper sys=
tems I've worked with have lots of it.
>> > >>>>>> >>>> * I've been gradually removing features that I don't want =
for BracketQoS. A hypothetical future "useful to everyone" version wouldn't=
 do that.
>> > >>>>>> >>>> * Rate limiting is working, but I removed the requirement =
for a shared configuration provided from userland - so right now it's alway=
s set to report at 1 second intervals per stream.
>> > >>>>>> >>>>
>> > >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" =
and "world", and a "shaper" VM in between running a slightly hacked-up Libr=
eQoS.
>> > >>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10=
gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on m=
y
>> > >>>>>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb =
RAM and fast SSDs)
>> > >>>>>> >>>>
>> > >>>>>> >>>> Output currently consists of debug messages reading:
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_tra=
ce_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 374696
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_tra=
ce_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 247069
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 5217155
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 4515394
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 4481289
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 4255268
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 5249493
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 3795993
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 3949519
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 4365335
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 4154910
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_tra=
ce_printk: (tc) Send performance event (5,1), 4405582
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_tra=
ce_printk: (tc) Send flow event
>> > >>>>>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_tra=
ce_printk: (tc) Send flow event
>> > >>>>>> >>>>
>> > >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle=
 major/minor, allocated by the xdp-cpumap parent.
>> > >>>>>> >>>> I get pretty low latency between VMs; I'll set up a test w=
ith some real-world data very soon.
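
Until the real output path exists, those debug lines can be scraped. A
rough sketch, assuming the line layout stays exactly as pasted above
(it is debug output, so it may well change); parse_perf is a made-up
name and the trailing number is passed through without assuming a unit.

```python
import re

# One decoded line from the debug output quoted above.
LINE = ("cpumap/0/map:4-1371    [000] D..2.   516.405151: "
        "bpf_trace_printk: (tc) Send performance event (5,1), 5217155")

PERF = re.compile(
    r"\s(?P<ts>\d+\.\d+): bpf_trace_printk: \(tc\) "
    r"Send performance event \((?P<major>\d+),(?P<minor>\d+)\), "
    r"(?P<value>\d+)$"
)


def parse_perf(line):
    """Return (timestamp, "major:minor", value) or None."""
    m = PERF.search(line.strip())
    if m is None:
        return None
    handle = "%s:%s" % (m["major"], m["minor"])
    return float(m["ts"]), handle, int(m["value"])


print(parse_perf(LINE))
```

Lines such as "Flow open event" do not match and come back as None, so
they can be filtered out while reading the trace pipe.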
>> > >>>>>> >>>>
>> > >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>> > >>>>>> >>>>
>> > >>>>>> >>>> Thanks,
>> > >>>>>> >>>> Herbert
>> > >>>>>> >>>>
>> > >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sun=
dberg@kau.se> wrote:
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a coup=
le of quick
>> > >>>>>> >>>>> notes.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke H=C3=B8iland-J=C3=
=B8rgensen wrote:
>> > >>>>>> >>>>> > [ Adding Simon to Cc ]
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbl=
oat.net> writes:
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > > Hey,
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I've had some pretty good success with merging xdp-pp=
ing (
>> > >>>>>> >>>>> > > https://github.com/xdp-project/bpf-examples/blob/mast=
er/pping/pping.h )
>> > >>>>>> >>>>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/x=
dp-cpumap-tc ).
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then ch=
anged the entry point
>> > >>>>>> >>>>> > > and packet parsing code to make use of the work alrea=
dy done in
>> > >>>>>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the=
 packet, no need to do
>> > >>>>>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, =
and had to pin them -
>> > >>>>>> >>>>> > > otherwise the two tc instances don't properly share d=
ata.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is p=
rocessed on
>> > >>>>>> >>>>> the same CPU core at both ingress and egress. Otherwise, =
if a flow may
>> > >>>>>> >>>>> be processed by different cores on ingress and egress the=
 per-CPU maps
>> > >>>>>> >>>>> will not really work reliably as each core will have a di=
fferent view
>> > >>>>>> >>>>> on the state of the flow, if there's been a previous pack=
et with a
>> > >>>>>> >>>>> certain TSval from that flow etc.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Furthermore, if a flow is always processed on the same co=
re (on both
>> > >>>>>> >>>>> ingress and egress) I think per-CPU maps may be a bit was=
teful on
>> > >>>>>> >>>>> memory. From my understanding the keys for per-CPU maps a=
re still
>> > >>>>>> >>>>> shared across all CPUs, it's just that each CPU gets its =
own value. So
>> > >>>>>> >>>>> all CPUs will then have their own data for each flow, but=
 it's only the
>> > >>>>>> >>>>> CPU processing the flow that will have any relevant data =
for the flow
>> > >>>>>> >>>>> while the remaining CPUs will just have an empty state fo=
r that flow.
>> > >>>>>> >>>>> Under the same assumption that packets within the same fl=
ow are always
>> > >>>>>> >>>>> processed on the same core there should generally not be =
any
>> > >>>>>> >>>>> concurrency issues with having a global (non-per-CPU) map=
 either as packets
>> > >>>>>> >>>>> from the same flow cannot be processed concurrently then =
(and thus no
>> > >>>>>> >>>>> concurrent access to the same value in the map). I am how=
ever still
>> > >>>>>> >>>>> very unclear on if there's any considerable performance i=
mpact between
>> > >>>>>> >>>>> global and per-CPU map versions if the same key is not ac=
cessed
>> > >>>>>> >>>>> concurrently.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > > Right now, output
>> > >>>>>> >>>>> > > is just stubbed - I've still got to port the perfmap =
output code. Instead,
>> > >>>>>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug=
 pipe, so I can see
>> > >>>>>> >>>>> > > roughly what the output would look like.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > With debug enabled and just logging I'm now getting a=
bout 4.9 Gbits/sec on
>> > >>>>>> >>>>> > > single-stream iperf between two VMs (with a shaper VM=
 in the middle). :-)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest s=
ource of
>> > >>>>>> >>>>> > overhead, then. What Simon found was that sending the d=
ata from kernel
>> > >>>>>> >>>>> > to userspace is one of the most expensive bits of eppin=
g, at least when
>> > >>>>>> >>>>> > the number of data points goes up (which it does as add=
itional flows are
>> > >>>>>> >>>>> > added).
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them=
 (you may get
>> > >>>>>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic =
in terms of
>> > >>>>>> >>>>> direct overhead from the tool itself, but also becomes de=
manding for
>> > >>>>>> >>>>> whatever you use all those RTT samples for (i.e. need to =
log, parse,
>> > >>>>>> >>>>> analyze etc. a very large amount of RTTs). One way to dea=
l with that is
>> > >>>>>> >>>>> of course to just apply some sort of sampling (the -r/--r=
ate-limit and
>> > >>>>>> >>>>> -R/--rtt-rate
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > > So my question: how would you prefer to receive this =
data? I'll have to
>> > >>>>>> >>>>> > > write a daemon that provides userspace control (perio=
dic cleanup as well as
>> > >>>>>> >>>>> > > reading the performance stream), so the world's kinda=
 our oyster. I can
>> > >>>>>> >>>>> > > stick to Kathie's original format (and dump it to a n=
amed pipe, perhaps?),
>> > >>>>>> >>>>> > > a condensed format that only shows what you want to u=
se, an efficient
>> > >>>>>> >>>>> > > binary format if you feel like parsing that...
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > It would be great if we could combine efforts a bit her=
e so we don't
>> > >>>>>> >>>>> > fork the codebase more than we have to. I.e., if "upstr=
eam" epping and
>> > >>>>>> >>>>> > whatever daemon you end up writing can agree on data fo=
rmat etc that
>> > >>>>>> >>>>> > would be fantastic! Added Simon to Cc to facilitate thi=
s :)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Briefly what I've discussed before with Simon was to ha=
ve the ability to
>> > >>>>>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and ha=
ve a userspace
>> > >>>>>> >>>>> > utility periodically pull them out. What we discussed w=
as doing this
>> > >>>>>> >>>>> > using an LPM map (which is not in that PR yet). The ide=
a would be that
>> > >>>>>> >>>>> > userspace would populate the LPM map with the keys (pre=
fixes) they
>> > >>>>>> >>>>> > wanted statistics for (in LibreQOS context that could b=
e one key per
>> > >>>>>> >>>>> > customer, for instance). Epping would then do a map loo=
kup into the LPM,
>> > >>>>>> >>>>> > and if it gets a match it would update the statistics i=
n that map entry
>> > >>>>>> >>>>> > (keeping a histogram of latency values seen, basically)=
. Simon's PR
>> > >>>>>> >>>>> > below uses this technique where userspace will "reset" =
the histogram
>> > >>>>>> >>>>> > every time it loads it by swapping out two different ma=
p entries when it
>> > >>>>>> >>>>> > does a read; this allows you to control the sampling ra=
te from
>> > >>>>>> >>>>> > userspace, and you'll just get the data since the last =
time you polled.
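
The scheme Toke describes (userspace picks the prefixes, the data path
bumps a histogram on a match, a read swaps in a fresh one) can be
modelled in a few lines. This is a userspace sketch of the idea, not
the code in the PR: RttAggregator is a made-up name, the linear scan
stands in for a real LPM lookup, and the bucket width and count are
arbitrary assumptions.

```python
import ipaddress


class RttAggregator:
    """Per-prefix RTT histograms with swap-on-read."""

    def __init__(self, prefixes, bucket_ms, n_buckets):
        nets = [ipaddress.ip_network(p) for p in prefixes]
        # Longest prefix first, so the first hit is the best match.
        self.nets = sorted(nets, key=lambda n: -n.prefixlen)
        self.bucket_ms = bucket_ms
        self.n_buckets = n_buckets
        self.active = self._empty()

    def _empty(self):
        return {net: [0] * self.n_buckets for net in self.nets}

    def record(self, ip, rtt_ms):
        """Count one RTT sample; ignore addresses nobody asked for."""
        addr = ipaddress.ip_address(ip)
        for net in self.nets:
            if addr in net:
                slot = min(int(rtt_ms // self.bucket_ms),
                           self.n_buckets - 1)
                self.active[net][slot] += 1
                return True
        return False

    def read_and_reset(self):
        """Swap in fresh histograms and hand back the old ones."""
        old, self.active = self.active, self._empty()
        return {str(net): hist for net, hist in old.items()}
```

Because the reader controls when the swap happens, the polling interval
is the sampling interval, and each read only covers data since the
previous one.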
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Thanks Toke for summarizing both the current state and th=
e plan going
>> > >>>>>> >>>>> forward. I will just note that this PR (and all my other =
work with
>> > >>>>>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be=
 more or less
>> > >>>>>> >>>>> on hold for a couple of weeks right now as I'm trying to =
finish up a
>> > >>>>>> >>>>> paper.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > I was thinking that if we all can agree on the map form=
at, then your
>> > >>>>>> >>>>> > polling daemon could be one userspace "client" for that=
, and the epping
>> > >>>>>> >>>>> > binary itself could be another; but we could keep compa=
tibility between
>> > >>>>>> >>>>> > the two, so we don't duplicate effort.
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it =
can be plugged
>> > >>>>>> >>>>> > into the cpumap-tc code would be a good goal...
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Should probably do that...at some point. In general I thi=
nk it's a bit
>> > >>>>>> >>>>> of an interesting problem to think about how to chain mul=
tiple XDP/tc
>> > >>>>>> >>>>> programs together in an efficient way. Most XDP and tc pr=
ograms will do
>> > >>>>>> >>>>> some amount of packet parsing and when you have many chai=
ned programs
>> > >>>>>> >>>>> parsing the same packets this obviously becomes a bit was=
teful. At the
>> > >>>>>> >>>>> same time it would be nice if one didn't need to manually=
 merge
>> > >>>>>> >>>>> multiple programs together into a single one like this to=
 get rid of
>> > >>>>>> >>>>> this duplicated parsing, or at least make that process of=
 merging those
>> > >>>>>> >>>>> programs as simple as possible.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > -Toke
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> N=C3=A4r du skickar e-post till Karlstads universitet beh=
andlar vi dina personuppgifter<https://www.kau.se/gdpr>.
>> > >>>>>> >>>>> When you send an e-mail to Karlstad University, we will p=
rocess your personal data<https://www.kau.se/en/gdpr>.
>> > >>>>>> >>>>
>> > >>>>>> >>>> _______________________________________________
>> > >>>>>> >>>> LibreQoS mailing list
>> > >>>>>> >>>> LibreQoS@lists.bufferbloat.net
>> > >>>>>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>> > >>>>>> >>>
>> > >>>>>> >>>
>> > >>>>>> >>>
>> > >>>>>> >>> --
>> > >>>>>> >>> Robert Chac=C3=B3n
>> > >>>>>> >>> CEO | JackRabbit Wireless LLC
>> > >>>>>> >
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> --
>> > >>>>>> This song goes out to all the folk that thought Stadia would wo=
rk:
>> > >>>>>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity=
-6981366665607352320-FXtz
>> > >>>>>> Dave T=C3=A4ht CEO, TekLibre, LLC
>> > >>>>
>> > >
>>
>>
>>
>


--=20
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-69813666656=
07352320-FXtz
Dave T=C3=A4ht CEO, TekLibre, LLC