From: Dave Taht
Date: Wed, 19 Oct 2022 09:13:38 -0700
To: Herbert Wolverson
Cc: "libreqos@lists.bufferbloat.net"
Subject: Re: [LibreQoS] In BPF pping - so far

PS - today's (free) p99 conference is *REALLY AWESOME*. https://www.p99conf.io/

On Wed, Oct 19, 2022 at 9:13 AM Dave Taht wrote:
>
> flent outputs a flent.gz file that I can parse and plot 20 different ways. Also the graphing tools work on OS X.
>
> On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS wrote:
> >
> > That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform (this box is primarily my "write books and articles" machine).
> > I'll be doing a wider test on a more normal platform, probably at the weekend (with real traffic, hence the delay - I have to find a time in which I can minimize disruption).
> >
> > On Wed, Oct 19, 2022 at 10:49 AM dan wrote:
> >>
> >> Those 'efficiency' threads in Intel 12th gen should probably be addressed as well. You can't turn them off in BIOS.
> >>
> >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS wrote:
> >>>
> >>> Awesome work on this!
> >>> I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off.
> >>> Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I went ahead and added instructions on disabling hyperthreading to the Wiki for new users.
> >>> Super promising results!
> >>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
> >>>
> >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS wrote:
> >>>>
> >>>> Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing upload into fewer cores. I'll add that to my list of things to test.
> >>>>
> >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson wrote:
> >>>>>
> >>>>> I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but it looks like it can work without the GUI.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht wrote:
> >>>>>>
> >>>>>> could I coax you to adopt flent?
> >>>>>>
> >>>>>> apt-get install flent netperf irtt fping
> >>>>>>
> >>>>>> You sometimes have to compile netperf yourself with --enable-demo on some systems.
> >>>>>> There are a bunch of Python libs needed for the GUI, but only on the client.
> >>>>>>
> >>>>>> Then you can run a really gnarly test series and plot the results over time.
> >>>>>>
> >>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H the_server_name rrul # 110 other tests
> >>>>>>
> >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS wrote:
> >>>>>> >
> >>>>>> > Hey,
> >>>>>> >
> >>>>>> > Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-CPU) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
> >>>>>> >
> >>>>>> > My testing environment has grown a bit:
> >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
> >>>>>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf server.
> >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hosts an iperf client.
> >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hosts an iperf client.
> >>>>>> >
> >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch.
> >>>>>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
> >>>>>> >
> >>>>>> > These are all on a host machine running Windows 11, with a 12th-gen Core i7, 32 GB RAM and a fast SSD setup.
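
For reference, a minimal sketch of what a shared, pinned stats map along the lines Herbert describes might look like, assuming libbpf-style BTF map definitions. The struct layout and all names here are hypothetical illustrations, not the actual cpumap-pping-hackjob code; the point is simply that pinning a non-per-CPU map lets both tc programs and the xdp_pping daemon work against the same data:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical per-tc-handle RTT summary; userspace derives avg = sum / samples. */
    struct rtt_stats {
        __u32 samples;
        __u32 min_ms;
        __u32 max_ms;
        __u64 sum_ms;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);         /* shared, NOT per-CPU */
        __uint(max_entries, 65536);
        __type(key, __u32);                      /* tc handle (major:minor packed) */
        __type(value, struct rtt_stats);
        __uint(pinning, LIBBPF_PIN_BY_NAME);     /* visible under /sys/fs/bpf to both
                                                    tc instances and the daemon */
    } rtt_tracker SEC(".maps");

A userspace daemon could then open the pinned path with bpf_obj_get(), walk the entries, and zero them to provide the "extract and reset" behaviour described above.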
> >>>>>> >
> >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
> >>>>>> >
> >>>>>> > For this test, LibreQoS is configured:
> >>>>>> > * Two APs, each with 5 Gbit/s max.
> >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about 100 Mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
> >>>>>> > * Set to use Cake.
> >>>>>> >
> >>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > Or when I waited a while to gather/reset:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
> >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > The ShaperVM shows no errors, just periodic logging that it is recording data. CPU is about 2-3% on two CPUs, zero on the others (as expected).
> >>>>>> >
> >>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/s and 6.06 GBytes of data transmitted.
> >>>>>> >
> >>>>>> > So for smaller streams, I'd call this a success.
> >>>>>> >
> >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
> >>>>>> >
> >>>>>> > For this test, LibreQoS is configured:
> >>>>>> > * Two APs, each with 5 Gbit/s max.
> >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5 Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
> >>>>>> >
> >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
> >>>>>> >
> >>>>>> > xdp_pping shows results, too:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
> >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
> >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
> >>>>>> >
> >>>>>> > After 500 seconds of continual iperfing, the clients reported throughputs of 2.72 Gbit/s (158 GBytes) and 3.89 Gbit/s (226 GBytes) respectively.
> >>>>>> >
> >>>>>> > Maxing out Hyper-V like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
> >>>>>> >
> >>>>>> > So for two large streams, I think we're doing pretty well also!
> >>>>>> >
> >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
> >>>>>> >
> >>>>>> > This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
> >>>>>> >
> >>>>>> > ShaperVM CPU 1 maxed out in the high 90s; the other CPUs were idle.
> >>>>>> > The pping stats start to show a bit of performance degradation from pounding it so hard:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
> >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > For whatever reason, it smoothed out over time:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
> >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s of throughput, over 129 GBytes of data.
> >>>>>> >
> >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
> >>>>>> >
> >>>>>> > This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 sub-streams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee.)
> >>>>>> >
> >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there are always lots of samples in the buffer:
> >>>>>> >
> >>>>>> > [
> >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
> >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
> >>>>>> > {}]
> >>>>>> >
> >>>>>> > This test also ran better than I expected. Each VM showed around 2.4 Gbit/s in total throughput at the end of the iperf session. You can definitely see some latency creeping in as I make the system work hard - which is expected, but I'm not sure I expected quite that much.
> >>>>>> >
> >>>>>> > WHAT'S NEXT & CONCLUSION
> >>>>>> >
> >>>>>> > I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
> >>>>>> >
> >>>>>> > The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
> >>>>>> >
> >>>>>> > I also need to re-run these tests without the pping system to provide some before/after analysis.
> >>>>>> >
> >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson wrote:
> >>>>>> >>
> >>>>>> >> It's probably not entirely thread-safe right now (I ran into some issues reading per-CPU maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working in single-stream testing. :-)
> >>>>>> >>
> >>>>>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
> >>>>>> >>
> >>>>>> >> [
> >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
> >>>>>> >> {}]
> >>>>>> >>
> >>>>>> >> (With the extra {} because I'm not tracking the tail and haven't done comma removal.) The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ring-buffer setup (so the newest will start to overwrite the oldest).
> >>>>>> >>
> >>>>>> >> I'll start trying to test on a larger scale now.
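
As a rough illustration of the per-queue ring buffer Herbert describes (a fixed 60 samples, with the newest overwriting the oldest), the structure might look something like the sketch below. All names are hypothetical, not the actual cpumap-pping code:

    #include <linux/types.h>
    #include <bpf/bpf_helpers.h>

    #define MAX_SAMPLES 60   /* "max of 60 samples per queue" as described above */

    /* Hypothetical per-queue sample ring: writes wrap around, so once the
       buffer is full the newest sample overwrites the oldest. */
    struct rtt_ring {
        __u32 rtt_ms[MAX_SAMPLES];
        __u32 next_entry;    /* next slot to write */
        __u32 count;         /* saturates at MAX_SAMPLES */
    };

    static __always_inline void ring_push(struct rtt_ring *ring, __u32 rtt_ms)
    {
        __u32 slot = ring->next_entry % MAX_SAMPLES;   /* keeps the index bounded */
        ring->rtt_ms[slot] = rtt_ms;
        ring->next_entry = (ring->next_entry + 1) % MAX_SAMPLES;
        if (ring->count < MAX_SAMPLES)
            ring->count++;
    }

On this model, the xdp_pping "reset" step just amounts to zeroing next_entry and count (or the whole map value) after reading the entries out.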
> >>>>>> >>
> >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón wrote:
> >>>>>> >>>
> >>>>>> >>> Hey Herbert,
> >>>>>> >>>
> >>>>>> >>> Fantastic work! Super exciting to see this coming together, especially so quickly.
> >>>>>> >>> I'll test it soon.
> >>>>>> >>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. Like you said, in order to merge the functionality without a performance hit, merging them is sort of the only way right now. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all the projects working independently but interoperably with a plugin system of some kind.
> >>>>>> >>>
> >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure.
> >>>>>> >>> I'm working to set up a physical lab to test high-throughput and high-client-count scenarios.
> >>>>>> >>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the BPF map size limitation you mentioned. Maybe in the coming months we can take a look at that.
> >>>>>> >>>
> >>>>>> >>> Anyway, great work on the cpumap-pping program! Excited to see more on this.
> >>>>>> >>>
> >>>>>> >>> Thanks,
> >>>>>> >>> Robert
> >>>>>> >>>
> >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS wrote:
> >>>>>> >>>>
> >>>>>> >>>> Hey,
> >>>>>> >>>>
> >>>>>> >>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
> >>>>>> >>>>
> >>>>>> >>>> I mean it about the warnings; this isn't at all stable or debugged - and I can't promise that it won't unleash the nasal demons (to use a popular C++ phrase). The name is descriptive! ;-)
> >>>>>> >>>>
> >>>>>> >>>> With that said, I'm pretty happy so far:
> >>>>>> >>>>
> >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both the inbound and outbound classifiers, since otherwise it would only see half the conversation.
> >>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
> >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
> >>>>>> >>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
> >>>>>> >>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1-second intervals per stream.
> >>>>>> >>>>
> >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS.
> >>>>>> >>>> iperf from "client" to "world" (with Libre set to allow 10 Gbit/s max, via a cake/HTB queue setup) is around 5 Gbit/s at present, on my test PC (the host is a 12th-gen Core i7, 12 cores, 64 GB RAM and fast SSDs).
> >>>>>> >>>>
> >>>>>> >>>> Output currently consists of debug messages reading:
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399222: bpf_trace_printk: (tc) Flow open event
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399466: bpf_trace_printk: (tc) Flow open event
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.434080: bpf_trace_printk: (tc) Send flow event
> >>>>>> >>>> cpumap/0/map:4-1371 [000] D..2. 525.482714: bpf_trace_printk: (tc) Send flow event
> >>>>>> >>>>
> >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is the tc handle major/minor, allocated by the xdp-cpumap parent.
> >>>>>> >>>> I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
> >>>>>> >>>>
> >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
> >>>>>> >>>>
> >>>>>> >>>> Thanks,
> >>>>>> >>>> Herbert
> >>>>>> >>>>
> >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg wrote:
> >>>>>> >>>>>
> >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick notes.
> >>>>>> >>>>>
> >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> >>>>>> >>>>> > [ Adding Simon to Cc ]
> >>>>>> >>>>> >
> >>>>>> >>>>> > Herbert Wolverson via LibreQoS writes:
> >>>>>> >>>>> >
> >>>>>> >>>>> > > Hey,
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping ( https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h ) into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
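
To make the merge idea concrete (the next paragraph explains the approach: reuse the parsing xdp-cpumap-tc has already done instead of parsing the packet twice), here is a very rough sketch of what the glue might look like. Every name below is a hypothetical placeholder rather than the actual cpumap-pping-hackjob code, and the handle ordering in the trace line is purely illustrative:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical "already parsed" context handed over by the existing
       xdp-cpumap-tc classifier, so the pping-style logic does not re-parse. */
    struct parse_ctx {
        __u32 l4_offset;   /* offset of the TCP header found by the existing parser */
        __u32 tc_major;    /* tc handle the packet was classified into */
        __u32 tc_minor;
    };

    static __always_inline void pping_tc_hook(struct __sk_buff *skb,
                                              const struct parse_ctx *ctx)
    {
        /* ...match TSval/TSecr against the timestamp map and compute the RTT... */
        __u64 rtt_ns = 0;  /* placeholder for the computed value */

        /* Debug output of the kind shown in the trace above; bpf_printk
           writes to /sys/kernel/debug/tracing/trace_pipe. */
        bpf_printk("(tc) Send performance event (%u,%u), %llu",
                   ctx->tc_minor, ctx->tc_major, rtt_ns);
    }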
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the entry point and packet parsing code to make use of the work already done in xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do it twice). Then I switched the maps to per-CPU maps, and had to pin them - otherwise the two tc instances don't properly share data.
> >>>>>> >>>>> > >
> >>>>>> >>>>> I guess xdp-cpumap-tc ensures that the same flow is processed on the same CPU core on both ingress and egress. Otherwise, if a flow may be processed by different cores on ingress and egress, the per-CPU maps will not really work reliably, as each core will have a different view of the state of the flow, of whether there's been a previous packet with a certain TSval from that flow, etc.
> >>>>>> >>>>>
> >>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on both ingress and egress) I think per-CPU maps may be a bit wasteful of memory. From my understanding the keys for per-CPU maps are still shared across all CPUs, it's just that each CPU gets its own value. So all CPUs will then have their own data for each flow, but it's only the CPU processing the flow that will have any relevant data for it, while the remaining CPUs will just have an empty state for that flow. Under the same assumption that packets within the same flow are always processed on the same core, there should generally not be any concurrency issues with having a global (non-per-CPU) map either, as packets from the same flow cannot be processed concurrently then (and thus no concurrent access to the same value in the map). I am, however, still very unclear on whether there's any considerable performance difference between the global and per-CPU map versions if the same key is not accessed concurrently.
> >>>>>> >>>>>
> >>>>>> >>>>> > > Right now, output is just stubbed - I've still got to port the perfmap output code. Instead, I'm dumping a bunch of extra data to the kernel debug pipe, so I can see roughly what the output would look like.
> >>>>>> >>>>> > >
> >>>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on single-stream iperf between two VMs (with a shaper VM in the middle). :-)
> >>>>>> >>>>> >
> >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of overhead, then. What Simon found was that sending the data from kernel to userspace is one of the most expensive bits of epping, at least when the number of data points goes up (which it does as additional flows are added).
> >>>>>> >>>>>
> >>>>>> >>>>> Yeah, reporting individual RTTs when there are lots of them (you may get upwards of 1000 RTTs/s per flow) is not only problematic in terms of direct overhead from the tool itself, but also becomes demanding for whatever you use all those RTT samples for (i.e. you need to log, parse, analyze etc. a very large amount of RTTs).
> >>>>>> >>>>> One way to deal with that is of course to just apply some sort of sampling (the -r/--rate-limit and -R/--rtt-rate
> >>>>>> >>>>> >
> >>>>>> >>>>> > > So my question: how would you prefer to receive this data? I'll have to write a daemon that provides userspace control (periodic cleanup as well as reading the performance stream), so the world's kinda our oyster. I can stick to Kathie's original format (and dump it to a named pipe, perhaps?), a condensed format that only shows what you want to use, an efficient binary format if you feel like parsing that...
> >>>>>> >>>>> >
> >>>>>> >>>>> > It would be great if we could combine efforts a bit here so we don't fork the codebase more than we have to. I.e., if "upstream" epping and whatever daemon you end up writing can agree on data format etc., that would be fantastic! Added Simon to Cc to facilitate this :)
> >>>>>> >>>>> >
> >>>>>> >>>>> > Briefly, what I've discussed before with Simon was to have the ability to aggregate the metrics in the kernel (WiP PR [0]) and have a userspace utility periodically pull them out. What we discussed was doing this using an LPM map (which is not in that PR yet). The idea would be that userspace would populate the LPM map with the keys (prefixes) it wanted statistics for (in the LibreQoS context that could be one key per customer, for instance). Epping would then do a map lookup into the LPM, and if it gets a match it would update the statistics in that map entry (keeping a histogram of latency values seen, basically). Simon's PR below uses this technique, where userspace will "reset" the histogram every time it loads it by swapping out two different map entries when it does a read; this allows you to control the sampling rate from userspace, and you'll just get the data since the last time you polled.
> >>>>>> >>>>>
> >>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan going forward. I will just note that this PR (and all my other work with ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less on hold for a couple of weeks right now as I'm trying to finish up a paper.
> >>>>>> >>>>>
> >>>>>> >>>>> > I was thinking that if we all can agree on the map format, then your polling daemon could be one userspace "client" for that, and the epping binary itself could be another; but we could keep compatibility between the two, so we don't duplicate effort.
> >>>>>> >>>>> >
> >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be plugged into the cpumap-tc code would be a good goal...
> >>>>>> >>>>>
> >>>>>> >>>>> Should probably do that... at some point. In general I think it's a bit of an interesting problem to think about how to chain multiple XDP/tc programs together in an efficient way.
> >>>>>> >>>>> Most XDP and tc programs will do some amount of packet parsing, and when you have many chained programs parsing the same packets this obviously becomes a bit wasteful. At the same time, it would be nice if one didn't need to manually merge multiple programs together into a single one like this to get rid of the duplicated parsing, or at least make that process of merging those programs as simple as possible.
> >>>>>> >>>>>
> >>>>>> >>>>> > -Toke
> >>>>>> >>>>> >
> >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
> >>>>>> >>>>>
> >>>>>> >>>>> When you send an e-mail to Karlstad University, we will process your personal data.
> >>>>>> >>>
> >>>>>> >>> --
> >>>>>> >>> Robert Chacón
> >>>>>> >>> CEO | JackRabbit Wireless LLC
>
> --
> This song goes out to all the folk that thought Stadia would work:
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> Dave Täht CEO, TekLibre, LLC

-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC