From: Robert Chacón
Date: Sat, 22 Oct 2022 08:47:36 -0600
To: Herbert Wolverson
Cc: libreqos@lists.bufferbloat.net
Subject: Re: [LibreQoS] In BPF pping - so far

Awesome work! It's really amazing how little additional CPU the TCP tracking adds. Super excited to start testing in production myself soon. Have a great restful morning with your daughter. 😌

On Sat, Oct 22, 2022, 8:32 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:

> This morning I tested cpu-pping with live customers! A little over 1,200 mapped IP addresses, about 600 Mbps of real traffic flowing through a big hierarchy of 52 sites. (600 is our "quiet time" traffic.)
>
> It started very well: the updated xdp-cpumap system dropped in place and the system worked as before. xdp_pping started to show data with correct mappings. CPU load from the mapping system is within 1% of where it was before.
>
> After about 20 minutes of continuous execution, it started to run into some scaling issues. The shaping system continued to run wonderfully, and CPU load was still fine. However, it stopped reporting latency data! A bit of debugging showed that once you exceed 16,384 in-flight TCP streams it isn't handling the "map full" situation gracefully - and clearing the map from userspace isn't working correctly. So I hacked away and hacked away.
>
> Anyway, it turns out that it does in fact work fine at that scale. There's just a one-line bug in the xdp_pping.c file. I forgot to actually *call* one line of packet cleanup code. Adding that, and everything was awesome.
>
> The entire patch that fixed it consists of adding one line:
> cleanup_packet_ts(packet_ts);
>
> Oops.
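For readers following along, a minimal sketch of what "clearing the map from userspace" can look like with libbpf is below: it walks a pinned timestamp map and deletes entries older than a cutoff. The pin path, key layout, and value layout here are illustrative assumptions, not the actual cpumap-pping definitions, and the in-kernel cleanup call mentioned above remains the cheaper option.

    /* sweep_packet_ts.c - illustrative userspace sweep of a pinned BPF hash map.
     * Build (assumed): gcc sweep_packet_ts.c -lbpf -o sweep_packet_ts */
    #include <bpf/bpf.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Assumed key layout: flow identifier plus the TCP TSval being tracked. */
    struct packet_id {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint32_t tsval;
    };

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts); /* same clock family as bpf_ktime_get_ns() */
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    int main(void)
    {
        /* Assumed pin path for the packet-timestamp map (value assumed to be a
         * u64 egress timestamp in nanoseconds). */
        int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/packet_ts");
        if (fd < 0) {
            perror("bpf_obj_get");
            return 1;
        }

        const uint64_t max_age_ns = 2ull * 1000000000ull; /* expire entries older than 2 s */
        static struct packet_id to_del[16384];
        struct packet_id key, next;
        size_t n_del = 0;
        uint64_t stamp_ns;
        void *prev = NULL;

        /* Walk every key; collect stale ones first, then delete (deleting while
         * walking can make bpf_map_get_next_key skip or restart entries). */
        while (bpf_map_get_next_key(fd, prev, &next) == 0) {
            if (bpf_map_lookup_elem(fd, &next, &stamp_ns) == 0 &&
                now_ns() - stamp_ns > max_age_ns &&
                n_del < sizeof(to_del) / sizeof(to_del[0]))
                to_del[n_del++] = next;
            key = next;
            prev = &key;
        }
        for (size_t i = 0; i < n_del; i++)
            bpf_map_delete_elem(fd, &to_del[i]);

        printf("expired %zu stale entries\n", n_del);
        return 0;
    }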
> Anyway, with that in place it's running superbly. I did identify a couple of places in which it's being overly verbose with debug information, so I've patched that also.
>
> After reducing the overly eager warning about not being able to read a TCP header, CPU performance improved by another 2% on average.
>
> Longer-term (i.e. not on a Saturday morning, when I'd rather be playing with my daughter!), I think I'll look at raising some of the buffer sizes.
>
> Thanks,
> Herbert
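Raising "the buffer sizes" here would presumably mean bumping max_entries on the flow/timestamp maps so more than 16,384 packets can be in flight at once. As a rough illustration only (the map name, key type, and pinning choice are assumptions, not the actual cpumap-pping source), a BTF-style map definition on the BPF side looks like this:

    /* Fragment of a BPF-side map definition (compiled with clang -target bpf).
     * Doubling max_entries roughly doubles the kernel memory the map may use. */
    #include <linux/types.h>
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct packet_id {                       /* assumed key layout */
        __u32 saddr, daddr;
        __u16 sport, dport;
        __u32 tsval;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);          /* raised from 16384 */
        __type(key, struct packet_id);
        __type(value, __u64);                /* egress timestamp, ns */
        __uint(pinning, LIBBPF_PIN_BY_NAME); /* shared by both tc directions */
    } packet_ts SEC(".maps");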
> On Wed, Oct 19, 2022 at 11:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>
>> PS - today's (free) p99 conference is *REALLY AWESOME*. https://www.p99conf.io/
>>
>> On Wed, Oct 19, 2022 at 9:13 AM Dave Taht <dave.taht@gmail.com> wrote:
>> >
>> > flent outputs a flent.gz file that I can parse and plot 20 different ways. Also the graphing tools work on osx.
>> >
>> > On Wed, Oct 19, 2022 at 9:11 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >
>> > > That's true. The 12th gen does seem to have some "special" features... makes for a nice writing platform (this box is primarily my "write books and articles" machine). I'll be doing a wider test on a more normal platform, probably at the weekend (with real traffic, hence the delay - have to find a time in which I minimize disruption).
>> > >
>> > > On Wed, Oct 19, 2022 at 10:49 AM dan <dandenson@gmail.com> wrote:
>> > >>
>> > >> Those 'efficiency' threads in Intel 12th gen should probably be addressed as well. You can't turn them off in BIOS.
>> > >>
>> > >> On Wed, Oct 19, 2022 at 8:48 AM Robert Chacón via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>
>> > >>> Awesome work on this!
>> > >>> I suspect there should be a slight performance bump once Hyperthreading is disabled and efficient power management is off. Hyperthreading/SMT always messes with HTB performance when I leave it on. Thank you for mentioning that - I now went ahead and added instructions on disabling hyperthreading on the Wiki for new users.
>> > >>> Super promising results!
>> > >>> Interested to see what throughput is with xdp-cpumap-tc vs cpumap-pping. So far in your VM setup it seems to be doing very well.
>> > >>>
>> > >>> On Wed, Oct 19, 2022 at 8:06 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>
>> > >>>> Also, I forgot to mention that I *think* the current version has removed the requirement that the inbound and outbound classifiers be placed on the same CPU. I know interduo was particularly keen on packing upload into fewer cores. I'll add that to my list of things to test.
>> > >>>>
>> > >>>> On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com> wrote:
>> > >>>>>
>> > >>>>> I'll definitely take a look - that does look interesting. I don't have X11 on any of my test VMs, but it looks like it can work without the GUI.
>> > >>>>>
>> > >>>>> Thanks!
>> > >>>>>
>> > >>>>> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>> could I coax you to adopt flent?
>> > >>>>>>
>> > >>>>>> apt-get install flent netperf irtt fping
>> > >>>>>>
>> > >>>>>> You sometimes have to compile netperf yourself with --enable-demo on some systems. There are a bunch of python libs needed for the gui, but only on the client.
>> > >>>>>>
>> > >>>>>> Then you can run a really gnarly test series and plot the results over time.
>> > >>>>>>
>> > >>>>>> flent --socket-stats --step-size=.05 -t 'the-test-conditions' -H the_server_name rrul # 110 other tests
>> > >>>>>>
>> > >>>>>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >
>> > >>>>>> > Hey,
>> > >>>>>> >
>> > >>>>>> > Testing the current version ( https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better than I hoped. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_pping) to extract and reset stats.
>> > >>>>>> >
>> > >>>>>> > My testing environment has grown a bit:
>> > >>>>>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-pping-hackjob version of xdp-cpumap.
>> > >>>>>> > * ExtTest - running Ubuntu Server, set as 10.64.1.1. Hosts an iperf server.
>> > >>>>>> > * ClientInt1 - running Ubuntu Server (minimal), set as 10.64.1.2. Hosts an iperf client.
>> > >>>>>> > * ClientInt2 - running Ubuntu Server (minimal), set as 10.64.1.3. Hosts an iperf client.
>> > >>>>>> >
>> > >>>>>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are on a virtual switch. ExtTest and the other interface (WAN facing) of ShaperVM are on a different virtual switch.
>> > >>>>>> >
>> > >>>>>> > These are all on a host machine running Windows 11: a core i7 12th gen, 32 GB RAM and a fast SSD setup.
>> > >>>>>> >
>> > >>>>>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5gbit/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to about 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> > * Set to use Cake.
>> > >>>>>> >
>> > >>>>>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500 (for a long run). Running xdp_pping yields correct results:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Or when I waited a while to gather/reset:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>> > >>>>>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows no errors, just periodic logging that it is recording data. CPU is about 2-3% on two CPUs, zero on the others (as expected).
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, each client reported a throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>> > >>>>>> >
>> > >>>>>> > So for smaller streams, I'd call this a success.
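A side note for anyone decoding the "tc":"1:5" keys in that output: a tc classid/handle is a single 32-bit value with the major number in the upper 16 bits and the minor in the lower 16, which is how one u32 can travel between the BPF classifier and the reporting tool. A small standalone illustration (not code from xdp_pping itself):

    /* tc_handle.c - pack/unpack a tc "major:minor" handle the way the kernel
     * stores it (major in the top 16 bits, minor in the bottom 16). */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t tc_handle_make(uint16_t major, uint16_t minor)
    {
        return ((uint32_t)major << 16) | minor;
    }

    static void tc_handle_print(uint32_t handle, char *buf, size_t len)
    {
        snprintf(buf, len, "%u:%u", handle >> 16, handle & 0xFFFF);
    }

    int main(void)
    {
        char buf[16];
        uint32_t h = tc_handle_make(1, 5);   /* the CPE shaped as 1:5 above */
        tc_handle_print(h, buf, sizeof(buf));
        printf("0x%08x -> %s\n", h, buf);    /* prints: 0x00010005 -> 1:5 */
        return 0;
    }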
>> > >>>>>> >
>> > >>>>>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>> > >>>>>> >
>> > >>>>>> > For this test, LibreQoS is configured:
>> > >>>>>> > * Two APs, each with 5gb/s max.
>> > >>>>>> > * 100.64.1.2 and 100.64.1.3 set up as CPEs, each limited to 5Gbit/s! Mapped to 1:5 and 2:5 respectively (separate CPUs).
>> > >>>>>> >
>> > >>>>>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>> > >>>>>> >
>> > >>>>>> > xdp_pping shows results, too:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>> > >>>>>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>> > >>>>>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>> > >>>>>> >
>> > >>>>>> > After 500 seconds of continual iperfing, the clients reported throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GBytes).
>> > >>>>>> >
>> > >>>>>> > Maxing out Hyper-V like this is inducing a bit of latency (which is to be expected), but it's not bad. I also forgot to disable hyperthreading, and looking at the host performance it is sometimes running the second virtual CPU on an underpowered "fake" CPU.
>> > >>>>>> >
>> > >>>>>> > So for two large streams, I think we're doing pretty well also!
>> > >>>>>> >
>> > >>>>>> > TEST 3: DUAL STREAMS, SINGLE CPU
>> > >>>>>> >
>> > >>>>>> > This test is designed to try and blow things up. It's the same as test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1:6.
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The pping stats start to show a bit of degradation in performance for pounding it so hard:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > For whatever reason, it smoothed out over time:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>> > >>>>>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > Surprisingly (to me), I didn't encounter errors. Each client received 2.22 Gbit/s performance, over 129 GBytes of data.
>> > >>>>>> >
>> > >>>>>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>> > >>>>>> >
>> > >>>>>> > This test is also designed to break things. Same as test 3, but using iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the flow tracking. (Shorter time window because I really wanted to go and find coffee.)
>> > >>>>>> >
>> > >>>>>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results show that this torture test is worsening performance, and there's always lots of samples in the buffer:
>> > >>>>>> >
>> > >>>>>> > [
>> > >>>>>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>> > >>>>>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>> > >>>>>> > {}]
>> > >>>>>> >
>> > >>>>>> > This test also ran better than I expected. Each VM showed around 2.4 Gbit/s in total performance at the end of the iperf session. There's definitely some latency creeping in as I make the system work hard, which is expected - but I'm not sure I expected quite that much.
>> > >>>>>> >
>> > >>>>>> > WHAT'S NEXT & CONCLUSION
>> > >>>>>> >
>> > >>>>>> > I noticed that I forgot to turn off efficient power management on my VMs and host, and left Hyperthreading on by mistake. So that hurts overall performance.
>> > >>>>>> >
>> > >>>>>> > The base system seems to be working pretty solidly, at least for small tests. Next up, I'll be removing extraneous debug reporting code, removing some code paths that don't do anything but report, and looking for any small optimization opportunities. I'll then re-run these tests. Once that's done, I hope to find a maintenance window on my WISP and try it with actual traffic.
>> > >>>>>> >
>> > >>>>>> > I also need to re-run these tests without the pping system to provide some before/after analysis.
>> > >>>>>> >
>> > >>>>>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <herberticus@gmail.com> wrote:
>> > >>>>>> >>
>> > >>>>>> >> It's probably not entirely thread-safe right now (ran into some issues reading per_cpu maps back from userspace; hopefully, I'll get that figured out) - but the commits I just pushed have it basically working on single-stream testing. :-)
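On the per-CPU map readback issue mentioned above: when userspace does a lookup on a BPF_MAP_TYPE_PERCPU_* map, the kernel returns one value per possible CPU in a single buffer, so the caller has to size the buffer accordingly and aggregate the per-CPU copies itself. A minimal sketch (the pin path, key type, and value type are assumptions for illustration):

    /* read_percpu.c - reading one key from a pinned per-CPU BPF map.
     * Illustrative only; build (assumed): gcc read_percpu.c -lbpf */
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/rtt_samples"); /* assumed pin path */
        if (fd < 0) {
            perror("bpf_obj_get");
            return 1;
        }

        int ncpus = libbpf_num_possible_cpus();
        /* Per-CPU lookups return ncpus values; each slot is padded to 8 bytes,
         * so a u64 value type lines up exactly. */
        uint64_t *vals = calloc(ncpus, sizeof(uint64_t));
        uint32_t key = (1 << 16) | 5;        /* tc handle 1:5, assumed key type */

        if (bpf_map_lookup_elem(fd, &key, vals) == 0) {
            uint64_t total = 0;
            for (int i = 0; i < ncpus; i++)
                total += vals[i];            /* with flows pinned to one core,
                                                typically only one slot is non-zero */
            printf("1:5 -> %llu\n", (unsigned long long)total);
        }
        free(vals);
        return 0;
    }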
>> > >>>>>> >>
>> > >>>>>> >> Setup cpumap as usual, and periodically run xdp-pping. This gives you per-connection RTT information in JSON:
>> > >>>>>> >>
>> > >>>>>> >> [
>> > >>>>>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>> > >>>>>> >> {}]
>> > >>>>>> >>
>> > >>>>>> >> (With the extra {} because I'm not tracking the tail and haven't done comma removal). The tool also empties the various maps used to gather data, acting as a "reset" point. There's a max of 60 samples per queue, in a ringbuffer setup (so newest will start to overwrite the oldest).
>> > >>>>>> >>
>> > >>>>>> >> I'll start trying to test on a larger scale now.
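To make the reporting side concrete, here is a rough sketch of how a 60-sample ring per tc handle could be summarized into that JSON, including comma handling that makes the trailing {} unnecessary. The struct layout and the demo data are assumptions for illustration, not the actual xdp_pping code:

    /* summarize.c - turn a per-queue ring of RTT samples into one JSON entry each. */
    #include <stdint.h>
    #include <stdio.h>

    #define RING_SLOTS 60

    struct rtt_ring {
        uint32_t tc_handle;               /* (major << 16) | minor */
        uint32_t count;                   /* total samples ever written */
        uint32_t samples_us[RING_SLOTS];  /* newest overwrites oldest */
    };

    /* Print one JSON object; returns 1 if something was printed. */
    static int print_entry(const struct rtt_ring *r, int first)
    {
        uint32_t n = r->count < RING_SLOTS ? r->count : RING_SLOTS;
        if (n == 0)
            return 0;
        uint64_t sum = 0;
        uint32_t min = r->samples_us[0], max = r->samples_us[0];
        for (uint32_t i = 0; i < n; i++) {
            uint32_t s = r->samples_us[i];
            sum += s;
            if (s < min) min = s;
            if (s > max) max = s;
        }
        printf("%s{\"tc\":\"%u:%u\", \"avg\" : %llu, \"min\" : %u, \"max\" : %u, \"samples\" : %u}",
               first ? "" : ",\n",
               r->tc_handle >> 16, r->tc_handle & 0xFFFF,
               (unsigned long long)(sum / n), min, max, n);
        return 1;
    }

    int main(void)
    {
        struct rtt_ring demo[2] = {       /* made-up sample data */
            { (1 << 16) | 5, 3, { 4, 3, 5 } },
            { (2 << 16) | 5, 2, { 4, 5 } },
        };
        int first = 1;
        printf("[\n");
        for (int i = 0; i < 2; i++)
            if (print_entry(&demo[i], first))
                first = 0;
        printf("\n]\n");
        return 0;
    }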
>> > >>>>>> >>
>> > >>>>>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chacón <robert.chacon@jackrabbitwireless.com> wrote:
>> > >>>>>> >>>
>> > >>>>>> >>> Hey Herbert,
>> > >>>>>> >>>
>> > >>>>>> >>> Fantastic work! Super exciting to see this coming together, especially so quickly. I'll test it soon.
>> > >>>>>> >>> I understand and agree with your decision to omit certain features (ICMP tracking, DNS tracking, etc.) to optimize performance for our use case. Like you said, in order to merge the functionality without a performance hit, merging them is sort of the only way right now. Otherwise there would be a lot of redundancy and lost throughput for an ISP's use. Though hopefully long term there will be a way to keep all projects working independently but interoperably with a plugin system of some kind.
>> > >>>>>> >>>
>> > >>>>>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on optimizations for high sub counts (8000+ subs) as well as stateful changes to the queue structure. I'm working to set up a physical lab to test high throughput and high client count scenarios.
>> > >>>>>> >>> When testing beyond ~32,000 filters we get "no space left on device" from xdp-cpumap-tc, which I think relates to the bpf map size limitation you mentioned. Maybe in the coming months we can take a look at that.
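If that limit does turn out to be a BPF map's max_entries rather than a tc-side limit, one way to raise it without editing the BPF source is to override it from the loader before the object is loaded. A hedged sketch with libbpf; the object file name and map name below are placeholders, not the actual xdp-cpumap-tc names:

    /* bump_map.c - raise a map's max_entries at load time with libbpf. */
    #include <bpf/libbpf.h>
    #include <stdio.h>

    int main(void)
    {
        /* Placeholder object path; the real loader would point at its own .o file. */
        struct bpf_object *obj = bpf_object__open_file("xdp_prog_kern.o", NULL);
        if (!obj) {
            fprintf(stderr, "failed to open BPF object\n");
            return 1;
        }

        /* Placeholder map name; would need to be the real IP->CPU/classid map. */
        struct bpf_map *map = bpf_object__find_map_by_name(obj, "map_ip_to_cpu");
        if (!map) {
            fprintf(stderr, "map not found\n");
            return 1;
        }

        bpf_map__set_max_entries(map, 131072);  /* room well past 32k entries */

        if (bpf_object__load(obj)) {
            fprintf(stderr, "load failed\n");
            return 1;
        }
        printf("loaded with enlarged map\n");
        bpf_object__close(obj);
        return 0;
    }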
>> > >>>>>> >>>
>> > >>>>>> >>> Anyway great work on the cpumap-pping program! Excited to see more on this.
>> > >>>>>> >>>
>> > >>>>>> >>> Thanks,
>> > >>>>>> >>> Robert
>> > >>>>>> >>>
>> > >>>>>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> wrote:
>> > >>>>>> >>>>
>> > >>>>>> >>>> Hey,
>> > >>>>>> >>>>
>> > >>>>>> >>>> My current (unfinished) progress on this is now available here: https://github.com/thebracket/cpumap-pping-hackjob
>> > >>>>>> >>>>
>> > >>>>>> >>>> I mean it about the warnings: this isn't at all stable or debugged - and I can't promise that it won't unleash the nasal demons (to use a popular C++ phrase). The name is descriptive! ;-)
>> > >>>>>> >>>>
>> > >>>>>> >>>> With that said, I'm pretty happy so far:
>> > >>>>>> >>>>
>> > >>>>>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted onto a dedicated CPU. It has to run on both the inbound and outbound classifiers, since otherwise it would only see half the conversation.
>> > >>>>>> >>>> * It does assume that your ingress and egress CPUs are mapped to the same interface; I do that anyway in BracketQoS. Not doing that opens up a potential world of pain, since writes to the shared maps would require a locking scheme. Too much locking, and you lose all of the benefit of using multiple CPUs to begin with.
>> > >>>>>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've worked with have lots of it.
>> > >>>>>> >>>> * I've been gradually removing features that I don't want for BracketQoS. A hypothetical future "useful to everyone" version wouldn't do that.
>> > >>>>>> >>>> * Rate limiting is working, but I removed the requirement for a shared configuration provided from userland - so right now it's always set to report at 1 second intervals per stream.
>> > >>>>>> >>>>
>> > >>>>>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and a "shaper" VM in between running a slightly hacked-up LibreQoS. iperf from "client" to "world" (with Libre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my test PC (the host is a core i7, 12th gen, 12 cores - 64 GB RAM and fast SSDs).
>> > >>>>>> >>>>
>> > >>>>>> >>>> Output currently consists of debug messages reading:
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 515.399222: bpf_trace_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 515.399466: bpf_trace_printk: (tc) Flow open event
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 525.434080: bpf_trace_printk: (tc) Send flow event
>> > >>>>>> >>>>   cpumap/0/map:4-1371 [000] D..2. 525.482714: bpf_trace_printk: (tc) Send flow event
>> > >>>>>> >>>>
>> > >>>>>> >>>> The times haven't been tweaked yet. The (5,1) is the tc handle major/minor, allocated by the xdp-cpumap parent. I get pretty low latency between VMs; I'll set up a test with some real-world data very soon.
>> > >>>>>> >>>>
>> > >>>>>> >>>> I plan to keep hacking away, but feel free to take a peek.
>> > >>>>>> >>>>
>> > >>>>>> >>>> Thanks,
>> > >>>>>> >>>> Herbert
>> > >>>>>> >>>>
>> > >>>>>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se> wrote:
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of quick notes.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>> > >>>>>> >>>>> > [ Adding Simon to Cc ]
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > > Hey,
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I've had some pretty good success with merging xdp-pping ( https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h ) into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > I ported over most of the xdp-pping code, and then changed the entry point and packet parsing code to make use of the work already done in xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need to do it twice). Then I switched the maps to per-cpu maps, and had to pin them - otherwise the two tc instances don't properly share data.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed on the same CPU core at both ingress and egress. Otherwise, if a flow may be processed by different cores on ingress and egress, the per-CPU maps will not really work reliably, as each core will have a different view on the state of the flow, whether there's been a previous packet with a certain TSval from that flow, etc.
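For anyone new to the pping approach being merged here: the RTT measurement works by remembering when a given TCP timestamp value (TSval) was seen leaving in one direction, and matching it against the echoed timestamp (TSecr) coming back in the other direction. A stripped-down, userspace-only toy of that matching logic follows; the real implementation lives in the BPF programs, keys on the flow as well as the timestamp, and stores its state in BPF maps rather than this little table:

    /* tsval_rtt.c - toy illustration of TSval/TSecr based RTT matching. */
    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS 1024

    struct entry { uint32_t tsval; uint64_t sent_ns; int used; };
    static struct entry table[SLOTS];

    /* Outbound packet: remember when this TSval was first seen. */
    static void on_egress(uint32_t tsval, uint64_t now_ns)
    {
        struct entry *e = &table[tsval % SLOTS];
        if (!e->used || e->tsval != tsval) {
            e->tsval = tsval;
            e->sent_ns = now_ns;
            e->used = 1;
        }
    }

    /* Inbound packet: the peer echoes our TSval back as TSecr. */
    static int on_ingress(uint32_t tsecr, uint64_t now_ns, uint64_t *rtt_ns)
    {
        struct entry *e = &table[tsecr % SLOTS];
        if (e->used && e->tsval == tsecr) {
            *rtt_ns = now_ns - e->sent_ns;
            e->used = 0;            /* one RTT sample per timestamp value */
            return 1;
        }
        return 0;
    }

    int main(void)
    {
        uint64_t rtt;
        on_egress(1000, 50000000);            /* TSval 1000 sent at t = 50 ms   */
        if (on_ingress(1000, 54200000, &rtt)) /* echoed back at t = 54.2 ms     */
            printf("RTT = %.1f ms\n", rtt / 1e6);
        return 0;
    }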
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Furthermore, if a flow is always processed on the same core (on both ingress and egress), I think per-CPU maps may be a bit wasteful on memory. From my understanding the keys for per-CPU maps are still shared across all CPUs, it's just that each CPU gets its own value. So all CPUs will then have their own data for each flow, but it's only the CPU processing the flow that will have any relevant data for the flow, while the remaining CPUs will just have an empty state for that flow. Under the same assumption that packets within the same flow are always processed on the same core, there should generally not be any concurrency issues with having a global (non-per-CPU) map either, as packets from the same flow cannot be processed concurrently then (and thus no concurrent access to the same value in the map). I am however still very unclear on whether there's any considerable performance impact between global and per-CPU map versions if the same key is not accessed concurrently.
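Concretely, the two variants being weighed look almost identical on the BPF side; the difference is that the per-CPU type allocates one value per possible CPU for every key, while the global type holds a single value per key and relies on the flow staying on one core (or on explicit locking/atomics) for correctness. A sketch, with an assumed flow-state layout rather than the actual pping structures:

    /* Fragment of BPF-side map definitions (clang -target bpf).
     * struct flow_state is an assumed placeholder for the per-flow data. */
    #include <linux/types.h>
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct flow_state {
        __u32 last_tsval;
        __u64 last_seen_ns;
        __u32 min_rtt_us;
    };

    /* Per-CPU: every key carries one copy of the value per possible CPU.
     * Lock-free, but only the core that owns the flow ever fills its copy in. */
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
        __uint(max_entries, 16384);
        __type(key, __u64);                  /* assumed flow hash */
        __type(value, struct flow_state);
    } flow_state_percpu SEC(".maps");

    /* Global: one value per key. Fine without locks if ingress and egress of a
     * flow are pinned to the same CPU, as xdp-cpumap-tc arranges. */
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, __u64);
        __type(value, struct flow_state);
    } flow_state_global SEC(".maps");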
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > > Right now, output is just stubbed - I've still got to port the perfmap output code. Instead, I'm dumping a bunch of extra data to the kernel debug pipe, so I can see roughly what the output would look like.
>> > >>>>>> >>>>> > >
>> > >>>>>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec on single-stream iperf between two VMs (with a shaper VM in the middle). :-)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Just FYI, that "just logging" is probably the biggest source of overhead, then. What Simon found was that sending the data from kernel to userspace is one of the most expensive bits of epping, at least when the number of data points goes up (which it does as additional flows are added).
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may get upwards of 1000 RTTs/s per flow) is not only problematic in terms of direct overhead from the tool itself, but also becomes demanding for whatever you use all those RTT samples for (i.e. you need to log, parse, analyze etc. a very large amount of RTTs). One way to deal with that is of course to just apply some sort of sampling (the -r/--rate-limit and -R/--rtt-rate options).
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > > So my question: how would you prefer to receive this data? I'll have to write a daemon that provides userspace control (periodic cleanup as well as reading the performance stream), so the world's kinda our oyster. I can stick to Kathie's original format (and dump it to a named pipe, perhaps?), a condensed format that only shows what you want to use, an efficient binary format if you feel like parsing that...
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > It would be great if we could combine efforts a bit here so we don't fork the codebase more than we have to. I.e., if "upstream" epping and whatever daemon you end up writing can agree on data format etc. that would be fantastic! Added Simon to Cc to facilitate this :)
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Briefly, what I've discussed before with Simon was to have the ability to aggregate the metrics in the kernel (WiP PR [0]) and have a userspace utility periodically pull them out. What we discussed was doing this using an LPM map (which is not in that PR yet). The idea would be that userspace would populate the LPM map with the keys (prefixes) they wanted statistics for (in LibreQOS context that could be one key per customer, for instance). Epping would then do a map lookup into the LPM, and if it gets a match it would update the statistics in that map entry (keeping a histogram of latency values seen, basically). Simon's PR below uses this technique, where userspace will "reset" the histogram every time it loads it by swapping out two different map entries when it does a read; this allows you to control the sampling rate from userspace, and you'll just get the data since the last time you polled.
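A rough sketch of what that LPM-keyed aggregation could look like on the BPF side, just to make the idea concrete; the prefix count, bucket layout, and IPv4-only key are assumptions here, and the real work is in the WiP PR referenced below:

    /* Fragment of a BPF-side LPM aggregation sketch (clang -target bpf). */
    #include <linux/types.h>
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct lpm_v4_key {
        __u32 prefixlen;            /* LPM keys must start with the prefix length */
        __u32 addr;                 /* IPv4 address, network byte order */
    };

    struct rtt_hist {
        __u64 bucket[32];           /* log2(us) latency histogram */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
        __uint(max_entries, 16384);             /* e.g. one prefix per customer */
        __type(key, struct lpm_v4_key);
        __type(value, struct rtt_hist);
    } rtt_by_prefix SEC(".maps");

    /* Called from the classifier once an RTT sample (in microseconds) is known;
     * userspace pre-populates rtt_by_prefix with the prefixes it cares about. */
    static __always_inline void record_rtt(__u32 addr, __u32 rtt_us)
    {
        struct lpm_v4_key key = { .prefixlen = 32, .addr = addr };
        struct rtt_hist *h = bpf_map_lookup_elem(&rtt_by_prefix, &key);
        if (!h)
            return;                 /* no one asked for stats on this prefix */

        __u32 b = 0;
        while (b < 31 && (rtt_us >> (b + 1)))   /* bounded loop: b = floor(log2(rtt_us)) */
            b++;
        __sync_fetch_and_add(&h->bucket[b], 1);
    }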
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Thanks, Toke, for summarizing both the current state and the plan going forward. I will just note that this PR (and all my other work with ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less on hold for a couple of weeks right now as I'm trying to finish up a paper.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > I was thinking that if we all can agree on the map format, then your polling daemon could be one userspace "client" for that, and the epping binary itself could be another; but we could keep compatibility between the two, so we don't duplicate effort.
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > Similarly, refactoring of the epping code itself so it can be plugged into the cpumap-tc code would be a good goal...
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> Should probably do that... at some point. In general I think it's a bit of an interesting problem to think about how to chain multiple XDP/tc programs together in an efficient way. Most XDP and tc programs will do some amount of packet parsing, and when you have many chained programs parsing the same packets this obviously becomes a bit wasteful. At the same time, it would be nice if one didn't need to manually merge multiple programs together into a single one like this to get rid of this duplicated parsing, or at least make that process of merging those programs as simple as possible.
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> > -Toke
>> > >>>>>> >>>>> >
>> > >>>>>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>> > >>>>>> >>>>>
>> > >>>>>> >>>>> When you send an e-mail to Karlstad University, we will process your personal data <https://www.kau.se/en/gdpr>.
>> > >>>
>> > >>> --
>> > >>> Robert Chacón
>> > >>> CEO | JackRabbit Wireless LLC
>>
>> --
>> This song goes out to all the folk that thought Stadia would work: https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
>> Dave Täht CEO, TekLibre, LLC
>
> _______________________________________________
> LibreQoS mailing list
> LibreQoS@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/libreqos