From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <herberticus@gmail.com>
Received: from mail-pg1-x533.google.com (mail-pg1-x533.google.com
 [IPv6:2607:f8b0:4864:20::533])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by lists.bufferbloat.net (Postfix) with ESMTPS id 6E35E3B29D
 for <libreqos@lists.bufferbloat.net>; Wed, 19 Oct 2022 10:06:01 -0400 (EDT)
Received: by mail-pg1-x533.google.com with SMTP id s196so15046075pgs.3
 for <libreqos@lists.bufferbloat.net>; Wed, 19 Oct 2022 07:06:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
 h=cc:subject:message-id:date:from:in-reply-to:references:mime-version
 :from:to:cc:subject:date:message-id:reply-to;
 bh=Njbk+4y8CNAUnf3YnekXvT4sweRF0rY+RrSrvaD/nf8=;
 b=EkFa0sGSIzzroRlh6+mt+k+wI81xLtNE2SUTIOY8UUwVih/j3B7+C+TE4Z3EGqZ5dK
 OlKUZXntucwxvMbv1oWck6oJ7obM88icsUir/MxJ2u3S+aGh07AOkyp8B6cUn2tPEvUn
 piFetbQRVWNVDdUHgUCC5H1tWTkkoct9THSPDWElO0Rh1s+sksXJqTbZyGCtLEKdtBiZ
 zv/NicKQqRJkuJWeKJBAjIJPpx+SJKUcPV+AsSqEuqMd+kKRckaW7LT/AiGnhXukyka/
 3zxKa40jMAbXGH5uqKnmaU7vhKmMj8DlrAlMsjbkqt9fIrlmTNRiTKSLhoOfeZh/cv0Y
 tuPA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=cc:subject:message-id:date:from:in-reply-to:references:mime-version
 :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
 bh=Njbk+4y8CNAUnf3YnekXvT4sweRF0rY+RrSrvaD/nf8=;
 b=1J6sOTRQfcyuw42TdSC81AllcO1eINRq89f8HRSt5AsPXMnkDsHIvhuT66BsOu2Nzn
 3yES2P1jr1skwv2w2f1WDC59DXdrVOAtOE0lJxqESkwh4rCquEpStNAPaZ+Sb1AZFe9Z
 8utuRolUbIIVuZQUlD0iEpiRvYsvChOvsHJKzAKC6tZMhbTzjOqoeEAsXgDPujNbxkkc
 uCG1FZjJUe6U9Kfw2d//WVj+dfs1YGWQVFhi6J00KxPiKF68Ina7AekzPMt6EyXbl/Np
 1amaRYY6Yz0ZBk/UfvgQzrsEMV9f2tXUTSGVRjEsdtLiO/jTzJlnrVdDXf6SoQaDLIhV
 iw5w==
X-Gm-Message-State: ACrzQf1cWek+swFdiOWENyB8ujd5VUqVvp/aGmx0GLxudRKAshGWBS44
 qpV1x0ydQdoeBelEv8bv++fnaLsZ3gjMWMQ+UdHiTkMy
X-Google-Smtp-Source: AMsMyM5ul6q3gVtF8MzUdaki691cDnhqHJwZ/N4eyzeFe4B8NkaepLNXmxDcKns2a0iF8jAUtkAN5hSpem5+q9qXk9Y=
X-Received: by 2002:a65:6e0e:0:b0:434:59e0:27d3 with SMTP id
 bd14-20020a656e0e000000b0043459e027d3mr7158607pgb.185.1666188359940; Wed, 19
 Oct 2022 07:05:59 -0700 (PDT)
MIME-Version: 1.0
References: <CA+erpM5CNocpTnxNpTyEifaLv2P-ZbRXASUxS7iYr8LgCRgRNA@mail.gmail.com>
 <87bkqatu61.fsf@toke.dk>
 <759c25c6fd54dceccc00eada5ccf5358d2d1c20c.camel@kau.se>
 <CA+erpM4RN0QKfq4E2PXfPjz9P3cq_6fMJL1_GKLt_SQ3GGXPFw@mail.gmail.com>
 <CAOZyJosbcuHO0SYqraSAddJw_9FYtp_qUCh65Y9Vo6MHOeG5_g@mail.gmail.com>
 <CA+erpM4DUNpjVawVAdNqyLjDJwkLMc8jqrqng4nzSytai-1ngA@mail.gmail.com>
 <CA+erpM7hSYN2fY_Hqjgj5GJ18o3344j-Kz-aBtMdtSjRYfA2Yg@mail.gmail.com>
 <CAA93jw6EkMjMjGmHwiBYxG8jqJJhnwb7cUguJ5N4vntpMDnVbw@mail.gmail.com>
 <CA+erpM5UDLr-VrwYJ3h2fciuc1CQsWs2WZOjD2QzFWNN7xcvbQ@mail.gmail.com>
In-Reply-To: <CA+erpM5UDLr-VrwYJ3h2fciuc1CQsWs2WZOjD2QzFWNN7xcvbQ@mail.gmail.com>
From: Herbert Wolverson <herberticus@gmail.com>
Date: Wed, 19 Oct 2022 09:05:49 -0500
Message-ID: <CA+erpM4mA6jJmVjnPD0DNQ7QrZRQOUzJ0zcN4_BWsoF4ZZwVxQ@mail.gmail.com>
Cc: "libreqos@lists.bufferbloat.net" <libreqos@lists.bufferbloat.net>
Content-Type: multipart/alternative; boundary="00000000000052c0b705eb63b52c"
Subject: Re: [LibreQoS] In BPF pping - so far
X-BeenThere: libreqos@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Many ISPs need the kinds of quality shaping cake can do
 <libreqos.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/libreqos>,
 <mailto:libreqos-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/libreqos>
List-Post: <mailto:libreqos@lists.bufferbloat.net>
List-Help: <mailto:libreqos-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/libreqos>,
 <mailto:libreqos-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Wed, 19 Oct 2022 14:06:01 -0000

--00000000000052c0b705eb63b52c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Also, I forgot to mention that I *think* the current version has removed
the requirement that the inbound
and outbound classifiers be placed on the same CPU. I know interduo was
particularly keen on packing
upload into fewer cores. I'll add that to my list of things to test.

On Wed, Oct 19, 2022 at 9:01 AM Herbert Wolverson <herberticus@gmail.com>
wrote:

> I'll definitely take a look - that does look interesting. I don't have X1=
1
> on any of my test VMs, but
> it looks like it can work without the GUI.
>
> Thanks!
>
> On Wed, Oct 19, 2022 at 8:58 AM Dave Taht <dave.taht@gmail.com> wrote:
>
>> could I coax you to adopt flent?
>>
>> apt-get install flent netperf irtt fping
>>
>> You sometimes have to compile netperf yourself with --enable-demo on
>> some systems.
>> There are a bunch of python libs needed for the gui, but only on the
>> client.
>>
>> Then you can run a really gnarly test series and plot the results over
>> time.
>>
>> flent --socket-stats --step-size=3D.05 -t 'the-test-conditions' -H
>> the_server_name rrul # 110 other tests
>>
>>
>> On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS
>> <libreqos@lists.bufferbloat.net> wrote:
>> >
>> > Hey,
>> >
>> > Testing the current version (
>> https://github.com/thebracket/cpumap-pping-hackjob ), it's doing better
>> than I hoped. This build has shared (not per-cpu) maps, and a userspace
>> daemon (xdp_pping) to extract and reset stats.
>> >
>> > My testing environment has grown a bit:
>> > * ShaperVM - running Ubuntu Server and LibreQoS, with the new
>> cpumap-pping-hackjob version of xdp-cpumap.
>> > * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf
>> server.
>> > * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hos=
ts
>> iperf client.
>> > * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hos=
ts
>> iperf client.
>> >
>> > ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are
>> on a virtual switch.
>> > ExtTest and the other interface (WAN facing) of ShaperVM are on a
>> different virtual switch.
>> >
>> > These are all on a host machine running Windows 11, a core i7 12th gen=
,
>> 32 GB RAM and fast SSD setup.
>> >
>> > TEST 1: DUAL STREAMS, LOW THROUGHPUT
>> >
>> > For this test, LibreQoS is configured:
>> > * Two APs, each with 5gbit/s max.
>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about
>> 100mbit/s. They map to 1:5 and 2:5 respectively (separate CPUs).
>> > * Set to use Cake
>> >
>> > On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500
>> (for a long run). Running xdp_pping yields correct results:
>> >
>> > [
>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 11},
>> > {}]
>> >
>> > Or when I waited a while to gather/reset:
>> >
>> > [
>> > {"tc":"1:5", "avg" : 4, "min" : 3, "max" : 6, "samples" : 60},
>> > {"tc":"2:5", "avg" : 4, "min" : 3, "max" : 5, "samples" : 60},
>> > {}]
>> >
>> > The ShaperVM shows no errors, just periodic logging that it is
>> recording data.  CPU is about 2-3% on two CPUs, zero on the others (as
>> expected).
>> >
>> > After 500 seconds of continual iperfing, each client reported a
>> throughput of 104 Mbit/sec and 6.06 GBytes of data transmitted.
>> >
>> > So for smaller streams, I'd call this a success.
>> >
>> > TEST 2: DUAL STREAMS, HIGH THROUGHPUT
>> >
>> > For this test, LibreQoS is configured:
>> > * Two APs, each with 5gbit/s max.
>> > * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s!
>> Mapped to 1:5 and 2:5 respectively (separate CPUs).
>> >
>> > Run iperf -c 100.64.1.1 -t 500 on each client at the same time.
>> >
>> > xdp_pping shows results, too:
>> >
>> > [
>> > {"tc":"1:5", "avg" : 4, "min" : 1, "max" : 7, "samples" : 58},
>> > {"tc":"2:5", "avg" : 7, "min" : 3, "max" : 11, "samples" : 58},
>> > {}]
>> >
>> > [
>> > {"tc":"1:5", "avg" : 5, "min" : 4, "max" : 8, "samples" : 13},
>> > {"tc":"2:5", "avg" : 8, "min" : 7, "max" : 10, "samples" : 13},
>> > {}]
>> >
>> > The ShaperVM shows two CPUs pegging between 70 and 90 percent.
>> >
>> > After 500 seconds of continual iperfing, the clients reported
>> throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226 GByte=
s).
>> >
>> > Maxing out HyperV like this is inducing a bit of latency (which is to
>> be expected), but it's not bad. I also forgot to disable hyperthreading,
>> and looking at the host performance it is sometimes running the second
>> virtual CPU on an underpowered "fake" CPU.
>> >
>> > So for two large streams, I think we're doing pretty well also!
>> >
>> > TEST 3: DUAL STREAMS, SINGLE CPU
>> >
>> > This test is designed to try and blow things up. It's the same as test
>> 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and 1=
:6.
>> >
>> > ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The
>> pping stats start to show a bit of degradation in performance for poundi=
ng
>> it so hard:
>> >
>> > [
>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 19, "samples" : 24},
>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 18, "samples" : 24},
>> > {}]
>> >
>> > For whatever reason, it smoothed out over time:
>> >
>> > [
>> > {"tc":"1:6", "avg" : 10, "min" : 9, "max" : 12, "samples" : 50},
>> > {"tc":"1:5", "avg" : 10, "min" : 8, "max" : 13, "samples" : 50},
>> > {}]
>> >
>> > Surprisingly (to me), I didn't encounter errors. Each client received
>> 2.22 Gbit/s performance, over 129 Gbytes of data.
>> >
>> > TEST 4: DUAL STREAMS, 50 SUB-STREAMS
>> >
>> > This test is also designed to break things. Same as test 3, but using
>> iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax =
the
>> flow tracking. (Shorter time window because I really wanted to go and fi=
nd
>> coffee)
>> >
>> > ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results
>> show that this torture test is worsening performance, and there's always
>> lots of samples in the buffer:
>> >
>> > [
>> > {"tc":"1:6", "avg" : 23, "min" : 19, "max" : 27, "samples" : 49},
>> > {"tc":"1:5", "avg" : 24, "min" : 19, "max" : 27, "samples" : 49},
>> > {}]
>> >
>> > This test also ran better than I expected. Each VM showed around 2.4
>> Gbit/s in total performance at the end of the iperf session. There's
>> definitely some latency creeping in as I make the system work hard,
>> which is expected - but I'm not sure I expected quite that much.
>> >
>> > WHAT'S NEXT & CONCLUSION
>> >
>> > I noticed that I forgot to turn off efficient power management on my
>> VMs and host, and left Hyperthreading on by mistake. So that hurts overa=
ll
>> performance.
>> >
>> > The base system seems to be working pretty solidly, at least for small
>> tests. Next up, I'll be removing extraneous debug reporting code, removi=
ng
>> some code paths that don't do anything but report, and looking for any
>> small optimization opportunities. I'll then re-run these tests. Once tha=
t's
>> done, I hope to find a maintenance window on my WISP and try it with act=
ual
>> traffic.
>> >
>> > I also need to re-run these tests without the pping system to provide
>> some before/after analysis.
>> >
>> > On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson <
>> herberticus@gmail.com> wrote:
>> >>
>> >> It's probably not entirely thread-safe right now (ran into some issue=
s
>> reading per_cpu maps back from userspace; hopefully, I'll get that figur=
ed
>> out) - but the commits I just pushed have it basically working on
>> single-stream testing. :-)
>> >>
>> >> Set up cpumap as usual, and periodically run xdp-pping. This gives you
>> per-connection RTT information in JSON:
>> >>
>> >> [
>> >> {"tc":"1:5", "avg" : 5, "min" : 5, "max" : 5, "samples" : 1},
>> >> {}]
>> >>
>> >> (With the extra {} because I'm not tracking the tail and haven't done
>> comma removal). The tool also empties the various maps used to gather da=
ta,
>> acting as a "reset" point. There's a max of 60 samples per queue, in a
>> ringbuffer setup (so newest will start to overwrite the oldest).
>> >>
>> >> I'll start trying to test on a larger scale now.
>> >>
>> >> On Mon, Oct 17, 2022 at 3:34 PM Robert Chac=C3=B3n <
>> robert.chacon@jackrabbitwireless.com> wrote:
>> >>>
>> >>> Hey Herbert,
>> >>>
>> >>> Fantastic work! Super exciting to see this coming together,
>> especially so quickly.
>> >>> I'll test it soon.
>> >>> I understand and agree with your decision to omit certain features
>> (ICMP tracking, DNS tracking, etc) to optimize performance for our use
>> case. Like you said, merging them is sort of the only way to combine the
>> functionality without a performance hit right now. Otherwise there would
>> be a lot of redundancy and lost throughput for an ISP's use. Though
>> hopefully long term there will be a way to keep all projects working
>> independently but interoperably with a plugin system of some kind.
>> >>>
>> >>> By the way, I'm making some headway on LibreQoS v1.3. Focusing on
>> optimizations for high sub counts (8000+ subs) as well as stateful chang=
es
>> to the queue structure.
>> >>> I'm working to set up a physical lab to test high throughput and hig=
h
>> client count scenarios.
>> >>> When testing beyond ~32,000 filters we get "no space left on device"
>> from xdp-cpumap-tc, which I think relates to the bpf map size limitation
>> you mentioned. Maybe in the coming months we can take a look at that.
>> >>>
>> >>> Anyway great work on the cpumap-pping program! Excited to see more o=
n
>> this.
>> >>>
>> >>> Thanks,
>> >>> Robert
>> >>>
>> >>> On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
>> libreqos@lists.bufferbloat.net> wrote:
>> >>>>
>> >>>> Hey,
>> >>>>
>> >>>> My current (unfinished) progress on this is now available here:
>> https://github.com/thebracket/cpumap-pping-hackjob
>> >>>>
>> >>>> I mean it about the warnings: this isn't at all stable or debugged -
>> and I can't promise that it won't unleash the nasal demons
>> >>>> (to use a popular C++ phrase). The name is descriptive! ;-)
>> >>>>
>> >>>> With that said, I'm pretty happy so far:
>> >>>>
>> >>>> * It runs only on the classifier - which xdp-cpumap-tc has nicely
>> shunted onto a dedicated CPU. It has to run on both
>> >>>>   the inbound and outbound classifiers, since otherwise it would
>> only see half the conversation.
>> >>>> * It does assume that your ingress and egress CPUs are mapped to th=
e
>> same interface; I do that anyway in BracketQoS. Not doing
>> >>>>   that opens up a potential world of pain, since writes to the
>> shared maps would require a locking scheme. Too much locking, and you lo=
se
>> all of the benefit of using multiple CPUs to begin with.
>> >>>> * It is pretty wasteful of RAM, but most of the shaper systems I've
>> worked with have lots of it.
>> >>>> * I've been gradually removing features that I don't want for
>> BracketQoS. A hypothetical future "useful to everyone" version wouldn't =
do
>> that.
>> >>>> * Rate limiting is working, but I removed the requirement for a
>> shared configuration provided from userland - so right now it's always s=
et
>> to report at 1 second intervals per stream.
>> >>>>
>> >>>> My testbed is currently 3 Hyper-V VMs - a simple "client" and
>> "world", and a "shaper" VM in between running a slightly hacked-up Libre=
QoS.
>> >>>> iperf from "client" to "world" (with Libre set to allow 10gbit/s
>> max, via a cake/HTB queue setup) is around 5 gbit/s at present, on my
>> >>>> test PC (the host is a core i7, 12th gen, 12 cores - 64gb RAM and
>> fast SSDs)
>> >>>>
>> >>>> Output currently consists of debug messages reading:
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk=
:
>> (tc) Flow open event
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 374696
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk=
:
>> (tc) Flow open event
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 247069
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 5217155
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 4515394
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 4481289
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 4255268
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 5249493
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 3795993
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 3949519
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 4365335
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 4154910
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk=
:
>> (tc) Send performance event (5,1), 4405582
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk=
:
>> (tc) Send flow event
>> >>>>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk=
:
>> (tc) Send flow event
>> >>>>
>> >>>> The times haven't been tweaked yet. The (5,1) is tc handle
>> major/minor, allocated by the xdp-cpumap parent.
>> >>>> I get pretty low latency between VMs; I'll set up a test with some
>> real-world data very soon.
>> >>>>
>> >>>> I plan to keep hacking away, but feel free to take a peek.
>> >>>>
>> >>>> Thanks,
>> >>>> Herbert
>> >>>>
>> >>>> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <
>> Simon.Sundberg@kau.se> wrote:
>> >>>>>
>> >>>>> Hi, thanks for adding me to the conversation. Just a couple of qui=
ck
>> >>>>> notes.
>> >>>>>
>> >>>>> On Mon, 2022-10-17 at 16:13 +0200, Toke H=C3=B8iland-J=C3=B8rgense=
n wrote:
>> >>>>> > [ Adding Simon to Cc ]
>> >>>>> >
>> >>>>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net>
>> writes:
>> >>>>> >
>> >>>>> > > Hey,
>> >>>>> > >
>> >>>>> > > I've had some pretty good success with merging xdp-pping (
>> >>>>> > >
>> https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>> >>>>> > > into xdp-cpumap-tc (
>> https://github.com/xdp-project/xdp-cpumap-tc ).
>> >>>>> > >
>> >>>>> > > I ported over most of the xdp-pping code, and then changed the
>> entry point
>> >>>>> > > and packet parsing code to make use of the work already done i=
n
>> >>>>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet,
>> no need to do
>> >>>>> > > it twice). Then I switched the maps to per-cpu maps, and had t=
o
>> pin them -
>> >>>>> > > otherwise the two tc instances don't properly share data.
>> >>>>> > >
>> >>>>>
>> >>>>> I guess the xdp-cpumap-tc ensures that the same flow is processed =
on
>> >>>>> the same CPU core at both ingress and egress. Otherwise, if a flow
>> may
>> >>>>> be processed by different cores on ingress and egress the per-CPU
>> maps
>> >>>>> will not really work reliably as each core will have a different
>> view
>> >>>>> on the state of the flow, if there's been a previous packet with a
>> >>>>> certain TSval from that flow etc.
>> >>>>>
>> >>>>> Furthermore, if a flow is always processed on the same core (on bo=
th
>> >>>>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>> >>>>> memory. From my understanding the keys for per-CPU maps are still
>> >>>>> shared across all CPUs, it's just that each CPU gets its own value=
.
>> So
>> >>>>> all CPUs will then have their own data for each flow, but it's onl=
y
>> the
>> >>>>> CPU processing the flow that will have any relevant data for the
>> flow
>> >>>>> while the remaining CPUs will just have an empty state for that
>> flow.
>> >>>>> Under the same assumption that packets within the same flow are
>> always
>> >>>>> processed on the same core there should generally not be any
>> >>>>> concurrency issues with having a global (non-per-CPU) map either as
>> packets
>> >>>>> from the same flow cannot be processed concurrently then (and thus
>> no
>> >>>>> concurrent access to the same value in the map). I am however stil=
l
>> >>>>> very unclear on if there's any considerable performance impact
>> between
>> >>>>> global and per-CPU map versions if the same key is not accessed
>> >>>>> concurrently.
>> >>>>>
>> >>>>> > > Right now, output
>> >>>>> > > is just stubbed - I've still got to port the perfmap output
>> code. Instead,
>> >>>>> > > I'm dumping a bunch of extra data to the kernel debug pipe, so
>> I can see
>> >>>>> > > roughly what the output would look like.
>> >>>>> > >
>> >>>>> > > With debug enabled and just logging I'm now getting about 4.9
>> Gbits/sec on
>> >>>>> > > single-stream iperf between two VMs (with a shaper VM in the
>> middle). :-)
>> >>>>> >
>> >>>>> > Just FYI, that "just logging" is probably the biggest source of
>> >>>>> > overhead, then. What Simon found was that sending the data from
>> kernel
>> >>>>> > to userspace is one of the most expensive bits of epping, at
>> least when
>> >>>>> > the number of data points goes up (which it does as additional
>> flows are
>> >>>>> > added).
>> >>>>>
>> >>>>> Yeah, reporting individual RTTs when there's lots of them (you may
>> get
>> >>>>> upwards of 1000 RTTs/s per flow) is not only problematic in terms =
of
>> >>>>> direct overhead from the tool itself, but also becomes demanding f=
or
>> >>>>> whatever you use all those RTT samples for (i.e. need to log, pars=
e,
>> >>>>> analyze etc. a very large amount of RTTs). One way to deal with
>> that is
>> >>>>> of course to just apply some sort of sampling (the -r/--rate-limit
>> and
>> >>>>> -R/--rtt-rate
>> >>>>> >
>> >>>>> > > So my question: how would you prefer to receive this data? I'l=
l
>> have to
>> >>>>> > > write a daemon that provides userspace control (periodic
>> cleanup as well as
>> >>>>> > > reading the performance stream), so the world's kinda our
>> oyster. I can
>> >>>>> > > stick to Kathie's original format (and dump it to a named pipe=
,
>> perhaps?),
>> >>>>> > > a condensed format that only shows what you want to use, an
>> efficient
>> >>>>> > > binary format if you feel like parsing that...
>> >>>>> >
>> >>>>> > It would be great if we could combine efforts a bit here so we
>> don't
>> >>>>> > fork the codebase more than we have to. I.e., if "upstream"
>> epping and
>> >>>>> > whatever daemon you end up writing can agree on data format etc
>> that
>> >>>>> > would be fantastic! Added Simon to Cc to facilitate this :)
>> >>>>> >
>> >>>>> > Briefly what I've discussed before with Simon was to have the
>> ability to
>> >>>>> > aggregate the metrics in the kernel (WiP PR [0]) and have a
>> userspace
>> >>>>> > utility periodically pull them out. What we discussed was doing
>> this
>> >>>>> > using an LPM map (which is not in that PR yet). The idea would b=
e
>> that
>> >>>>> > userspace would populate the LPM map with the keys (prefixes) th=
ey
>> >>>>> > wanted statistics for (in LibreQOS context that could be one key
>> per
>> >>>>> > customer, for instance). Epping would then do a map lookup into
>> the LPM,
>> >>>>> > and if it gets a match it would update the statistics in that ma=
p
>> entry
>> >>>>> > (keeping a histogram of latency values seen, basically). Simon's
>> PR
>> >>>>> > below uses this technique where userspace will "reset" the
>> histogram
>> >>>>> > every time it loads it by swapping out two different map entries
>> when it
>> >>>>> > does a read; this allows you to control the sampling rate from
>> >>>>> > userspace, and you'll just get the data since the last time you
>> polled.
>> >>>>>
>> >>>>> Thanks Toke for summarizing both the current state and the plan
>> going
>> >>>>> forward. I will just note that this PR (and all my other work with
>> >>>>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or
>> less
>> >>>>> on hold for a couple of weeks right now as I'm trying to finish up=
 a
>> >>>>> paper.
>> >>>>>
>> >>>>> > I was thinking that if we all can agree on the map format, then
>> your
>> >>>>> > polling daemon could be one userspace "client" for that, and the
>> epping
>> >>>>> > binary itself could be another; but we could keep compatibility
>> between
>> >>>>> > the two, so we don't duplicate effort.
>> >>>>> >
>> >>>>> > Similarly, refactoring of the epping code itself so it can be
>> plugged
>> >>>>> > into the cpumap-tc code would be a good goal...
>> >>>>>
>> >>>>> Should probably do that...at some point. In general I think it's a
>> bit
>> >>>>> of an interesting problem to think about how to chain multiple
>> XDP/tc
>> >>>>> programs together in an efficient way. Most XDP and tc programs wi=
ll
>> do
>> >>>>> some amount of packet parsing and when you have many chained
>> programs
>> >>>>> parsing the same packets this obviously becomes a bit wasteful. At
>> the
>> >>>>> same time it would be nice if one didn't need to manually merge
>> >>>>> multiple programs together into a single one like this to get rid =
of
>> >>>>> this duplicated parsing, or at least make that process of merging
>> those
>> >>>>> programs as simple as possible.
>> >>>>>
>> >>>>>
>> >>>>> > -Toke
>> >>>>> >
>> >>>>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>> >>>>>
>> >>>>> N=C3=A4r du skickar e-post till Karlstads universitet behandlar vi=
 dina
>> personuppgifter<https://www.kau.se/gdpr>.
>> >>>>> When you send an e-mail to Karlstad University, we will process
>> your personal data<https://www.kau.se/en/gdpr>.
>> >>>>
>> >>>> _______________________________________________
>> >>>> LibreQoS mailing list
>> >>>> LibreQoS@lists.bufferbloat.net
>> >>>> https://lists.bufferbloat.net/listinfo/libreqos
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Robert Chac=C3=B3n
>> >>> CEO | JackRabbit Wireless LLC
>> >
>> > _______________________________________________
>> > LibreQoS mailing list
>> > LibreQoS@lists.bufferbloat.net
>> > https://lists.bufferbloat.net/listinfo/libreqos
>>
>>
>>
>> --
>> This song goes out to all the folk that thought Stadia would work:
>>
>> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-69813666=
65607352320-FXtz
>> Dave T=C3=A4ht CEO, TekLibre, LLC
>>
>

--00000000000052c0b705eb63b52c
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div>Also, I forgot to mention that I *think* the current =
version has removed the requirement that the inbound <br></div><div>and out=
bound classifiers be placed on the same CPU. I know interduo was particular=
ly keen on packing <br></div><div>upload into fewer cores. I&#39;ll add tha=
t to my list of things to test.<br></div></div><br><div class=3D"gmail_quot=
e"><div dir=3D"ltr" class=3D"gmail_attr">On Wed, Oct 19, 2022 at 9:01 AM He=
rbert Wolverson &lt;<a href=3D"mailto:herberticus@gmail.com">herberticus@gm=
ail.com</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quote" style=3D"=
margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-lef=
t:1ex"><div dir=3D"ltr"><div>I&#39;ll definitely take a look - that does lo=
ok interesting. I don&#39;t have X11 on any of my test VMs, but</div><div>i=
t looks like it can work without the GUI.</div><div><br></div><div>Thanks!<=
br></div></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" class=3D"gma=
il_attr">On Wed, Oct 19, 2022 at 8:58 AM Dave Taht &lt;<a href=3D"mailto:da=
ve.taht@gmail.com" target=3D"_blank">dave.taht@gmail.com</a>&gt; wrote:<br>=
</div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;b=
order-left:1px solid rgb(204,204,204);padding-left:1ex">could I coax you to=
 adopt flent?<br>
<br>
apt-get install flent netperf irtt fping<br>
<br>
You sometimes have to compile netperf yourself with --enable-demo on<br>
some systems.<br>
There are a bunch of python libs needed for the gui, but only on the client=
.<br>
<br>
Then you can run a really gnarly test series and plot the results over time=
.<br>
<br>
flent --socket-stats --step-size=3D.05 -t &#39;the-test-conditions&#39; -H<=
br>
the_server_name rrul # 110 other tests<br>
<br>
<br>
On Wed, Oct 19, 2022 at 6:44 AM Herbert Wolverson via LibreQoS<br>
&lt;<a href=3D"mailto:libreqos@lists.bufferbloat.net" target=3D"_blank">lib=
reqos@lists.bufferbloat.net</a>&gt; wrote:<br>
&gt;<br>
&gt; Hey,<br>
&gt;<br>
&gt; Testing the current version ( <a href=3D"https://github.com/thebracket=
/cpumap-pping-hackjob" rel=3D"noreferrer" target=3D"_blank">https://github.=
com/thebracket/cpumap-pping-hackjob</a> ), it&#39;s doing better than I hop=
ed. This build has shared (not per-cpu) maps, and a userspace daemon (xdp_p=
ping) to extract and reset stats.<br>
&gt;<br>
&gt; My testing environment has grown a bit:<br>
&gt; * ShaperVM - running Ubuntu Server and LibreQoS, with the new cpumap-p=
ping-hackjob version of xdp-cpumap.<br>
&gt; * ExtTest - running Ubuntu Server, set as 100.64.1.1. Hosts an iperf s=
erver.<br>
&gt; * ClientInt1 - running Ubuntu Server (minimal), set as 100.64.1.2. Hos=
ts iperf client.<br>
&gt; * ClientInt2 - running Ubuntu Server (minimal), set as 100.64.1.3. Hos=
ts iperf client.<br>
&gt;<br>
&gt; ClientInt1, ClientInt2 and one interface (LAN facing) of ShaperVM are =
on a virtual switch.<br>
&gt; ExtTest and the other interface (WAN facing) of ShaperVM are on a diff=
erent virtual switch.<br>
&gt;<br>
&gt; These are all on a host machine running Windows 11, a core i7 12th gen=
, 32 GB RAM and fast SSD setup.<br>
&gt;<br>
&gt; TEST 1: DUAL STREAMS, LOW THROUGHPUT<br>
&gt;<br>
&gt; For this test, LibreQoS is configured:<br>
&gt; * Two APs, each with 5gbit/s max.<br>
&gt; * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to about 100mb=
it/s. They map to 1:5 and 2:5 respectively (separate CPUs).<br>
&gt; * Set to use Cake<br>
&gt;<br>
&gt; On each client, roughly simultaneously run: iperf -c 100.64.1.1 -t 500=
 (for a long run). Running xdp_pping yields correct results:<br>
&gt;<br>
&gt; [<br>
&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 4, &quot;min&quot; =
: 3, &quot;max&quot; : 5, &quot;samples&quot; : 11},<br>
&gt; {&quot;tc&quot;:&quot;2:5&quot;, &quot;avg&quot; : 4, &quot;min&quot; =
: 3, &quot;max&quot; : 5, &quot;samples&quot; : 11},<br>
&gt; {}]<br>
&gt;<br>
&gt; Or when I waited a while to gather/reset:<br>
&gt;<br>
&gt; [<br>
&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 4, &quot;min&quot; =
: 3, &quot;max&quot; : 6, &quot;samples&quot; : 60},<br>
&gt; {&quot;tc&quot;:&quot;2:5&quot;, &quot;avg&quot; : 4, &quot;min&quot; =
: 3, &quot;max&quot; : 5, &quot;samples&quot; : 60},<br>
&gt; {}]<br>
&gt;<br>
&gt; The ShaperVM shows no errors, just periodic logging that it is recordi=
ng data.=C2=A0 CPU is about 2-3% on two CPUs, zero on the others (as expect=
ed).<br>
&gt;<br>
&gt; After 500 seconds of continual iperfing, each client reported a throug=
hput of 104 Mbit/sec and 6.06 GBytes of data transmitted.<br>
&gt;<br>
&gt; So for smaller streams, I&#39;d call this a success.<br>
&gt;<br>
&gt; TEST 2: DUAL STREAMS, HIGH THROUGHPUT<br>
&gt;<br>
&gt; For this test, LibreQoS is configured:<br>
&gt; * Two APs, each with 5gbit/s max.<br>
&gt; * 100.64.1.2 and 100.64.1.3 setup as CPEs, each limited to 5Gbit/s! Ma=
pped to 1:5 and 2:5 respectively (separate CPUs).<br>
&gt;<br>
&gt; Run iperf -c 100.64.1.1 -t 500 on each client at the same time.<br>
&gt;<br>
&gt; xdp_pping shows results, too:<br>
&gt;<br>
&gt; [<br>
&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 4, &quot;min&quot; =
: 1, &quot;max&quot; : 7, &quot;samples&quot; : 58},<br>
&gt; {&quot;tc&quot;:&quot;2:5&quot;, &quot;avg&quot; : 7, &quot;min&quot; =
: 3, &quot;max&quot; : 11, &quot;samples&quot; : 58},<br>
&gt; {}]<br>
&gt;<br>
&gt; [<br>
&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 5, &quot;min&quot; =
: 4, &quot;max&quot; : 8, &quot;samples&quot; : 13},<br>
&gt; {&quot;tc&quot;:&quot;2:5&quot;, &quot;avg&quot; : 8, &quot;min&quot; =
: 7, &quot;max&quot; : 10, &quot;samples&quot; : 13},<br>
&gt; {}]<br>
&gt;<br>
&gt; The ShaperVM shows two CPUs pegging between 70 and 90 percent.<br>
&gt;<br>
&gt; After 500 seconds of continual iperfing, the clients reported=
 throughputs of 2.72 Gbits/sec (158 GBytes) and 3.89 Gbits/sec (226=
 GBytes), respectively.<br>
&gt;<br>
&gt; Maxing out HyperV like this is inducing a bit of latency (which is to =
be expected), but it&#39;s not bad. I also forgot to disable hyperthreading=
, and looking at the host performance it is sometimes running the second vi=
rtual CPU on an underpowered &quot;fake&quot; CPU.<br>
&gt;<br>
&gt; So for two large streams, I think we&#39;re doing pretty well also!<br=
>
&gt;<br>
&gt; TEST 3: DUAL STREAMS, SINGLE CPU<br>
&gt;<br>
&gt; This test is designed to try and blow things up. It&#39;s the same as =
test 2, but both CPEs are set to the same CPU (1), using TC handles 1:5 and=
 1:6.<br>
&gt;<br>
&gt; ShaperVM CPU1 maxed out in the high 90s, the other CPUs were idle. The=
 pping stats start to show a bit of degradation in performance for pounding=
 it so hard:<br>
&gt;<br>
&gt; [<br>
&gt; {&quot;tc&quot;:&quot;1:6&quot;, &quot;avg&quot; : 10, &quot;min&quot;=
 : 9, &quot;max&quot; : 19, &quot;samples&quot; : 24},<br>
&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 10, &quot;min&quot;=
 : 8, &quot;max&quot; : 18, &quot;samples&quot; : 24},<br>
&gt; {}]<br>
&gt;<br>
&gt; For whatever reason, it smoothed out over time:<br>
&gt;<br>
&gt; [<br>
&gt; {&quot;tc&quot;:&quot;1:6&quot;, &quot;avg&quot; : 10, &quot;min&quot;=
 : 9, &quot;max&quot; : 12, &quot;samples&quot; : 50},<br>
&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 10, &quot;min&quot;=
 : 8, &quot;max&quot; : 13, &quot;samples&quot; : 50},<br>
&gt; {}]<br>
&gt;<br>
&gt; Surprisingly (to me), I didn&#39;t encounter errors. Each client recei=
ved 2.22 Gbit/s performance, over 129 Gbytes of data.<br>
&gt;<br>
&gt; TEST 4: DUAL STREAMS, 50 SUB-STREAMS<br>
&gt;<br>
&gt; This test is also designed to break things. Same as test 3, but using =
iperf -c 100.64.1.1 -P 50 -t 120 - 50 substreams, to try and really tax the=
 flow tracking. (Shorter time window because I really wanted to go and find=
 coffee)<br>
&gt;<br>
&gt; ShaperVM CPU sat at around 80-97%, tending towards 97%. pping results =
show that this torture test is worsening performance, and there&#39;s alway=
s lots of samples in the buffer:<br>
&gt;<br>
&gt; [<br>
&gt; {&quot;tc&quot;:&quot;1:6&quot;, &quot;avg&quot; : 23, &quot;min&quot;=
 : 19, &quot;max&quot; : 27, &quot;samples&quot; : 49},<br>
&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 24, &quot;min&quot;=
 : 19, &quot;max&quot; : 27, &quot;samples&quot; : 49},<br>
&gt; {}]<br>
&gt;<br>
&gt; This test also ran better than I expected. Each VM showed around 2.4=
 Gbit/s in total performance at the end of the iperf session. There&#39;s=
 definitely some latency creeping in as I make the system work hard,=
 which is expected - but I&#39;m not sure I expected quite that much.<br>
&gt;<br>
&gt; WHAT&#39;S NEXT &amp; CONCLUSION<br>
&gt;<br>
&gt; I noticed that I forgot to turn off efficient power management on my V=
Ms and host, and left Hyperthreading on by mistake. So that hurts overall p=
erformance.<br>
&gt;<br>
&gt; The base system seems to be working pretty solidly, at least for small=
 tests. Next up, I&#39;ll be removing extraneous debug reporting code,=
 removing some code paths that don&#39;t do anything but report, and=
 looking for any small optimization opportunities. I&#39;ll then re-run=
 these tests. Once that&#39;s done, I hope to find a maintenance window=
 on my WISP and try it with actual traffic.<br>
&gt;<br>
&gt; I also need to re-run these tests without the pping system to provide =
some before/after analysis.<br>
&gt;<br>
&gt; On Tue, Oct 18, 2022 at 1:01 PM Herbert Wolverson &lt;<a href=3D"mailt=
o:herberticus@gmail.com" target=3D"_blank">herberticus@gmail.com</a>&gt; wr=
ote:<br>
&gt;&gt;<br>
&gt;&gt; It&#39;s probably not entirely thread-safe right now (ran into som=
e issues reading per_cpu maps back from userspace; hopefully, I&#39;ll get =
that figured out) - but the commits I just pushed have it basically working=
 on single-stream testing. :-)<br>
&gt;&gt;<br>
&gt;&gt; Setup cpumap as usual, and periodically run xdp-pping. This gives =
you per-connection RTT information in JSON:<br>
&gt;&gt;<br>
&gt;&gt; [<br>
&gt;&gt; {&quot;tc&quot;:&quot;1:5&quot;, &quot;avg&quot; : 5, &quot;min&qu=
ot; : 5, &quot;max&quot; : 5, &quot;samples&quot; : 1},<br>
&gt;&gt; {}]<br>
&gt;&gt;<br>
&gt;&gt; (With the extra {} because I&#39;m not tracking the tail and haven=
&#39;t done comma removal). The tool also empties the various maps used to =
gather data, acting as a &quot;reset&quot; point. There&#39;s a max of 60 s=
amples per queue, in a ringbuffer setup (so newest will start to overwrite =
the oldest).<br>
&gt;&gt;<br>
&gt;&gt; I&#39;ll start trying to test on a larger scale now.<br>
&gt;&gt;<br>
&gt;&gt; On Mon, Oct 17, 2022 at 3:34 PM Robert Chac=C3=B3n &lt;<a href=3D"=
mailto:robert.chacon@jackrabbitwireless.com" target=3D"_blank">robert.chaco=
n@jackrabbitwireless.com</a>&gt; wrote:<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; Hey Herbert,<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; Fantastic work! Super exciting to see this coming together, es=
pecially so quickly.<br>
&gt;&gt;&gt; I&#39;ll test it soon.<br>
&gt;&gt;&gt; I understand and agree with your decision to omit certain=
 features (ICMP tracking, DNS tracking, etc) to optimize performance for=
 our use case. Like you said, in order to merge the functionality without=
 a performance hit, merging them is sort of the only way right now.=
 Otherwise there would be a lot of redundancy and lost throughput for an=
 ISP&#39;s use. Though hopefully long term there will be a way to keep all=
 projects working independently but interoperably with a plugin system of=
 some kind.<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; By the way, I&#39;m making some headway on LibreQoS v1.3. Focu=
sing on optimizations for high sub counts (8000+ subs) as well as stateful =
changes to the queue structure.<br>
&gt;&gt;&gt; I&#39;m working to set up a physical lab to test high throughp=
ut and high client count scenarios.<br>
&gt;&gt;&gt; When testing beyond ~32,000 filters we get &quot;no space left=
 on device&quot; from xdp-cpumap-tc, which I think relates to the bpf map s=
ize limitation you mentioned. Maybe in the coming months we can take a look=
 at that.<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; Anyway great work on the cpumap-pping program! Excited to see =
more on this.<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; Thanks,<br>
&gt;&gt;&gt; Robert<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQo=
S &lt;<a href=3D"mailto:libreqos@lists.bufferbloat.net" target=3D"_blank">l=
ibreqos@lists.bufferbloat.net</a>&gt; wrote:<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; Hey,<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; My current (unfinished) progress on this is now available =
here: <a href=3D"https://github.com/thebracket/cpumap-pping-hackjob" rel=3D=
"noreferrer" target=3D"_blank">https://github.com/thebracket/cpumap-pping-h=
ackjob</a><br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; I mean it about the warnings, this isn&#39;t at all stable=
, debugged - and can&#39;t promise that it won&#39;t unleash the nasal demo=
ns<br>
&gt;&gt;&gt;&gt; (to use a popular C++ phrase). The name is descriptive! ;-=
)<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; With that said, I&#39;m pretty happy so far:<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; * It runs only on the classifier - which xdp-cpumap-tc has=
 nicely shunted onto a dedicated CPU. It has to run on both<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0the inbound and outbound classifiers, since ot=
herwise it would only see half the conversation.<br>
&gt;&gt;&gt;&gt; * It does assume that your ingress and egress CPUs are map=
ped to the same interface; I do that anyway in BracketQoS. Not doing<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0that opens up a potential world of pain, since=
 writes to the shared maps would require a locking scheme. Too much locking=
, and you lose all of the benefit of using multiple CPUs to begin with.<br>
&gt;&gt;&gt;&gt; * It is pretty wasteful of RAM, but most of the shaper sys=
tems I&#39;ve worked with have lots of it.<br>
&gt;&gt;&gt;&gt; * I&#39;ve been gradually removing features that I don&#39=
;t want for BracketQoS. A hypothetical future &quot;useful to everyone&quot=
; version wouldn&#39;t do that.<br>
&gt;&gt;&gt;&gt; * Rate limiting is working, but I removed the requirement =
for a shared configuration provided from userland - so right now it&#39;s a=
lways set to report at 1 second intervals per stream.<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; My testbed is currently 3 Hyper-V VMs - a simple &quot;cli=
ent&quot; and &quot;world&quot;, and a &quot;shaper&quot; VM in between run=
ning a slightly hacked-up LibreQoS.<br>
&gt;&gt;&gt;&gt; iperf from &quot;client&quot; to &quot;world&quot; (with L=
ibre set to allow 10gbit/s max, via a cake/HTB queue setup) is around 5 gbi=
t/s at present, on my<br>
&gt;&gt;&gt;&gt; test PC (the host is a core i7, 12th gen, 12 cores - 64gb =
RAM and fast SSDs)<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; Output currently consists of debug messages reading:<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0515.399222: bpf_trace_printk: (tc) Flow open event<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0515.399239: bpf_trace_printk: (tc) Send performance event (5,1=
), 374696<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0515.399466: bpf_trace_printk: (tc) Flow open event<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0515.399475: bpf_trace_printk: (tc) Send performance event (5,1=
), 247069<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0516.405151: bpf_trace_printk: (tc) Send performance event (5,1=
), 5217155<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0517.405248: bpf_trace_printk: (tc) Send performance event (5,1=
), 4515394<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0518.406117: bpf_trace_printk: (tc) Send performance event (5,1=
), 4481289<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0519.406255: bpf_trace_printk: (tc) Send performance event (5,1=
), 4255268<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0520.407864: bpf_trace_printk: (tc) Send performance event (5,1=
), 5249493<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0521.406664: bpf_trace_printk: (tc) Send performance event (5,1=
), 3795993<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0522.407469: bpf_trace_printk: (tc) Send performance event (5,1=
), 3949519<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0523.408126: bpf_trace_printk: (tc) Send performance event (5,1=
), 4365335<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0524.408929: bpf_trace_printk: (tc) Send performance event (5,1=
), 4154910<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0525.410048: bpf_trace_printk: (tc) Send performance event (5,1=
), 4405582<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0525.434080: bpf_trace_printk: (tc) Send flow event<br>
&gt;&gt;&gt;&gt;=C2=A0 =C2=A0cpumap/0/map:4-1371=C2=A0 =C2=A0 [000] D..2.=
=C2=A0 =C2=A0525.482714: bpf_trace_printk: (tc) Send flow event<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; The times haven&#39;t been tweaked yet. The (5,1) is tc ha=
ndle major/minor, allocated by the xdp-cpumap parent.<br>
&gt;&gt;&gt;&gt; I get pretty low latency between VMs; I&#39;ll set up a te=
st with some real-world data very soon.<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; I plan to keep hacking away, but feel free to take a peek.=
<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; Thanks,<br>
&gt;&gt;&gt;&gt; Herbert<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg &lt;<a hre=
f=3D"mailto:Simon.Sundberg@kau.se" target=3D"_blank">Simon.Sundberg@kau.se<=
/a>&gt; wrote:<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; Hi, thanks for adding me to the conversation. Just a c=
ouple of quick<br>
&gt;&gt;&gt;&gt;&gt; notes.<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; On Mon, 2022-10-17 at 16:13 +0200, Toke H=C3=B8iland-J=
=C3=B8rgensen wrote:<br>
&gt;&gt;&gt;&gt;&gt; &gt; [ Adding Simon to Cc ]<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; Herbert Wolverson via LibreQoS &lt;<a href=3D"mai=
lto:libreqos@lists.bufferbloat.net" target=3D"_blank">libreqos@lists.buffer=
bloat.net</a>&gt; writes:<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; Hey,<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; I&#39;ve had some pretty good success with m=
erging xdp-pping (<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; <a href=3D"https://github.com/xdp-project/bp=
f-examples/blob/master/pping/pping.h" rel=3D"noreferrer" target=3D"_blank">=
https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h</a> )=
<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; into xdp-cpumap-tc ( <a href=3D"https://gith=
ub.com/xdp-project/xdp-cpumap-tc" rel=3D"noreferrer" target=3D"_blank">http=
s://github.com/xdp-project/xdp-cpumap-tc</a> ).<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; I ported over most of the xdp-pping code, an=
d then changed the entry point<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; and packet parsing code to make use of the w=
ork already done in<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; xdp-cpumap-tc (it&#39;s already parsed a big=
 chunk of the packet, no need to do<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; it twice). Then I switched the maps to per-c=
pu maps, and had to pin them -<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; otherwise the two tc instances don&#39;t pro=
perly share data.<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; I guess the xdp-cpumap-tc ensures that the same flow i=
s processed on<br>
&gt;&gt;&gt;&gt;&gt; the same CPU core at both ingress and egress.=
 Otherwise, if a flow may<br>
&gt;&gt;&gt;&gt;&gt; be processed by different cores on ingress and egress =
the per-CPU maps<br>
&gt;&gt;&gt;&gt;&gt; will not really work reliably as each core will have a=
 different view<br>
&gt;&gt;&gt;&gt;&gt; on the state of the flow, if there&#39;s been a previo=
us packet with a<br>
&gt;&gt;&gt;&gt;&gt; certain TSval from that flow etc.<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; Furthermore, if a flow is always processed on the same=
 core (on both<br>
&gt;&gt;&gt;&gt;&gt; ingress and egress) I think per-CPU maps may be a bit =
wasteful on<br>
&gt;&gt;&gt;&gt;&gt; memory. From my understanding the keys for per-CPU map=
s are still<br>
&gt;&gt;&gt;&gt;&gt; shared across all CPUs, it&#39;s just that each CPU ge=
ts its own value. So<br>
&gt;&gt;&gt;&gt;&gt; all CPUs will then have their own data for each flow, =
but it&#39;s only the<br>
&gt;&gt;&gt;&gt;&gt; CPU processing the flow that will have any relevant da=
ta for the flow<br>
&gt;&gt;&gt;&gt;&gt; while the remaining CPUs will just have an empty state=
 for that flow.<br>
&gt;&gt;&gt;&gt;&gt; Under the same assumption that packets within the same=
 flow are always<br>
&gt;&gt;&gt;&gt;&gt; processed on the same core there should generally not =
be any<br>
&gt;&gt;&gt;&gt;&gt; concurrency issues with having a global (non-per-CPU) =
map either as packets<br>
&gt;&gt;&gt;&gt;&gt; from the same flow cannot be processed concurrently th=
en (and thus no<br>
&gt;&gt;&gt;&gt;&gt; concurrent access to the same value in the map). I am =
however still<br>
&gt;&gt;&gt;&gt;&gt; very unclear on if there&#39;s any considerable perfor=
mance impact between<br>
&gt;&gt;&gt;&gt;&gt; global and per-CPU map versions if the same key is not=
 accessed<br>
&gt;&gt;&gt;&gt;&gt; concurrently.<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; Right now, output<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; is just stubbed - I&#39;ve still got to port=
 the perfmap output code. Instead,<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; I&#39;m dumping a bunch of extra data to the=
 kernel debug pipe, so I can see<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; roughly what the output would look like.<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; With debug enabled and just logging I&#39;m =
now getting about 4.9 Gbits/sec on<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; single-stream iperf between two VMs (with a =
shaper VM in the middle). :-)<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; Just FYI, that &quot;just logging&quot; is probab=
ly the biggest source of<br>
&gt;&gt;&gt;&gt;&gt; &gt; overhead, then. What Simon found was that sending=
 the data from kernel<br>
&gt;&gt;&gt;&gt;&gt; &gt; to userspace is one of the most expensive bits of=
 epping, at least when<br>
&gt;&gt;&gt;&gt;&gt; &gt; the number of data points goes up (which it does =
as additional flows are<br>
&gt;&gt;&gt;&gt;&gt; &gt; added).<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; Yeah, reporting individual RTTs when there&#39;s lots =
of them (you may get<br>
&gt;&gt;&gt;&gt;&gt; upwards of 1000 RTTs/s per flow) is not only problemat=
ic in terms of<br>
&gt;&gt;&gt;&gt;&gt; direct overhead from the tool itself, but also becomes=
 demanding for<br>
&gt;&gt;&gt;&gt;&gt; whatever you use all those RTT samples for (i.e. need =
to log, parse,<br>
&gt;&gt;&gt;&gt;&gt; analyze etc. a very large amount of RTTs). One way to =
deal with that is<br>
&gt;&gt;&gt;&gt;&gt; of course to just apply some sort of sampling (the -r/=
--rate-limit and<br>
&gt;&gt;&gt;&gt;&gt; -R/--rtt-rate<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; So my question: how would you prefer to rece=
ive this data? I&#39;ll have to<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; write a daemon that provides userspace contr=
ol (periodic cleanup as well as<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; reading the performance stream), so the worl=
d&#39;s kinda our oyster. I can<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; stick to Kathie&#39;s original format (and d=
ump it to a named pipe, perhaps?),<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; a condensed format that only shows what you =
want to use, an efficient<br>
&gt;&gt;&gt;&gt;&gt; &gt; &gt; binary format if you feel like parsing that.=
..<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; It would be great if we could combine efforts a b=
it here so we don&#39;t<br>
&gt;&gt;&gt;&gt;&gt; &gt; fork the codebase more than we have to. I.e., if =
&quot;upstream&quot; epping and<br>
&gt;&gt;&gt;&gt;&gt; &gt; whatever daemon you end up writing can agree on d=
ata format etc that<br>
&gt;&gt;&gt;&gt;&gt; &gt; would be fantastic! Added Simon to Cc to facilita=
te this :)<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; Briefly what I&#39;ve discussed before with Simon=
 was to have the ability to<br>
&gt;&gt;&gt;&gt;&gt; &gt; aggregate the metrics in the kernel (WiP PR [0]) =
and have a userspace<br>
&gt;&gt;&gt;&gt;&gt; &gt; utility periodically pull them out. What we discu=
ssed was doing this<br>
&gt;&gt;&gt;&gt;&gt; &gt; using an LPM map (which is not in that PR yet). T=
he idea would be that<br>
&gt;&gt;&gt;&gt;&gt; &gt; userspace would populate the LPM map with the key=
s (prefixes) they<br>
&gt;&gt;&gt;&gt;&gt; &gt; wanted statistics for (in LibreQOS context that c=
ould be one key per<br>
&gt;&gt;&gt;&gt;&gt; &gt; customer, for instance). Epping would then do a m=
ap lookup into the LPM,<br>
&gt;&gt;&gt;&gt;&gt; &gt; and if it gets a match it would update the statis=
tics in that map entry<br>
&gt;&gt;&gt;&gt;&gt; &gt; (keeping a histogram of latency values seen, basi=
cally). Simon&#39;s PR<br>
&gt;&gt;&gt;&gt;&gt; &gt; below uses this technique where userspace will &q=
uot;reset&quot; the histogram<br>
&gt;&gt;&gt;&gt;&gt; &gt; every time it loads it by swapping out two differ=
ent map entries when it<br>
&gt;&gt;&gt;&gt;&gt; &gt; does a read; this allows you to control the sampl=
ing rate from<br>
&gt;&gt;&gt;&gt;&gt; &gt; userspace, and you&#39;ll just get the data since=
 the last time you polled.<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; Thanks Toke for summarizing both the current state=
 and the plan going<br>
&gt;&gt;&gt;&gt;&gt; forward. I will just note that this PR (and all my oth=
er work with<br>
&gt;&gt;&gt;&gt;&gt; ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will=
 be more or less<br>
&gt;&gt;&gt;&gt;&gt; on hold for a couple of weeks right now as I&#39;m try=
ing to finish up a<br>
&gt;&gt;&gt;&gt;&gt; paper.<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; I was thinking that if we all can agree on the ma=
p format, then your<br>
&gt;&gt;&gt;&gt;&gt; &gt; polling daemon could be one userspace &quot;clien=
t&quot; for that, and the epping<br>
&gt;&gt;&gt;&gt;&gt; &gt; binary itself could be another; but we could keep=
 compatibility between<br>
&gt;&gt;&gt;&gt;&gt; &gt; the two, so we don&#39;t duplicate effort.<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; Similarly, refactoring of the epping code itself =
so it can be plugged<br>
&gt;&gt;&gt;&gt;&gt; &gt; into the cpumap-tc code would be a good goal...<b=
r>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; Should probably do that...at some point. In general I =
think it&#39;s a bit<br>
&gt;&gt;&gt;&gt;&gt; of an interesting problem to think about how to chain =
multiple XDP/tc<br>
&gt;&gt;&gt;&gt;&gt; programs together in an efficient way. Most XDP and=
 tc programs will do<br>
&gt;&gt;&gt;&gt;&gt; some amount of packet parsing and when you have many c=
hained programs<br>
&gt;&gt;&gt;&gt;&gt; parsing the same packets this obviously becomes a bit =
wasteful. In the<br>
&gt;&gt;&gt;&gt;&gt; same time it would be nice if one didn&#39;t need to m=
anually merge<br>
&gt;&gt;&gt;&gt;&gt; multiple programs together into a single one like this=
 to get rid of<br>
&gt;&gt;&gt;&gt;&gt; this duplicated parsing, or at least make that process=
 of merging those<br>
&gt;&gt;&gt;&gt;&gt; programs as simple as possible.<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; -Toke<br>
&gt;&gt;&gt;&gt;&gt; &gt;<br>
&gt;&gt;&gt;&gt;&gt; &gt; [0] <a href=3D"https://github.com/xdp-project/bpf=
-examples/pull/59" rel=3D"noreferrer" target=3D"_blank">https://github.com/=
xdp-project/bpf-examples/pull/59</a><br>
&gt;&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt;&gt; When you send an e-mail to Karlstad University, we wil=
l process your personal data&lt;<a href=3D"https://www.kau.se/en/gdpr" rel=
=3D"noreferrer" target=3D"_blank">https://www.kau.se/en/gdpr</a>&gt;.<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; _______________________________________________<br>
&gt;&gt;&gt;&gt; LibreQoS mailing list<br>
&gt;&gt;&gt;&gt; <a href=3D"mailto:LibreQoS@lists.bufferbloat.net" target=
=3D"_blank">LibreQoS@lists.bufferbloat.net</a><br>
&gt;&gt;&gt;&gt; <a href=3D"https://lists.bufferbloat.net/listinfo/libreqos=
" rel=3D"noreferrer" target=3D"_blank">https://lists.bufferbloat.net/listin=
fo/libreqos</a><br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; --<br>
&gt;&gt;&gt; Robert Chac=C3=B3n<br>
&gt;&gt;&gt; CEO | JackRabbit Wireless LLC<br>
&gt;<br>
<br>
<br>
<br>
-- <br>
This song goes out to all the folk that thought Stadia would work:<br>
<a href=3D"https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-=
6981366665607352320-FXtz" rel=3D"noreferrer" target=3D"_blank">https://www.=
linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXt=
z</a><br>
Dave T=C3=A4ht CEO, TekLibre, LLC<br>
</blockquote></div>
</blockquote></div>

--00000000000052c0b705eb63b52c--