From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <robert.chacon@jackrabbitwireless.com>
Received: from mail-ed1-x534.google.com (mail-ed1-x534.google.com
 [IPv6:2a00:1450:4864:20::534])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by lists.bufferbloat.net (Postfix) with ESMTPS id 7BA333B2A4
 for <libreqos@lists.bufferbloat.net>; Mon, 17 Oct 2022 16:34:10 -0400 (EDT)
Received: by mail-ed1-x534.google.com with SMTP id e18so17718423edj.3
 for <libreqos@lists.bufferbloat.net>; Mon, 17 Oct 2022 13:34:10 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=jackrabbitwireless.com; s=google;
 h=cc:to:subject:message-id:date:from:in-reply-to:references
 :mime-version:from:to:cc:subject:date:message-id:reply-to;
 bh=a68ESucLctxPQ76u8KE7IECxXlVyAW8jAdQ9cTOcnvE=;
 b=O3e49UpmmI2P5d7ixtvEu51kjuSOnLgg7aME0CpxBpDPgzJeLlb+HKN6rW6EPpczCB
 GGAnqxgpVCGPQPDxAFNyCOPm+XsdpkKdH0ixgvyHRHTSqxkqPkqw4/KIm5BOK/tJLBiq
 mDZ3PTRZDTq4hKQClPrdFHUwNOtmpFSoU49YU=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=cc:to:subject:message-id:date:from:in-reply-to:references
 :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
 :reply-to;
 bh=a68ESucLctxPQ76u8KE7IECxXlVyAW8jAdQ9cTOcnvE=;
 b=DBycWTd4OnXliXuEdrYqyApBlnf0T/e35SkKJ1IdL5KznvWhBOrvoyEGnxmrhNPmAR
 ykn9d7YBdd5MHjDU0aLAHnIeU796efUmipQHwY67E9tuWsN6g75zoTAucDmyYYERjguL
 8oZAZWPGPhMl1EmpySv6uzHmBKpntQamplCV85jTaoWmGizX9LRMZ4KxDQUgX7gwyTAY
 W6ZiKY23JK0pvT1PFMlIP/clQAjFSYmOblUAqZXBI/NIdOo8UOtpHmJ9Rv5AP4L8/rFj
 wegStWm7mPEMtR6BB0BY7qz1PED0HWP9TqaI3B8C9Qpmy3Lrw3X94zOLHv3Cl7nCroJi
 k+6w==
X-Gm-Message-State: ACrzQf2wOA0ihpTE1/skweAZlBti1Yd+9nJi+qgmGusgNWOm6HHw/KWI
 BilWInBmfL7mYZ/jxAVQJHck0nfQIGkelstRhLjj36xNdhT4IQ==
X-Google-Smtp-Source: AMsMyM5lRLSdVNZ1jWVs2kmnVYrHQ45GdpSE4cYmq8nSfCJ725acSKyTrUKu/ySLuWlS12DpNGM41/zkX+J4q5Y2pXk=
X-Received: by 2002:a05:6402:11d4:b0:45d:8733:7cc2 with SMTP id
 j20-20020a05640211d400b0045d87337cc2mr7174134edw.409.1666038849234; Mon, 17
 Oct 2022 13:34:09 -0700 (PDT)
MIME-Version: 1.0
References: <CA+erpM5CNocpTnxNpTyEifaLv2P-ZbRXASUxS7iYr8LgCRgRNA@mail.gmail.com>
 <87bkqatu61.fsf@toke.dk>
 <759c25c6fd54dceccc00eada5ccf5358d2d1c20c.camel@kau.se>
 <CA+erpM4RN0QKfq4E2PXfPjz9P3cq_6fMJL1_GKLt_SQ3GGXPFw@mail.gmail.com>
In-Reply-To: <CA+erpM4RN0QKfq4E2PXfPjz9P3cq_6fMJL1_GKLt_SQ3GGXPFw@mail.gmail.com>
From: =?UTF-8?Q?Robert_Chac=C3=B3n?= <robert.chacon@jackrabbitwireless.com>
Date: Mon, 17 Oct 2022 14:33:57 -0600
Message-ID: <CAOZyJosbcuHO0SYqraSAddJw_9FYtp_qUCh65Y9Vo6MHOeG5_g@mail.gmail.com>
To: Herbert Wolverson <herberticus@gmail.com>
Cc: Simon Sundberg <Simon.Sundberg@kau.se>, 
 "libreqos@lists.bufferbloat.net" <libreqos@lists.bufferbloat.net>
Content-Type: multipart/alternative; boundary="000000000000ca82c305eb40e533"
Subject: Re: [LibreQoS] In BPF pping - so far
X-BeenThere: libreqos@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Many ISPs need the kinds of quality shaping cake can do
 <libreqos.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/libreqos>,
 <mailto:libreqos-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/libreqos>
List-Post: <mailto:libreqos@lists.bufferbloat.net>
List-Help: <mailto:libreqos-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/libreqos>,
 <mailto:libreqos-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Mon, 17 Oct 2022 20:34:10 -0000

--000000000000ca82c305eb40e533
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hey Herbert,

Fantastic work! Super exciting to see this coming together, especially so
quickly.
I'll test it soon.
I understand and agree with your decision to omit certain features (ICMP
tracking, DNS tracking, etc.) to optimize performance for our use case. Like
you said, folding pping into xdp-cpumap-tc is sort of the only way right
now to get the functionality without a performance hit. Otherwise there
would be a lot of redundant packet parsing and lost throughput for an ISP's
use. Hopefully, long term, there will be a way to keep the projects
independent but interoperable, with a plugin system of some kind.

By the way, I'm making some headway on LibreQoS v1.3. I'm focusing on
optimizations for high subscriber counts (8000+ subs) as well as stateful
changes to the queue structure.
I'm working to set up a physical lab to test high-throughput and
high-client-count scenarios.
When testing beyond ~32,000 filters we get "no space left on device" from
xdp-cpumap-tc, which I think relates to the BPF map size limitation you
mentioned. Maybe in the coming months we can take a look at that.

Anyway, great work on the cpumap-pping program! Excited to see more on this.

Thanks,
Robert

On Mon, Oct 17, 2022 at 12:45 PM Herbert Wolverson via LibreQoS <
libreqos@lists.bufferbloat.net> wrote:

> Hey,
>
> My current (unfinished) progress on this is now available here:
> https://github.com/thebracket/cpumap-pping-hackjob
>
> I mean it about the warnings: this isn't at all stable or debugged, and I
> can't promise that it won't unleash the nasal demons
> (to use a popular C++ phrase). The name is descriptive! ;-)
>
> With that said, I'm pretty happy so far:
>
> * It runs only on the classifier - which xdp-cpumap-tc has nicely shunted
> onto a dedicated CPU. It has to run on both
>   the inbound and outbound classifiers, since otherwise it would only see
> half the conversation.
> * It does assume that your ingress and egress CPUs are mapped to the same
> interface; I do that anyway in BracketQoS. Not doing
>   that opens up a potential world of pain, since writes to the shared maps
> would require a locking scheme. Too much locking, and you lose all of the
> benefit of using multiple CPUs to begin with.
> * It is pretty wasteful of RAM, but most of the shaper systems I've worked
> with have lots of it.
> * I've been gradually removing features that I don't want for BracketQoS.
> A hypothetical future "useful to everyone" version wouldn't do that.
> * Rate limiting is working, but I removed the requirement for a shared
> configuration provided from userland - so right now it's always set to
> report at 1 second intervals per stream.
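
A rough sketch of the per-stream rate limit as I read it from the description above, written as plain userspace Python rather than BPF. The class name, the stream key, and the nanosecond clock are all hypothetical choices for illustration; only the fixed 1 second interval per stream comes from the text.

```python
class PerStreamRateLimiter:
    """Allow at most one report per stream per interval."""

    def __init__(self, interval_ns=1_000_000_000):
        # 1 second, matching the fixed interval mentioned above.
        self.interval_ns = interval_ns
        # Hypothetical stream key -> time of the last report.
        self.last_report_ns = {}

    def should_report(self, stream, now_ns):
        last = self.last_report_ns.get(stream)
        if last is not None and now_ns - last < self.interval_ns:
            return False  # still inside the interval: skip this sample
        self.last_report_ns[stream] = now_ns
        return True
```

Each stream keeps its own timestamp, so a busy stream cannot starve a quiet one of reports.
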
>
> My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and
> a "shaper" VM in between running a slightly hacked-up LibreQoS.
> iperf from "client" to "world" (with Libre set to allow 10 Gbit/s max, via
> a cake/HTB queue setup) is around 5 Gbit/s at present, on my
> test PC (the host is a Core i7, 12th gen, 12 cores - 64 GB RAM and fast
> SSDs).
>
> Output currently consists of debug messages reading:
>   cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
>   cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
>   cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
>   cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
>   cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
>   cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
>   cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
>   cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
>   cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
>   cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
>   cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
>   cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
>   cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
>   cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
>   cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
>   cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event
>
> The times haven't been tweaked yet. The (5,1) is tc handle major/minor,
> allocated by the xdp-cpumap parent.
> I get pretty low latency between VMs; I'll set up a test with some
> real-world data very soon.
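
Until the perfmap output is ported, something like the following could pull the tc handle and the reported number out of those debug lines. This is only a sketch: the regular expression is my guess at the format from the sample output above, it assumes the major/minor are printed in decimal, and it makes no claim about the unit of the trailing value. The function name and the dictionary keys are hypothetical.

```python
import re

# Pattern guessed from the sample debug output; not a stable format.
EVENT_RE = re.compile(
    r"\s(?P<ts>\d+\.\d+): bpf_trace_printk: \(tc\) "
    r"Send performance event "
    r"\((?P<major>\d+),(?P<minor>\d+)\), (?P<value>\d+)"
)


def parse_perf_event(line):
    """Return timestamp, tc handle and value, or None for other lines."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": float(m["ts"]),
        "handle": (int(m["major"]), int(m["minor"])),
        "value": int(m["value"]),  # unit not stated in the thread
    }
```

Lines such as "Flow open event" and "Send flow event" simply return None, so the same loop can read the whole debug pipe.
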
>
> I plan to keep hacking away, but feel free to take a peek.
>
> Thanks,
> Herbert
>
> On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg <Simon.Sundberg@kau.se>
> wrote:
>
>> Hi, thanks for adding me to the conversation. Just a couple of quick
>> notes.
>>
>> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
>> > [ Adding Simon to Cc ]
>> >
>> > Herbert Wolverson via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>> >
>> > > Hey,
>> > >
>> > > I've had some pretty good success with merging xdp-pping (
>> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
>> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
>> > >
>> > > I ported over most of the xdp-pping code, and then changed the entry
>> > > point and packet parsing code to make use of the work already done in
>> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need
>> > > to do it twice). Then I switched the maps to per-cpu maps, and had to
>> > > pin them - otherwise the two tc instances don't properly share data.
>> > >
>>
>> I guess the xdp-cpumap-tc ensures that the same flow is processed on
>> the same CPU core at both ingress and egress. Otherwise, if a flow may
>> be processed by different cores on ingress and egress, the per-CPU maps
>> will not really work reliably, as each core will have a different view
>> of the state of the flow (whether there's been a previous packet with a
>> certain TSval from that flow, etc.).
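
To make that failure mode concrete, here is a toy model in Python, not BPF. It assumes the usual pping idea of timestamping a TCP TSval on the way out and matching it against the echoed TSecr on the way back; the per-CPU map is imitated with one dictionary per CPU, and every name here is hypothetical.

```python
NUM_CPUS = 2

# One private table per CPU, standing in for a per-CPU BPF map.
per_cpu_state = [dict() for _ in range(NUM_CPUS)]


def on_egress(cpu, flow, tsval, now_ns):
    """Remember when this TSval was first seen leaving."""
    per_cpu_state[cpu].setdefault((flow, tsval), now_ns)


def on_ingress(cpu, flow, tsecr, now_ns):
    """Return an RTT if this CPU saw the matching TSval, else None."""
    sent_ns = per_cpu_state[cpu].pop((flow, tsecr), None)
    return None if sent_ns is None else now_ns - sent_ns
```

If egress runs on CPU 0 and the reply is handled on CPU 1, `on_ingress` finds nothing and the sample is silently lost; handled on CPU 0 it yields the RTT. That is why the flow-to-CPU pinning matters here.
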
>>
>> Furthermore, if a flow is always processed on the same core (on both
>> ingress and egress) I think per-CPU maps may be a bit wasteful on
>> memory. From my understanding the keys for per-CPU maps are still
>> shared across all CPUs, it's just that each CPU gets its own value. So
>> all CPUs will then have their own data for each flow, but it's only the
>> CPU processing the flow that will have any relevant data for the flow
>> while the remaining CPUs will just have an empty state for that flow.
>> Under the same assumption that packets within the same flow are always
>> processed on the same core, there should generally not be any
>> concurrency issues with having a global (non-per-CPU) map either, as
>> packets from the same flow cannot be processed concurrently then (and
>> thus there is no concurrent access to the same value in the map). I am
>> however still very unclear on whether there's any considerable
>> performance difference between global and per-CPU maps if the same key
>> is not accessed concurrently.
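
A back-of-the-envelope way to put numbers on the memory point. This is a sketch under assumptions: it counts value storage only (keys and per-element overhead are ignored), it assumes values are padded to 8 bytes and that a per-CPU map keeps one copy per CPU, and the entry count and value size in the usage note are invented for illustration.

```python
def round_up(n, align=8):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def map_value_bytes(entries, value_size, num_cpus, per_cpu):
    """Approximate bytes used for values only."""
    copies = num_cpus if per_cpu else 1
    return entries * round_up(value_size) * copies
```

With a hypothetical 16384 flows, a 100-byte flow state and 12 CPUs, `map_value_bytes(16384, 100, 12, per_cpu=False)` gives 1703936 bytes, and the per-CPU variant is 12 times that, with 11 of the 12 copies holding empty state if each flow stays on one core.
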
>>
>> > > Right now, output is just stubbed - I've still got to port the
>> > > perfmap output code. Instead, I'm dumping a bunch of extra data to
>> > > the kernel debug pipe, so I can see roughly what the output would
>> > > look like.
>> > >
>> > > With debug enabled and just logging I'm now getting about 4.9
>> > > Gbits/sec on single-stream iperf between two VMs (with a shaper VM
>> > > in the middle). :-)
>> >
>> > Just FYI, that "just logging" is probably the biggest source of
>> > overhead, then. What Simon found was that sending the data from kernel
>> > to userspace is one of the most expensive bits of epping, at least when
>> > the number of data points goes up (which it does as additional flows
>> > are added).
>>
>> Yeah, reporting individual RTTs when there's lots of them (you may get
>> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
>> direct overhead from the tool itself, but also becomes demanding for
>> whatever you use all those RTT samples for (i.e. you need to log, parse,
>> analyze etc. a very large number of RTTs). One way to deal with that is
>> of course to just apply some sort of sampling (the -r/--rate-limit and
>> -R/--rtt-rate
>> >
>> > > So my question: how would you prefer to receive this data? I'll have
>> > > to write a daemon that provides userspace control (periodic cleanup
>> > > as well as reading the performance stream), so the world's kinda our
>> > > oyster. I can stick to Kathie's original format (and dump it to a
>> > > named pipe, perhaps?), a condensed format that only shows what you
>> > > want to use, an efficient binary format if you feel like parsing
>> > > that...
>> >
>> > It would be great if we could combine efforts a bit here so we don't
>> > fork the codebase more than we have to. I.e., if "upstream" epping and
>> > whatever daemon you end up writing can agree on data format etc that
>> > would be fantastic! Added Simon to Cc to facilitate this :)
>> >
>> > Briefly what I've discussed before with Simon was to have the ability to
>> > aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
>> > utility periodically pull them out. What we discussed was doing this
>> > using an LPM map (which is not in that PR yet). The idea would be that
>> > userspace would populate the LPM map with the keys (prefixes) they
>> > wanted statistics for (in LibreQoS context that could be one key per
>> > customer, for instance). Epping would then do a map lookup into the LPM,
>> > and if it gets a match it would update the statistics in that map entry
>> > (keeping a histogram of latency values seen, basically). Simon's PR
>> > below uses this technique where userspace will "reset" the histogram
>> > every time it loads it by swapping out two different map entries when it
>> > does a read; this allows you to control the sampling rate from
>> > userspace, and you'll just get the data since the last time you polled.
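
For my own understanding, here is how I picture that scheme, modelled entirely in userspace Python. It is not the code from the PR: the class, the bucket width and the bucket count are hypothetical, the longest-prefix match is a linear scan rather than an LPM trie, and in the real design the kernel side would do the recording while userspace does the swap.

```python
import ipaddress


class LatencyAggregator:
    """Per-prefix latency histograms with a two-buffer reset on read."""

    def __init__(self, prefixes, bucket_ms=10, buckets=10):
        self.bucket_ms = bucket_ms
        self.buckets = buckets
        self.nets = [ipaddress.ip_network(p) for p in prefixes]
        # Two histograms per prefix; `active` selects the one written to.
        self.hist = {n: [[0] * buckets, [0] * buckets] for n in self.nets}
        self.active = 0

    def _longest_match(self, addr):
        ip = ipaddress.ip_address(addr)
        hits = [n for n in self.nets if ip in n]
        return max(hits, key=lambda n: n.prefixlen, default=None)

    def record(self, addr, rtt_ms):
        net = self._longest_match(addr)
        if net is None:
            return False  # prefix not registered: nothing is recorded
        idx = min(int(rtt_ms // self.bucket_ms), self.buckets - 1)
        self.hist[net][self.active][idx] += 1
        return True

    def poll(self):
        """Swap buffers, then return and clear the one just retired."""
        retired = self.active
        self.active ^= 1
        out = {}
        for net, pair in self.hist.items():
            out[str(net)] = pair[retired]
            pair[retired] = [0] * self.buckets
        return out
```

Each `poll()` returns only what was recorded since the previous poll, so the polling interval effectively sets the sampling rate, and addresses outside the registered prefixes cost a failed lookup and nothing more.
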
>>
>> Thanks, Toke, for summarizing both the current state and the plan going
>> forward. I will just note that this PR (and all my other work with
>> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less
>> on hold for a couple of weeks right now as I'm trying to finish up a
>> paper.
>>
>> > I was thinking that if we all can agree on the map format, then your
>> > polling daemon could be one userspace "client" for that, and the epping
>> > binary itself could be another; but we could keep compatibility between
>> > the two, so we don't duplicate effort.
>> >
>> > Similarly, refactoring of the epping code itself so it can be plugged
>> > into the cpumap-tc code would be a good goal...
>>
>> Should probably do that...at some point. In general I think it's a bit
>> of an interesting problem to think about how to chain multiple XDP/tc
>> programs together in an efficient way. Most XDP and tc programs will do
>> some amount of packet parsing and when you have many chained programs
>> parsing the same packets this obviously becomes a bit wasteful. At the
>> same time it would be nice if one didn't need to manually merge
>> multiple programs together into a single one like this to get rid of
>> this duplicated parsing, or at least make that process of merging those
>> programs as simple as possible.
>>
>>
>> > -Toke
>> >
>> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>>
>>
>


--=20
Robert Chacón
CEO | JackRabbit Wireless LLC <http://jackrabbitwireless.com>

--000000000000ca82c305eb40e533--