From: Herbert Wolverson
Date: Mon, 17 Oct 2022 13:45:31 -0500
To: Simon Sundberg
Cc: libreqos@lists.bufferbloat.net
Subject: Re: [LibreQoS] In BPF pping - so far

Hey,

My current (unfinished) progress on this is now available here:
https://github.com/thebracket/cpumap-pping-hackjob

I mean it about the warnings: this isn't at all stable or debugged, and I
can't promise that it won't unleash the nasal demons (to use a popular C++
phrase). The name is descriptive! ;-)

With that said, I'm pretty happy so far:

* It runs only on the classifier - which xdp-cpumap-tc has nicely shunted
  onto a dedicated CPU. It has to run on both the inbound and outbound
  classifiers, since otherwise it would only see half the conversation.
* It does assume that your ingress and egress CPUs are mapped to the same
  interface; I do that anyway in BracketQoS. Not doing that opens up a
  potential world of pain, since writes to the shared maps would require a
  locking scheme. Too much locking, and you lose all of the benefit of
  using multiple CPUs to begin with.
* It is pretty wasteful of RAM, but most of the shaper systems I've worked
  with have lots of it.
* I've been gradually removing features that I don't want for BracketQoS.
  A hypothetical future "useful to everyone" version wouldn't do that.
* Rate limiting is working, but I removed the requirement for a shared
  configuration provided from userland - so right now it's always set to
  report at 1-second intervals per stream.

My testbed is currently 3 Hyper-V VMs - a simple "client" and "world", and
a "shaper" VM in between running a slightly hacked-up LibreQoS. iperf from
"client" to "world" (with Libre set to allow 10 gbit/s max, via a cake/HTB
queue setup) is around 5 gbit/s at present, on my test PC (the host is a
12th-gen Core i7, 12 cores, 64 GB RAM and fast SSDs).

Output currently consists of debug messages reading:

  cpumap/0/map:4-1371    [000] D..2.   515.399222: bpf_trace_printk: (tc) Flow open event
  cpumap/0/map:4-1371    [000] D..2.   515.399239: bpf_trace_printk: (tc) Send performance event (5,1), 374696
  cpumap/0/map:4-1371    [000] D..2.   515.399466: bpf_trace_printk: (tc) Flow open event
  cpumap/0/map:4-1371    [000] D..2.   515.399475: bpf_trace_printk: (tc) Send performance event (5,1), 247069
  cpumap/0/map:4-1371    [000] D..2.   516.405151: bpf_trace_printk: (tc) Send performance event (5,1), 5217155
  cpumap/0/map:4-1371    [000] D..2.   517.405248: bpf_trace_printk: (tc) Send performance event (5,1), 4515394
  cpumap/0/map:4-1371    [000] D..2.   518.406117: bpf_trace_printk: (tc) Send performance event (5,1), 4481289
  cpumap/0/map:4-1371    [000] D..2.   519.406255: bpf_trace_printk: (tc) Send performance event (5,1), 4255268
  cpumap/0/map:4-1371    [000] D..2.   520.407864: bpf_trace_printk: (tc) Send performance event (5,1), 5249493
  cpumap/0/map:4-1371    [000] D..2.   521.406664: bpf_trace_printk: (tc) Send performance event (5,1), 3795993
  cpumap/0/map:4-1371    [000] D..2.   522.407469: bpf_trace_printk: (tc) Send performance event (5,1), 3949519
  cpumap/0/map:4-1371    [000] D..2.   523.408126: bpf_trace_printk: (tc) Send performance event (5,1), 4365335
  cpumap/0/map:4-1371    [000] D..2.   524.408929: bpf_trace_printk: (tc) Send performance event (5,1), 4154910
  cpumap/0/map:4-1371    [000] D..2.   525.410048: bpf_trace_printk: (tc) Send performance event (5,1), 4405582
  cpumap/0/map:4-1371    [000] D..2.   525.434080: bpf_trace_printk: (tc) Send flow event
  cpumap/0/map:4-1371    [000] D..2.   525.482714: bpf_trace_printk: (tc) Send flow event

The times haven't been tweaked yet. The (5,1) is the tc handle major/minor,
allocated by the xdp-cpumap parent. I get pretty low latency between VMs;
I'll set up a test with some real-world data very soon.

I plan to keep hacking away, but feel free to take a peek.

Thanks,
Herbert

On Mon, Oct 17, 2022 at 10:14 AM Simon Sundberg wrote:

> Hi, thanks for adding me to the conversation. Just a couple of quick
> notes.
>
> On Mon, 2022-10-17 at 16:13 +0200, Toke Høiland-Jørgensen wrote:
> > [ Adding Simon to Cc ]
> >
> > Herbert Wolverson via LibreQoS writes:
> >
> > > Hey,
> > >
> > > I've had some pretty good success with merging xdp-pping (
> > > https://github.com/xdp-project/bpf-examples/blob/master/pping/pping.h )
> > > into xdp-cpumap-tc ( https://github.com/xdp-project/xdp-cpumap-tc ).
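The merge described in the quoted paragraphs above and below - reusing the
packet parse that xdp-cpumap-tc has already done, and switching to pinned
per-CPU maps so the two tc instances share state - looks roughly like the
sketch below on the BPF side. This is a simplified, hypothetical
illustration: the struct fields, map sizing and the helper name are
stand-ins, not the actual code in the hackjob repo or in upstream pping.

  /* Sketch only: illustrative names, not the real cpumap-pping-hackjob code. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #ifndef LIBBPF_PIN_BY_NAME
  #define LIBBPF_PIN_BY_NAME 1      /* pin maps under the bpffs root, by name */
  #endif

  struct flow_key {
          __u32 src_ip;
          __u32 dst_ip;
          __u16 src_port;
          __u16 dst_port;
  };

  struct flow_state {
          __u64 last_seen_ns;       /* timestamp of the last packet in this flow */
          __u64 last_rtt_ns;        /* most recent RTT sample */
  };

  /* Per-CPU hash, pinned by name so the separately loaded ingress and egress
   * tc programs open the same map instead of each creating a private copy.
   * Per-CPU is workable here because xdp-cpumap-tc steers both directions of
   * a flow onto the same CPU. */
  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
          __uint(max_entries, 65536);
          __type(key, struct flow_key);
          __type(value, struct flow_state);
          __uint(pinning, LIBBPF_PIN_BY_NAME);
  } flow_state_map SEC(".maps");

  /* Called from the tc classifier with a key built from the parse that
   * xdp-cpumap-tc has already done, so the headers are not walked twice. */
  static __always_inline void record_sample(struct flow_key *key, __u64 rtt_ns,
                                            __u32 tc_major, __u32 tc_minor)
  {
          struct flow_state *state = bpf_map_lookup_elem(&flow_state_map, key);

          if (!state) {
                  struct flow_state init = {};

                  bpf_map_update_elem(&flow_state_map, key, &init, BPF_NOEXIST);
                  bpf_printk("(tc) Flow open event");
                  return;
          }
          state->last_rtt_ns = rtt_ns;
          state->last_seen_ns = bpf_ktime_get_ns();
          /* Same shape as the debug lines earlier in this mail. */
          bpf_printk("(tc) Send performance event (%u,%u), %llu",
                     tc_major, tc_minor, rtt_ns);
  }

  char _license[] SEC("license") = "GPL";

If the same-CPU assumption ever breaks (Simon's point just below about flows
being handled by different cores), the map would need to become a regular
BPF_MAP_TYPE_HASH, and concurrent updates would then need atomics or a
bpf_spin_lock.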
> > > I ported over most of the xdp-pping code, and then changed the entry
> > > point and packet parsing code to make use of the work already done in
> > > xdp-cpumap-tc (it's already parsed a big chunk of the packet, no need
> > > to do it twice). Then I switched the maps to per-CPU maps, and had to
> > > pin them - otherwise the two tc instances don't properly share data.
>
> I guess xdp-cpumap-tc ensures that the same flow is processed on the same
> CPU core at both ingress and egress. Otherwise, if a flow may be processed
> by different cores on ingress and egress, the per-CPU maps will not really
> work reliably, as each core will have a different view of the state of the
> flow, of whether there's been a previous packet with a certain TSval from
> that flow, etc.
>
> Furthermore, if a flow is always processed on the same core (on both
> ingress and egress), I think per-CPU maps may be a bit wasteful of memory.
> From my understanding the keys for per-CPU maps are still shared across
> all CPUs; it's just that each CPU gets its own value. So all CPUs will
> have their own data for each flow, but it's only the CPU processing the
> flow that will have any relevant data for it, while the remaining CPUs
> will just have an empty state for that flow. Under the same assumption
> that packets within the same flow are always processed on the same core,
> there should generally not be any concurrency issues with a global
> (non-per-CPU) map either, as packets from the same flow cannot be
> processed concurrently then (and thus no concurrent access to the same
> value in the map). I am, however, still very unclear on whether there's
> any considerable performance difference between the global and per-CPU
> map versions if the same key is not accessed concurrently.
>
> > > Right now, output is just stubbed - I've still got to port the perfmap
> > > output code. Instead, I'm dumping a bunch of extra data to the kernel
> > > debug pipe, so I can see roughly what the output would look like.
> > >
> > > With debug enabled and just logging I'm now getting about 4.9 Gbits/sec
> > > on single-stream iperf between two VMs (with a shaper VM in the
> > > middle). :-)
> >
> > Just FYI, that "just logging" is probably the biggest source of
> > overhead, then. What Simon found was that sending the data from kernel
> > to userspace is one of the most expensive bits of epping, at least when
> > the number of data points goes up (which it does as additional flows are
> > added).
>
> Yeah, reporting individual RTTs when there's lots of them (you may get
> upwards of 1000 RTTs/s per flow) is not only problematic in terms of
> direct overhead from the tool itself, but also becomes demanding for
> whatever you use all those RTT samples for (i.e. you need to log, parse,
> analyze etc. a very large number of RTTs). One way to deal with that is of
> course to just apply some sort of sampling (the -r/--rate-limit and
> -R/--rtt-rate options).
>
> > > So my question: how would you prefer to receive this data? I'll have
> > > to write a daemon that provides userspace control (periodic cleanup as
> > > well as reading the performance stream), so the world's kinda our
> > > oyster. I can stick to Kathie's original format (and dump it to a
> > > named pipe, perhaps?), a condensed format that only shows what you
> > > want to use, an efficient binary format if you feel like parsing
> > > that...
> >
> > It would be great if we could combine efforts a bit here so we don't
> > fork the codebase more than we have to.
> > I.e., if "upstream" epping and whatever daemon you end up writing can
> > agree on data format etc., that would be fantastic! Added Simon to Cc
> > to facilitate this :)
> >
> > Briefly, what I've discussed before with Simon was to have the ability
> > to aggregate the metrics in the kernel (WiP PR [0]) and have a userspace
> > utility periodically pull them out. What we discussed was doing this
> > using an LPM map (which is not in that PR yet). The idea would be that
> > userspace would populate the LPM map with the keys (prefixes) it wanted
> > statistics for (in a LibreQoS context that could be one key per
> > customer, for instance). Epping would then do a map lookup into the LPM
> > map, and if it gets a match it would update the statistics in that map
> > entry (keeping a histogram of latency values seen, basically). Simon's
> > PR below uses this technique: userspace "resets" the histogram every
> > time it loads it by swapping out two different map entries when it does
> > a read; this allows you to control the sampling rate from userspace,
> > and you'll just get the data since the last time you polled.
>
> Thanks, Toke, for summarizing both the current state and the plan going
> forward. I will just note that this PR (and all my other work with
> ePPing/BPF-PPing/XDP-PPing/I-suck-at-names-PPing) will be more or less on
> hold for a couple of weeks right now, as I'm trying to finish up a paper.
>
> > I was thinking that if we can all agree on the map format, then your
> > polling daemon could be one userspace "client" for that, and the epping
> > binary itself could be another; that way we keep compatibility between
> > the two and don't duplicate effort.
> >
> > Similarly, refactoring the epping code itself so it can be plugged into
> > the cpumap-tc code would be a good goal...
>
> Should probably do that... at some point. In general I think it's a bit
> of an interesting problem to think about how to chain multiple XDP/tc
> programs together in an efficient way. Most XDP and tc programs do some
> amount of packet parsing, and when you have many chained programs parsing
> the same packets this obviously becomes a bit wasteful. At the same time,
> it would be nice if one didn't need to manually merge multiple programs
> into a single one like this to get rid of the duplicated parsing, or at
> least if the process of merging those programs were as simple as possible.
>
> > -Toke
> >
> > [0] https://github.com/xdp-project/bpf-examples/pull/59
>
> When you send an e-mail to Karlstad University, we will process your
> personal data.
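To make the in-kernel aggregation Toke describes above a little more
concrete, here is a rough sketch of what the BPF side of an LPM-keyed
histogram could look like. It is illustrative only - IPv4-only, a made-up
log2 bucket layout, invented names - and is not the code from Simon's WiP
PR [0].

  /* Sketch of the LPM-trie aggregation idea; not the bpf-examples PR code. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define RTT_BUCKETS 32

  /* LPM trie keys must start with a __u32 prefix length. */
  struct prefix_key {
          __u32 prefixlen;
          __u32 addr;                   /* IPv4 address, network byte order */
  };

  struct rtt_hist {
          __u64 total_samples;
          __u64 buckets[RTT_BUCKETS];   /* bucket i counts RTTs of ~2^i us */
  };

  /* Userspace pre-populates one entry per customer prefix and periodically
   * reads the histograms back (or swaps in a fresh entry to "reset" them,
   * as in the scheme described above). */
  struct {
          __uint(type, BPF_MAP_TYPE_LPM_TRIE);
          __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
          __uint(max_entries, 10240);
          __type(key, struct prefix_key);
          __type(value, struct rtt_hist);
  } rtt_by_prefix SEC(".maps");

  static __always_inline void aggregate_rtt(__u32 daddr, __u64 rtt_us)
  {
          struct prefix_key key = { .prefixlen = 32, .addr = daddr };
          struct rtt_hist *hist = bpf_map_lookup_elem(&rtt_by_prefix, &key);
          __u32 bucket = 0;

          if (!hist)
                  return;         /* address doesn't match any tracked prefix */

          /* log2 bucketing of the RTT value */
          while (rtt_us > 1 && bucket < RTT_BUCKETS - 1) {
                  rtt_us >>= 1;
                  bucket++;
          }
          bucket &= RTT_BUCKETS - 1;    /* keep the verifier happy about bounds */

          __sync_fetch_and_add(&hist->total_samples, 1);
          __sync_fetch_and_add(&hist->buckets[bucket], 1);
  }

  char _license[] SEC("license") = "GPL";

A polling daemon would then iterate over the prefixes it installed, read
each histogram, and compute per-customer percentiles in userspace - no
per-packet events ever cross the kernel/userspace boundary, which is
exactly the overhead Toke and Simon point out above.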