From: Herbert Wolverson
Date: Tue, 8 Nov 2022 10:02:49 -0600
Cc: libreqos
Subject: Re: [LibreQoS] Before/After Performance Comparison (Monitoring Mode)
In-Reply-To: <877d05v825.fsf@toke.dk>

I'd probably go with an HTB queue per target IP group, and not attach a
discipline to it - with only a ceiling set at the top. That'll do truly
minimal shaping, and you can still use cpumap-pping to get the data you want.
(The current branch I'm testing/working on also reports the local IP address,
which I'm finding pretty helpful.) Otherwise, you're going to make building
both tools part of the setup process* and still have to parse IP pairs for
results. Hopefully, there's a decent Python LPM trie out there (to handle
subnets and IPv6) to make that easier.

I'm (obviously!) going to respectfully disagree with Toke on this one.
I didn't dive into cpumap-pping for fun; I tried *really hard* to work with
the original epping/xdp-pping. It's a great tool - really fantastic work.
It's also not really designed for the same purpose.

The original Pollere pping is wonderful, but it isn't going to scale: the way
it ingests packets doesn't spread across multiple CPUs, and having a core
pegged at 100% on a busy shaper box was degrading overall performance. epping
solves the scalability issue wonderfully, and (rightly) remains focused on
giving you a complete report of all of the data accessed while it was
running. If you want to run a monitoring session and see what's going on,
it's a *fantastic* way to do it - serious props there. I benchmarked it at
about 15 gbit/s on single-stream testing, which is *really* impressive (no
other BPF programs active, no shaping).

The first issue I ran into is that stacking XDP programs isn't a well-defined
process. You can make it work, but it gets messy when both programs have
setup/teardown routines. I kinda, sorta managed to get the two running at
once, and it mostly worked. There *really* needs to be an easier way - one
that doesn't run headlong into Ubuntu's lovely "you updated the kernel and
tools, but we didn't think you'd need bpftool so we didn't include it" issue,
or into adjusting scripts until neither program says "oops, there's already
an XDP program here! Bye!" I know that this is a pretty new thing, but the
tooling hasn't really caught up yet to make this a comfortable process. I'm
pretty sure I spent more time trying to run both at once than it took to make
a combined version that sort-of ran. (I had a working version in an
afternoon.)

With the two literally concatenated (but compiled together), it worked - but
there was a noticeable performance cost. That's where the orthogonal design
choices hit: epping/xdp-pping samples everything (it can even go looking for
DNS and ICMP!). A QoE box *really* needs to go out of its way to avoid adding
any latency, otherwise it defeats its own purpose. A representative sample is
really all you need there - while for epping's target use case, a really
detailed sample is exactly what you want. When faced with differing design
goals like that, my response is always to make a tool that very efficiently
does what I need.

Combining the packet parsing** was the obvious low-hanging fruit. It is
faster, but not by very much - I just really hate it when code repeats
itself. It seriously set off my OCD watching both programs find the IP header
offset, determine the protocol (IPv4 vs IPv6), etc. Small performance win.

Bailing out as soon as we determine that we aren't looking at a TCP packet
was a big performance win (see the sketch below). You can achieve the same by
carefully setting up epping's "config", but there's not a lot of point in
keeping the DNS/ICMP code when it's not needed. Still a performance win, and
not needing to maintain a configuration (that will be the same each time)
makes setup easier.

Running by default on the TC (egress) hook rather than XDP is a big win, too
- but only after xdp-cpumap-tc has shunted processing to the appropriate CPU.
Now processing is divided between CPUs, and cache locality is more likely:
the packet we are reading is in the local core's cache when cpumap-pping
reads it, and there's a decent chance it'll still be there (at least in L2)
by the time it gets to the actual queue discipline.
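To make that concrete, here's roughly the shape of the combined parse with
the early bail-out - a simplified sketch attached at the TC egress hook, not
the actual cpumap-pping code (it ignores VLAN tags and IPv6 extension
headers):

/* Simplified sketch - not the actual cpumap-pping parser. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int rtt_sampler_egress(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    __u8 protocol;

    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;

    if (eth->h_proto == bpf_htons(ETH_P_IP)) {
        struct iphdr *iph = (void *)(eth + 1);

        if ((void *)(iph + 1) > data_end)
            return TC_ACT_OK;
        protocol = iph->protocol;
    } else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
        struct ipv6hdr *ip6h = (void *)(eth + 1);

        if ((void *)(ip6h + 1) > data_end)
            return TC_ACT_OK;
        protocol = ip6h->nexthdr;  /* a real parser also walks extension headers */
    } else {
        return TC_ACT_OK;          /* not IPv4/IPv6 - nothing to measure */
    }

    if (protocol != IPPROTO_TCP)
        return TC_ACT_OK;          /* bail out early: no DNS/ICMP handling */

    /* ...TCP timestamp matching / RTT sampling would go here... */
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";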
Changing the reporting mechanism was a really big win, in terms of both
performance and the tool aligning with what's needed:

* Since xdp-cpumap has already done the work to determine that a flow belongs
  in TC handle X:Y - and mapping RTT performance to customer/circuit is
  *exactly* what we're trying to do - it just makes sense to take that value
  and use it as the key for the results.
* Since we don't care about every packet - rather, we want a periodic,
  representative sample - we can use an efficient per-TC-handle circular
  buffer in which to store results.
* In turn, I realized that we could just *sample* rather than continually
  churning the circular buffer. So each flow's buffer has a capacity, and the
  monitor bails out once a flow's buffer is full of RTT results. Really big
  performance win - "return" is a really fast call. :-) (The buffers are
  reset when read; a rough sketch of this appears after the quoted message
  below.)
* Perf maps are great, but I didn't want to require a daemon to sit on the
  perf output, map the results, and re-emit them in a LibreQoS-friendly
  format when a much simpler mechanism gets the same result - without another
  program handling the mmap'd performance data all the time.

So the result is really fast and does exactly what I need. It's not meant to
be "better" than the original; for the original's purpose, it's not great.
For rapidly building QoE metrics on a live shaper box, with absolutely
minimal overhead and a focus on sipping from the firehose rather than trying
to drink it all - it's about right. Philosophically, I've always favored
tools that do exactly what I need.

Likewise, if someone would like to come up with a really good recipe that
runs both rather than a combined program - that'd be awesome. If it can match
the performance of cpumap-pping, I'll happily switch BracketQoS to use it.
You're obviously welcome to any of the code; if it can help the original
projects, that's wonderful. Right now, I don't have the time to come up with
a better way of layering XDP/TC programs!

* - I keep wondering if I shouldn't roll some .deb packages and a
configurator to make setup easier!

** - There *really* should be a standard flow dissector. The Linux traffic
shaper's dissector can handle VLAN tags and an MPLS header. xdp-cpumap-tc
handles VLANs with aplomb and doesn't touch MPLS. epping calls out to the
xdp-project's dissector, which appears to handle VLANs and also doesn't touch
MPLS.

Thanks,
Herbert

On Tue, Nov 8, 2022 at 8:23 AM Toke Høiland-Jørgensen via LibreQoS
<libreqos@lists.bufferbloat.net> wrote:

> Robert Chacón via LibreQoS <libreqos@lists.bufferbloat.net> writes:
>
> > I was hoping to add a monitoring mode which could be used before
> > "turning on" LibreQoS, ideally before the v1.3 release. This way
> > operators can really see what impact it's having on end-user and
> > network latency.
> >
> > The simplest solution I can think of is to implement Monitoring Mode
> > using cpumap-pping as we already do - with plain HTB and leaf classes
> > with no CAKE qdisc applied, and with HTB and leaf class rates set to
> > impossibly high amounts (no plan enforcement). This would allow for
> > before/after comparisons of Nodes (Access Points). My only concern with
> > this approach is that HTB, even with rates set impossibly high, may not
> > be truly transparent. It would be pretty easy to implement, though.
> >
> > Alternatively we could use ePPing
> > <https://github.com/xdp-project/bpf-examples/tree/master/pping> but I
> > worry about throughput and the possibility of latency tracking being
> > slightly different from cpumap-pping, which could limit the utility of
> > a comparison.
> > We'd have to match IPs in a way that's a bit more involved here.
> >
> > Thoughts?
>
> Well, this kind of thing is exactly why I think concatenating the two
> programs (cpumap and pping) into a single BPF program was a mistake:
> those are two distinct pieces of functionality, and you want to be able
> to run them separately, as your "monitor mode" use case shows. The
> overhead of parsing the packet twice is trivial compared to everything
> else those apps are doing, so I don't think the gain is worth losing
> that flexibility.
>
> So I definitely think using the regular epping is the right thing to do
> here. Simon is looking into improving its reporting so it can be
> per-subnet using a user-supplied configuration file for the actual
> subnets, which should hopefully make this feasible. I'm sure he'll chime
> in here once he has something to test and/or with any questions that pop
> up in the process.
>
> Longer term, I'm hoping all of Herbert's other improvements to epping
> reporting/formatting can make it into upstream epping, so LibreQoS can
> just use that for everything :)
>
> -Toke
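Sketch referenced from the reporting-mechanism list above: roughly, the
per-TC-handle sampling works like this - a hash map keyed by the flow's TC
handle holds a small, fixed-capacity set of RTT samples, and the recording
path returns early once a bucket is full. The capacity, map layout, and names
below are illustrative assumptions, not cpumap-pping's actual data
structures.

/* Illustrative sketch only - not cpumap-pping's real map layout. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define RTT_SAMPLES 4                  /* assumed per-handle capacity */

struct rtt_bucket {
    __u32 count;                       /* samples collected since last read */
    __u32 rtt_us[RTT_SAMPLES];         /* RTT samples, in microseconds */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);                /* TC handle (major:minor) for the flow */
    __type(value, struct rtt_bucket);
} rtt_samples SEC(".maps");

/* Called from the TC program once an RTT has been computed for a packet. */
static __always_inline void record_rtt(__u32 tc_handle, __u32 rtt_us)
{
    struct rtt_bucket *b = bpf_map_lookup_elem(&rtt_samples, &tc_handle);

    if (!b) {
        struct rtt_bucket fresh = { .count = 1, .rtt_us = { rtt_us } };

        bpf_map_update_elem(&rtt_samples, &tc_handle, &fresh, BPF_ANY);
        return;
    }

    __u32 idx = b->count;
    if (idx >= RTT_SAMPLES)
        return;                        /* bucket full: "return" is a fast call */

    b->rtt_us[idx] = rtt_us;
    b->count = idx + 1;
    /* Userspace reads the map periodically and resets count to zero. */
}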