From: Dave Taht
Date: Sat, 23 Jan 2021 15:19:27 -0800
To: Stuart Cheshire
Cc: bloat
Subject: Re: [Bloat] UniFi Dream Machine Pro

On Fri, Jan 22, 2021 at 11:43 AM Stuart Cheshire wrote:
>
> On 20 Jan 2021, at 07:55, Dave Taht wrote:
>
> > This review, highly recommending this router on the high end
> >
> > https://www.increasebroadbandspeed.co.uk/best-router-2020
> >
> > also states that the sqm implementation has been dumbed down
> > significantly and can only shape 800Mbit inbound. Long ago we did a
> > backport of cake to the other ubnt routers mentioned in the review;
> > has anyone tackled this one?

It's nice to see the "godfather" of our effort back here. I still
re-read http://www.stuartcheshire.org/rants/latency.html periodically,
at the risk, perhaps, of over-lecturing a wider audience.

> According to the UniFi Dream Machine Pro data sheet, it has a 1.7 GHz
> quad-core ARM Cortex-A57 processor and achieves the following
> throughput numbers (downlink direction):
>
> 8.0 Gb/s with Deep Packet Inspection

I'm always very dubious of these kinds of numbers against anything but
single large, bulk flows. And if the fast path is not entirely
offloaded, performance goes to hell.

> 3.5 Gb/s with DPI + Intrusion Detection
> 0.8 Gb/s with IPsec VPN

Especially here. I should also note that the rapidly deploying
WireGuard VPN outperforms IPsec in just about every way... in software.
> Is implementing CoDel queueing really 10x more burden than running
> "Ubiquiti's proprietary Deep Packet Inspection (DPI) engine"? Is
> CoDel 4x more burden than Ubiquiti's IDS (Intrusion Detection System)
> and IPS (Intrusion Prevention System)?

These questions, given that the actual fq_codel overhead is nearly
immeasurable and the code complexity much less than either of those,
are the makings of a very good rant targeted at a hw offload maker. :)

Hashing is generally "free", and in hw, selecting a different queue can
be done with a single indirection. Cake has a lot of ideas that would
benefit from actual hw offloads; a 4- or 8-way associative cache is a
common IP hw block...

> Is CoDel really the same per-packet cost as doing full IPsec VPN
> decryption on every packet?

No.

> I realize the IPsec VPN decryption probably has some assist from
> crypto-specific ARM instructions or hardware, but even so, crypto
> operations are generally considered relatively expensive. If this
> device can do 800 Mb/s throughput doing IPsec VPN decryption for
> every packet, it feels like it ought to be able to do a lot better
> than that just doing CoDel queueing calculations for every packet.

Yep. The only even semi-costly codel function is an invsqrt, which can
be implemented in 3k gates or so in hw. In software, the Newton
approximation is nearly immeasurable, and accurate enough. (We went to
great lengths to make it more accurate in cake, to no observable
effect.)

Codel is not O(1). But a nice thing about fq is that you can be
codeling the queues in parallel. Or, if you are acting on a single
queue at a time, you can short-circuit the overload section of codel to
give up and deliver a packet when you cannot meet the deadline. Or...
using a very small fifo in front of the wire (say 3k bytes at a gbit),
the odds are extremely good (millions to one? I worked it out once with
various assumptions) that no matter how many packets you need to drop
at once, you can still run at line rate at a reasonable clock. BQL
manages this short fifo in linux, but there it tends to be much larger,
inflated by TSO offloads.

You really don't need to drop or mark a lot of packets to achieve good
congestion control at high rates. But you know that. :)

Most "hw" offloads are actually offloads to a specialized cpu, so
whether codel is O(1) or not isn't much of a problem there.

> Is this just a software polish issue, that could be remedied by doing
> some performance optimization on the CoDel code?

I don't know how to make it faster; the linux version is about as
optimized as we know how, and a P4 implementation exists. As everyone
points out later in this thread, it's the software *shaper* (on inbound
especially) that is the real burden. The token bucket has been
offloaded to hw before; the QCA offloaded version has both the token
bucket and fq_codel in there.

Hw shaping outbound is also vastly cheaper with a programmable
completion interrupt: tell 1Gbit hardware to interrupt at half the
rate, and bang, it's a 500Mbit shaper. (This is implemented in several
intel ethernet cards.)

Inbound shaping in sw is another one of the "it's the latency, stupid"
things. It's not so much the clock rate as how fast the cpu can
reschedule the thread, a number that doesn't scale much with clock, but
with cache and pipeline depth. One reason why I adore the mill cpu
design is that it can context switch in 5 clocks, where x86 takes
1000...

> It's also possible that the information in the review might simply be
> wrong -- it's hard to measure throughput numbers in excess of 1 Gb/s
> unless you have both a client and a server connected faster than that
> in order to run the test. In other words, gigabit Ethernet is out, so
> both client and server would have to be connected via the 10 Gb/s
> SFP+ ports (of which the UDM-PRO has just two -- one in the upstream
> direction, and one in the downstream direction).
> Speaking for myself personally, I don't have any devices with 10 Gb/s
> capability, and my Internet connection isn't above 1 Gb/s either, so
> as long as it can get reasonably close to 1 Gb/s that's more than I
> need (or could use) right now.

As most 1Gbit ISP links are still quite overbuffered (over 120ms of
induced latency was what I'd measured on comcast, 60ms on sonic fiber,
both a few years back), vs a total induced latency of *0-5ms* with sqm
at 800Mbit, it generally seems to me that inbound shaping to something
close to a gbit is a win for videoconferencing, gaming, vr, jacktrip,
and other latency-sensitive traffic. On a 35Mbit upload, fq_codel or
cake are *loafing*.

If we were to get around to doing a backport of cake to this device,
I'd probably go with htb+fq_codel on the download and cake on the
upload, where the ack-filtering and per-host/per-flow fq of cake would
be ideal. (This, btw, is what I do presently.) Ack-filtering at these
asymmetries is a pretty big win for retaining a high download speed
with competing upload traffic:

https://blog.cerowrt.org/post/ack_filtering/

You cannot do anything even close to a steady gbit down with competing
uplink traffic on the cable modems I've tested to date.

> Stuart Cheshire

-- 
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

dave@taht.net CTO, TekLibre, LLC Tel: 1-831-435-0729
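For the curious, the "nearly immeasurable" invsqrt cost is easy to see
in miniature. This is my own float-based sketch, not the kernel's
fixed-point code: codel schedules the count-th drop interval/sqrt(count)
after the previous one, and instead of computing a real square root it
caches an estimate of 1/sqrt(count) and refines it with one Newton step
each time count changes.

```python
INTERVAL_MS = 100.0  # codel's default interval

def newton_invsqrt_step(x, count):
    """One Newton-Raphson refinement of x toward 1/sqrt(count):
    x' = x * (3 - count * x^2) / 2."""
    return x * (3.0 - count * x * x) / 2.0

def next_drop_spacing(count, cached_x):
    """Spacing (ms) until the count-th drop, plus the refined estimate
    to cache for next time. A handful of multiplies, no sqrt -- which
    is why this shows up nowhere in a profile."""
    x = newton_invsqrt_step(cached_x, count)
    return INTERVAL_MS * x, x

# Drops get closer together roughly as 1/sqrt(count); the single-step
# estimate lags a little at small counts and tracks tightly at large ones.
x = 1.0
for count in range(1, 6):
    spacing, x = next_drop_spacing(count, x)
```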
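For anyone wanting to replicate the htb+fq_codel-down / cake-up setup
on a generic linux router, a minimal sketch follows. Device names and
rates are placeholders for your own link; the sqm-scripts package
handles many more details (dscp washing, overhead compensation, etc.):

```shell
# Egress: cake with ack-filtering on an asymmetric 35Mbit upload.
tc qdisc replace dev eth0 root cake bandwidth 35mbit ack-filter

# Ingress: redirect inbound traffic through an ifb device, then shape
# it below the ISP rate with htb + fq_codel so the queue builds here,
# not in the ISP's overbuffered gear.
ip link add ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all matchall \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:10 htb rate 800mbit
tc qdisc add dev ifb0 parent 1:10 fq_codel
```

These commands need root and a kernel with the ifb, cake, and fq_codel
modules; shaping a hair below the measured link rate is what keeps the
induced latency in the 0-5ms range mentioned above.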