[Bloat] UniFi Dream Machine Pro

Dave Taht dave.taht at gmail.com
Sat Jan 23 18:19:27 EST 2021


On Fri, Jan 22, 2021 at 11:43 AM Stuart Cheshire <cheshire at apple.com> wrote:
>
> On 20 Jan 2021, at 07:55, Dave Taht <dave.taht at gmail.com> wrote:
>
> > This review, highly recommending this router on the high end
> >
> > https://www.increasebroadbandspeed.co.uk/best-router-2020
> >
> > also states that the sqm implementation has been dumbed down significantly and can only shape 800Mbit inbound. Long ago we did a backport of cake to the other ubnt routers mentioned in the review, has anyone tackled this one?

It's nice to see the "godfather" of our effort back here again. I do
periodically re-read http://www.stuartcheshire.org/rants/latency.html

At the risk of perhaps over-lecturing for a wider audience:

> According to the UniFi Dream Machine Pro data sheet, it has a 1.7 GHz quad-core ARM Cortex-A57 processor and achieves the following throughput numbers (downlink direction):

>
> 8.0 Gb/s with Deep Packet Inspection

I'm always very dubious of this kind of number against anything but
single, large bulk flows. Also, if the fast path is not entirely
offloaded, performance goes to hell.

> 3.5 Gb/s with DPI + Intrusion Detection
> 0.8 Gb/s with IPsec VPN

Especially here. I should also note that the rapidly deploying
WireGuard VPN outperforms IPsec in just about every way... in
software.

>
> <https://dl.ubnt.com/ds/udm-pro>
>
> Is implementing CoDel queueing really 10x more burden than running “Ubiquiti’s proprietary Deep Packet Inspection (DPI) engine”? Is CoDel 4x more burden than Ubiquiti’s IDS (Intrusion Detection System) and IPS (Intrusion Prevention System)?

These questions, given that the actual fq_codel overhead is nearly
immeasurable and the code complexity far less than either of these,
are the makings of a very good rant targeted at a hw offload maker. :)

Hashing is generally "free" in hw, and selecting a different queue
can be done with a single indirection.
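
To put that in concrete terms, very roughly and with made-up names
(this is not the kernel code), the per-packet queue-selection work in
an fq scheme is about this much:

    #include <stdint.h>

    #define NUM_QUEUES 1024   /* fq_codel's default number of flow queues */

    struct flow_queue { int backlog; /* per-flow state lives here */ };
    static struct flow_queue queues[NUM_QUEUES];

    /* Illustrative sketch only: given a flow hash that the hardware or
     * driver has usually already computed, picking the queue is one
     * modulo plus one array indirection. */
    static struct flow_queue *pick_queue(uint32_t flow_hash)
    {
            return &queues[flow_hash % NUM_QUEUES];
    }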

Cake has a lot of ideas that would benefit from actual hw offloads. A
4- or 8-way set-associative cache is a common IP hw block....

> Is CoDel really the same per-packet cost as doing full IPsec VPN decryption on every packet? I realize the IPsec VPN decryption probably has

No.

>some assist from crypto-specific ARM instructions or hardware, but even so, crypto operations are generally considered relatively expensive. If this device can do 800 Mb/s throughput doing IPsec VPN decryption for every packet, it feels like it ought to be able to do a lot better than that just doing CoDel queueing calculations for every packet.

yep.

The only even semi-costly codel function is an invsqrt, which can be
implemented in 3k gates or so in hw. In software the Newton
approximation is nearly immeasurable, and accurate enough. (We went
to great lengths to make it more accurate in cake, to no observable
effect.)
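
If anyone wants to see how small it is, here is a floating-point
sketch of the idea (the kernel's codel_Newton_step does the same
thing in fixed point; the names and numbers below are just for
illustration):

    #include <math.h>
    #include <stdio.h>

    /* One Newton-Raphson step refining an estimate x of 1/sqrt(count):
     *   x' = x * (3 - count * x * x) / 2
     * codel uses the result to schedule the next drop at roughly
     *   next_drop = first_drop + interval / sqrt(count)   */
    static double newton_inv_sqrt_step(double x, double count)
    {
            return x * (1.5 - 0.5 * count * x * x);
    }

    int main(void)
    {
            double count = 7.0;
            double x = 1.0 / sqrt(count - 1.0); /* previous estimate as the seed */

            x = newton_inv_sqrt_step(x, count);
            printf("approx 1/sqrt(%g) = %f (exact %f)\n",
                   count, x, 1.0 / sqrt(count));
            return 0;
    }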

Codel is not O(1). A nice thing about fq is that you can be codeling
in parallel, or, if you are acting on a single queue at a time, you
can short-circuit the overload section of codel to give up and
deliver a packet when you cannot meet the deadline... or... with a
very small fifo queue in front of the wire (say 3k bytes at a gbit),
the odds are extremely good (millions to one? ... a lot. I worked it
out once under various assumptions...) that no matter how many
packets you need to drop at once, you can still run at line rate at a
reasonable clock. BQL manages this short fifo in Linux, but there it
tends to be much larger, inflated by TSO offloads.

You really don't need to drop or mark a lot of packets to achieve good
congestion control at high rates. But you know that. :)

Most "hw" offloads are actually offloads to a specialized cpu, so
whether codel is O(1) or not isn't much of a problem there.

> Is this just a software polish issue, that could be remedied by doing some performance optimization on the CoDel code?

I don't know how to make it faster. The Linux version is about as
optimized as we know how to make it. A P4 implementation exists.

As everyone points out later in this thread, it's the software
*shaper* (on inbound especially) that is the real burden. The token
bucket (TB) has been offloaded to hw. The QCA offloaded version has
both the token bucket and fq_codel in there.

Also, hw shaping outbound is vastly cheaper with a programmable
completion interrupt: tell 1Gbit hardware to interrupt at half the
rate and, bang, it's 500Mbit. (This is implemented in several Intel
ethernet cards.)

Inbound shaping in sw is another one of those "it's the latency,
stupid" things. It's not so much the clock rate as how fast the cpu
can reschedule the thread, a number that doesn't scale much with
clock, but with cache and pipeline depth.

One reason why I adore the Mill cpu design is that it can context
switch in 5 clocks, where x86 takes 1000....


> It’s also possible that the information in the review might simply be wrong -- it’s hard to measure throughput numbers in excess of 1 Gb/s unless you have both a client and a server connected faster than that in order to run the test. In other words, gigabit Ethernet is out, so both client and server would have to be connected via the 10 Gb/s SFP+ ports (of which the UDM-PRO has just two -- one in the upstream direction, and one in the downstream direction). Speaking for myself personally, I don’t have any devices with 10 Gb/s capability, and my Internet connection isn’t above 1 Gb/s either, so as long as it can get reasonably close to 1 Gb/s that’s more than I need (or could use) right now.

As most 1Gbit ISP links are still quite overbuffered (over 120ms was
what I'd measured on Comcast, 60ms on Sonic fiber, both a few years
back), vs a total induced latency of *0-5ms* with sqm at 800Mbit, it
generally seems to me that inbound shaping to something close to a
gbit is a win for videoconferencing, gaming, VR, jacktrip and other
latency-sensitive traffic.

On a 35Mbit upload, fq_codel or cake is *loafing*. If we were to get
around to doing a backport of cake to this device, I'd probably go
with htb+fq_codel on the download and cake on the upload, where
cake's ack-filtering and per-host/per-flow fq would be ideal.

(this, btw, is what I do presently)
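
For anyone who wants to try the same split by hand rather than via
the sqm scripts, it looks roughly like the following. The interface
names and the 35Mbit/800Mbit figures are placeholders for your own
link; treat this as a sketch, not a tested recipe.

    # Egress: cake shaped to the uplink, with ack-filtering and
    # per-host/per-flow fairness.
    tc qdisc replace dev eth0 root cake bandwidth 35mbit ack-filter \
            dual-srchost nat

    # Ingress: redirect to an ifb device, then htb + fq_codel shaped
    # just below line rate.
    ip link add ifb0 type ifb
    ip link set ifb0 up
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol all matchall \
            action mirred egress redirect dev ifb0
    tc qdisc add dev ifb0 root handle 1: htb default 10
    tc class add dev ifb0 parent 1: classid 1:10 htb rate 800mbit
    tc qdisc add dev ifb0 parent 1:10 fq_codel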

Ack-filtering at these asymmetries is a pretty big win for retaining
a high download speed with competing upload traffic; see
https://blog.cerowrt.org/post/ack_filtering/. You cannot get anything
even close to a steady gbit down with competing uplink traffic on the
cable modems I've tested to date.

> Stuart Cheshire
>


-- 
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

dave at taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729

