[Cake] [Bloat] Really getting 1G out of ISP?

Aaron Wood woody77 at gmail.com
Tue Jul 6 22:53:22 EDT 2021


On Tue, Jul 6, 2021 at 7:26 PM Dave Taht <dave.taht at gmail.com> wrote:

> On Tue, Jul 6, 2021 at 3:32 PM Aaron Wood <woody77 at gmail.com> wrote:
> >
> > I'm running an Odyssey from Seeed Studios (celeron J4125 with dual
> i211), and it can handle Cake at 1Gbps on a single core (which it needs to,
> because OpenWRT's i211 support still has multiple receive queues disabled).
>
> Not clear if that is shaped or not? Line rate is easy on processors of
> that class or better, but shaped?
>

That's shaped.  I can shape 800+ Mbps, and the kernel ramps the clock rate
up to 2.5GHz as needed, IIRC.  I'm guessing that it might thermally limit at
some point, but I haven't had sustained >500Mbps traffic for long enough to
really exercise that.  Although covid WFH has definitely increased the
likelihood that I'm hitting >500Mbps downloads.


> some points:
>
> On inbound shaping especially, it is still best to lock network traffic
> to a single core on low-end platforms.
>
> Cake itself is not multicore, although the design essentially is. We
> did some work towards trying to make it shape across multiple cores
> and multiple hardware queues. If the locking contention could be
> minimized (RCU), I felt a win was possible here, but a bigger win
> would be to eliminate "mirred" from the ingress path entirely.
>
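(For anyone following along, the conventional ingress path that "mirred"
refers to is a redirect onto an ifb device, which is then shaped with cake;
a rough sketch, where eth0 and the 900mbit rate are illustrative:)

```shell
# Redirect all ingress traffic on the WAN interface to an ifb device,
# then shape it there with cake. Interface names and rate are assumptions.
ip link add name ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all matchall \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root cake bandwidth 900mbit besteffort ingress
```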

I was going to play around with shaping to lower levels across multiple
cores, as many of the loads I deal with are multi-stream, but I always
worry about the ack path, as the provisioned rates are so asymmetric
(35Mbps up).  I'm using `ack-filter-aggressive` on egress to help, although
I've found that the most aggressive ack filtering seems to hurt throughput.
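For reference, the egress side here is just cake's ack-filter keyword;
something like the following, where the interface name and rate are
illustrative:

```shell
# Egress shaping on an asymmetric link, with ACK filtering enabled.
# eth0 and 35mbit are assumptions matching the provisioned upstream.
tc qdisc replace dev eth0 root cake bandwidth 35mbit ack-filter
# The stronger variant, which may drop too many acks on some workloads:
# tc qdisc replace dev eth0 root cake bandwidth 35mbit ack-filter-aggressive
```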




>
> Even multiple transmit queues remain kind of dicey in Linux, and
> actually tend to slow network processing in most cases I've tried at
> gbit line rates. They also add latency, as (1) BQL is MIAD, not AIMD,
> so it stays "stuck" at a "good" level for a long time, and (2) each hw
> queue gets an additive fifo at this layer, so where you might need
> only 40k to keep a single hw queue busy, you end up with 160k with 4
> hw queues. This problem is getting worse and worse (64 queues are
> common in newer hardware, 1000s in really new hardware) and a revisit
> to how BQL does things in this case would be useful. Ideally it would
> share state (with a cross core variable and atomic locks) as to how
> much total buffering was actually needed "down there" across all the
> queues, but without trying it, I worry that that would end up costing
> a lot of cpu cycles.
>
> Feel free to experiment with multiple transmit queues locked to other
> cores with the set-affinity bits in /proc/interrupts. I'm sure these
> MUST be useful on some platform, but I think most of the use for
> multiple hw queues is when a locally processing application  is
> getting the data, not when it is being routed.
>
> Ironically, I guess, the shorter your queues the higher likelihood a
> given packet will remain in l2 or even l1 cache.
>

I'm pinning all the queues to cores, though I've pinned rx/tx for the same
interface to the same cores, with cores 0-1 doing LAN and 2-3 doing WAN
duties.  I may try matching flow directions per core (rx WAN and tx LAN on
the same core).
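(For anyone following along, the pinning goes through the per-IRQ affinity
masks; the IRQ numbers and interface names below are illustrative, not from
my actual box:)

```shell
# Find the per-queue IRQ numbers for each NIC...
grep -E 'eth[01]' /proc/interrupts
# ...then pin each one with a CPU bitmask (IRQ numbers are examples):
echo 1 > /proc/irq/27/smp_affinity   # eth0 (LAN) -> CPU0 (mask 0x1)
echo 2 > /proc/irq/28/smp_affinity   # eth0 (LAN) -> CPU1 (mask 0x2)
echo 4 > /proc/irq/29/smp_affinity   # eth1 (WAN) -> CPU2 (mask 0x4)
echo 8 > /proc/irq/30/smp_affinity   # eth1 (WAN) -> CPU3 (mask 0x8)
```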

One separate reason to set affinity on startup is that the reshuffling that
the kernel tries to do will cause things to stumble as the caches all miss.

The note about BQL is interesting...  Is that actually configurable?  (I
haven't gone looking before.)
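BQL's per-queue state is at least visible and clampable through sysfs; a
sketch (eth0 and the byte value are illustrative):

```shell
# Each tx queue exposes its BQL state under sysfs:
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit      # current limit
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight   # bytes in flight
# Clamp how far the limit can grow (in bytes), to keep a "stuck"
# MIAD limit from staying large:
echo 40000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
```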

OTOH, I've hit a point where trying to squeeze the most out of it just
doesn't seem necessary.  When I was bench-testing it (with local traffic
generation), I could saturate wire rates in both directions with cake
running, and limiting.  So...  Not much of a worry there.  But it's still
inconsistent on live traffic and with a real internet.  I'm not sure if
that is due to the dynamic frequency scaling, or just congestion at the
head-end, or what.

I was going to start a separate thread, but I've been contemplating what
measurements and stats I can long-term monitor to understand the
intermittent stumbles and hangs that I see.  I'm fairly certain that
they're in the "It can't be DNS....  ::sigh:: It's always DNS...."
category, though.  And if that's the case, I should just log all the
queries and look at the response times.  It seems to be marginally better
with dns-over-https (doing happy-eyeballs-like concurrent requests across
google and cloudflare), but I can't be certain.
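If I do end up logging all the query times, summarizing them is the easy
part; a sketch of the kind of thing I have in mind, assuming a log format of
one latency in milliseconds per line:

```shell
# Summarize a log of DNS response times (one integer, in ms, per line).
# Reports count, mean, and max -- enough to spot intermittent stalls.
dns_stats() {
  awk '{ sum += $1; if ($1 > max) max = $1 }
       END { printf "count=%d avg=%.1f max=%d\n", NR, sum / NR, max }'
}

printf '12\n300\n18\n' | dns_stats   # -> count=3 avg=110.0 max=300
```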



> >
> > On Tue, Jun 22, 2021 at 12:44 AM Giuseppe De Luca <dropheaders at gmx.com>
> wrote:
> >>
> >> Also a PC Engines APU4 will do the job
> >> (https://inonius.net/results/?userId=17996087f5e8 - this is a
> >> 1gbit/1gbit, with Openwrt/sqm-scripts set to 900/900.  ISP is Sony NURO
> >> in Japan). I will follow this thread to see if some interesting device
> >> pops up :)
> >>
> >> On 6/22/2021 6:12 AM, Sebastian Moeller wrote:
> >> >
> >> > On 22 June 2021 06:00:48 CEST, Stephen Hemminger <
> stephen at networkplumber.org> wrote:
> >> >> Is there any consumer hardware that can actually keep up and do AQM
> at
> >> >> 1Gbit?
> >> >          Over in the OpenWrt forums the same question pops up
> routinely once per week. The best answer ATM seems to be a combination of a
> raspberry pi4B with a decent USB3 gigabit ethernet dongle, a managed switch
> and any capable (OpenWrt) AP of the user's liking. With 4 arm A72 cores it
> will traffic-shape up to a gigabit, as reported by multiple users.
> >> >
> >> >
> >> >> It seems everyone seems obsessed with gamer Wifi 6. But can only do
> >> >> 300Mbit single
> >> >> stream with any kind of QoS.
> >> > IIUC most commercial home routers/APs bet on offload engines to do
> most of the heavy lifting, but as far as I understand only the NSS cores
> have a shaper and fq_codel module....
> >> >
> >> >
> >> >> It doesn't help that all the local ISPs claim 10Mbit upload even
> with
> >> >> 1G download.
> >> >> Is this a head-end provisioning problem or related to DOCSIS 3.0 (or
> >> >> later) modems?
> >> > For DOCSIS the issue seems to be an unfortunate frequency split
> between up- and downstream and the use of lower-efficiency coding schemes.
> >> > Over here the incumbent cable ISP provisions 50 Mbps for upstream
> and plans to increase that to 100 Mbps once the upstream is switched to
> DOCSIS 3.1.
> >> > I believe one issue is that most of the upstream is required for
> the reverse ACK traffic of the download, and hence it cannot be
> oversubscribed too much.... but I think we have real DOCSIS experts on the
> list, so I will stop my speculation here...
> >> >
> >> > Regards
> >> >           Sebastian
> >> >
> >> >
> >> >
> >> >
> >> >> _______________________________________________
> >> >> Bloat mailing list
> >> >> Bloat at lists.bufferbloat.net
> >> >> https://lists.bufferbloat.net/listinfo/bloat
>
>
>
> --
> Latest Podcast:
> https://www.linkedin.com/feed/update/urn:li:activity:6791014284936785920/
>
> Dave Täht CTO, TekLibre, LLC
>

