From: "David P. Reed"
To: "dave seddon"
Cc: "Cake List"
Date: Mon, 18 Sep 2023 16:24:50 -0400 (EDT)
Subject: Re: [Cake] some comprehensive arm64 w/cake results
Message-ID: <1695068690.78066946@apps.rackspace.com>
List-Id: Cake - FQ_codel the next generation

On Monday, September 18, 2023 3:50pm, "dave seddon via Cake" said:

> G'day Mr David Reed,
>
> Thanks for the comments.
>
> Definitely agree with your sentiments and
> the tests definitely do NOT
> simply represent Intel versus ARM.
>
> Perhaps I should have been more clear about the objectives of the testing:

It's just an issue I'm sensitive to, because throughout my career I've read "Brand X is slow" when the test was actually testing something else. (An annoying post popped up on Medium today that claimed "WebAssembly doesn't speed up Web applications" based on a badly designed Linux Foundation-commissioned study that the poster misunderstood. The poster also seemed to think that running web applications using the laptop's cycles is bad compared to running web applications exclusively on the server in the cloud.) This already had me in a sour mood.

>
> I'm curious to understand the performance of these lower end SoC devices,
> because these are the types of devices that act as home gateway routers, as
> access points, and such. There are many many millions of these devices out
> there and I don't know how well understood their performance is:
> e.g. How bad is my Spectrum Internet cable modem?
> e.g. I have a Unifi security gateway and its "smart queue" performance is
> pretty poor ( <200 Mb/s ). Why is it so poor?

I'm curious, too! We know that on older home routers, with really slow MIPS processors, Cake struggles with GigE. As these old MIPS designs get phased out and replaced by ARM designs, it will matter.
Raspberry Pi 4s just aren't very good at networking because of their I/O architecture on the board, just as they are slow at USB in general. That's why the CM4 is interesting.
It's interesting that the PiHole has gotten so popular - it would run better on a Pi with a better network architecture.

>
> Obviously, with real servers ( and even virtual AWS ones ) which have real
> NICs, you get things like multi-queues with RSS, and a lot more tuning
> knobs, and so they can go a lot faster.
>
> In the tests so far, the Asus CN60 device with the r8169 performs pretty
> well, where the NIC is likely to be contributing positively. The default
> configuration has a bunch of off-loading enabled:
>
> root@asus-cn60-2:/home/das# ethtool --show-features enp1s0 | grep ": on"
> rx-checksumming: on
> tx-checksumming: on
> tx-checksum-ipv4: on
> tx-checksum-ipv6: on
> generic-receive-offload: on
> rx-vlan-offload: on
> tx-vlan-offload: on
> highdma: on [fixed]
>
> However, based on these initial tests, which are not complete, it's
> certainly curious that the Pi4 is doing ~923 Mbit/s with pfifo_fast and then
> doing significantly less ( ~621 Mbit/s ) with cake. I'm interested to
> understand this in more detail, where DaveT has recommended adding 20ms or
> 40ms. The cake tests so far had rtt 1ms and rtt 3ms, which might be too
> low. ( If it is too low, then maybe it would make sense to remove the "rtt lan
> = rtt 1ms" option, as it's a misleading configuration option? )
>
> Definitely, during the testing these little devices have the NIC IRQs all
> going through core 0, so I want to explore tuning options.
>
> root@rpi4b:/home/das# cat /proc/interrupts | grep -E '(CPU0|eth0)'
>            CPU0       CPU1       CPU2       CPU3
>  30:   38651749          0          0          0     GICv2 189 Level     eth0   <--- IRQs only going to CPU0
>  31:   20418643          0          0          0     GICv2 190 Level     eth0
>
> Some ideas include:
> - Moving most processes off core0. e.g.
>   Configure all the systemd slices NOT
>   to use core0, so core0 is essentially freed to only service the IRQs
> - RPS (
>   https://www.kernel.org/doc/html/latest/networking/scaling.html#rps-receive-packet-steering
>   ). e.g. Can the other cores get more involved?
> - Tuning ideas from here:
>   https://github.com/leandromoreira/linux-network-performance-parameters.
>   Specifically, I was wondering about increasing netdev_budget sysctls.
>
> The defaults are shown here
>
> root@rpi4b:/home/das# sysctl -a | grep netdev_budget
> net.core.netdev_budget = 300
> net.core.netdev_budget_usecs = 8000
>
> "Armbian's kernel isn't a particularly high performance kernel build."
>
> Happy to discuss any recommended tuning. Armbian is very easy to install
> on the microSD card. ( Actually, I have the LicheePi 4A RISC-V, but can't
> find an easy image to just load on a microSD card. )
>
>
> Over the weekend, I reconfigured the testing setup using a lot more VLANs.
> Now each device has ALL the different qdiscs configured on different VLANs
> and IPs, allowing the iperf/flent tests to be run one after the other with
> no need to change the qdiscs between tests. I'm currently repeating every
> combination of test, before adding the netem 20/40ms latency as DaveT
> suggested. ( Tests take a while: 8 devices * 6 qdiscs = 48 tests, times 10
> minutes per test = 480 minutes = 8 hours )
>
> Roughly the plan is:
> 1. Retest all combinations. This is to confirm the starting position. <--- running now
> 2. Add netem latency 20 and 40ms, and retest all combinations. I'm hoping
>    Pi4 cake performance will be closer to > 900 Mb/s
> 3. Apply some tuning options, and retest all combinations
>
I'm very interested in seeing your results after this.
Great job so far.

> Kind regards,
> Dave Seddon
>
> On Sun, Sep 17, 2023 at 6:05 PM Dave Taht wrote:
>
>>
>> A huge thanks to dave seddon for buckling down and doing some
>> comprehensive testing of a variety of arm64 gear!
>>
>>
>> https://docs.google.com/document/d/1HxIU_TEBI6xG9jRHlr8rzyyxFEN43zMcJXUFlRuhiUI/edit#heading=h.bpvv3vr500nw
>>
>> --
>> Oct 30:
>> https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
>> Dave Täht CSO, LibreQos
>>
>
>
> --
> Regards,
> Dave Seddon
> +1 415 857 5102
>
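
For anyone wanting to try the RPS idea from the quoted list (letting cores other than CPU0 share the receive work), here is a minimal sketch of how the CPU bitmap could be worked out. It assumes a 4-core board like the Pi 4 and the interface name eth0 from the quoted /proc/interrupts output. The helper names rps_cpu_mask and rps_sysfs_path are hypothetical, made up for illustration. The kernel scaling document linked in the quoted text describes rps_cpus as a hexadecimal CPU bitmap under /sys/class/net/<dev>/queues/rx-<n>/. Whether this actually helps cake on these boards is untested here.

```python
def rps_cpu_mask(cpus):
    """Return the hex bitmap string for the given CPU indices (bit N = CPU N)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def rps_sysfs_path(dev, rx_queue=0):
    """Path of the per-queue RPS bitmap file (hypothetical helper)."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"


if __name__ == "__main__":
    # Assumption: 4 cores with the NIC IRQs on CPU0, so steer to CPUs 1-3.
    mask = rps_cpu_mask([1, 2, 3])
    # Print the shell command rather than writing to sysfs directly.
    print(f"echo {mask} > {rps_sysfs_path('eth0')}")
```

The script only prints the command, so it can be reviewed before being run as root on the device. Leaving CPU0 out of the mask is a design choice that matches the "free core0 for IRQs" idea in the quoted list, not a recommendation from the kernel documentation.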