From: dave seddon
Date: Mon, 18 Sep 2023 12:50:10 -0700
To: Cake List
Subject: Re: [Cake] some comprehensive arm64 w/cake results
List-Id: Cake - FQ_codel the next generation

G'day Mr David Reed,

Thanks for the comments.

I definitely agree with your sentiments, and the tests do NOT simply represent Intel versus ARM.

Perhaps I should have been clearer about the objectives of the testing:

I'm curious to understand the performance of these lower-end SoC devices, because these are the types of devices that act as home gateway routers, access points, and such.  There are many millions of these devices out there, and I don't know how well understood their performance is:
e.g. How bad is my Spectrum Internet cable modem?
e.g. I have a Unifi security gateway and its "smart queue" performance is pretty poor ( <200 Mb/s ).  Why is it so poor?

Obviously, real servers ( and even virtual AWS ones ) have real NICs, so you get things like multi-queue with RSS and a lot more tuning knobs, and so they can go a lot faster.

In the tests so far, the Asus CN60 device with the r8169 performs pretty well, and the NIC is likely contributing positively.  The default configuration has a bunch of offloads enabled:

root@asus-cn60-2:/home/das# ethtool --show-features enp1s0 | grep ": on"
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ipv6: on
generic-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
highdma: on [fixed]

However, based on these initial tests, which are not complete, it's certainly curious that the Pi4 is doing ~923 Mbit/s with pfifo_fast and then significantly less ( ~621 Mbit/s ) with cake.  I'm interested to understand this in more detail; DaveT has recommended adding 20ms or 40ms of latency.  The cake tests so far had rtt 1ms and rtt 3ms, which might be too low.  ( If it is too low, then maybe it would make sense to remove the "rtt lan = rtt 1ms" option, as it's a misleading configuration option? )

During the testing these little devices definitely have the NIC IRQs all going through core 0, so I want to explore tuning options.

root@rpi4b:/home/das# cat /proc/interrupts | grep -E '(CPU0|eth0)'
           CPU0       CPU1       CPU2       CPU3
 30:   38651749          0          0          0     GICv2 189 Level     eth0     <--- IRQs only going to CPU0
 31:   20418643          0          0          0     GICv2 190 Level     eth0
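
To put a number on that imbalance, the per-CPU counters for the interface can be summed. This is a minimal sketch, assuming the usual /proc/interrupts layout (a header row of CPU names, then one row per IRQ with the counters in the columns after the IRQ number and the device name as the last field, without the hand-written "<---" annotation). The function name irq_share is made up for illustration.

```shell
#!/bin/sh
# irq_share IFACE [FILE]: print each CPU's count and share of IFACE's interrupts.
# FILE defaults to /proc/interrupts; pass a saved copy to analyse it offline.
irq_share() {
    awk -v ifc="$1" '
        # Header row: remember how many CPUs there are and what they are called.
        NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
        # IRQ rows for this interface: field 1 is "NN:", counters follow.
        $NF == ifc {
            for (i = 1; i <= ncpu; i++) { count[i] += $(i + 1); total += $(i + 1) }
        }
        END {
            if (total == 0) { print "no interrupts found for " ifc; exit 1 }
            for (i = 1; i <= ncpu; i++)
                printf "%s %.0f %.1f%%\n", name[i], count[i], 100 * count[i] / total
        }
    ' "${2:-/proc/interrupts}"
}
```

Running `irq_share eth0` before and after a tuning change gives a quick check of whether the load has actually moved off CPU0.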

Some ideas include:
- Moving most processes off core0, e.g. configure all the systemd slices NOT to use core0, so core0 is essentially freed to only service the IRQs
- RPS ( https://www.kernel.org/doc/html/latest/networking/scaling.html#rps-receive-packet-steering ), e.g. can the other cores get more involved?
- Tuning ideas from here: https://github.com/leandromoreira/linux-network-performance-parameters. Specifically, I was wondering about increasing the netdev_budget sysctls.
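
One way to get the other cores involved is Receive Packet Steering (RPS). The kernel's networking scaling documentation describes a per-queue hex bitmap, /sys/class/net/<dev>/queues/rx-<n>/rps_cpus, naming the CPUs allowed to process received packets. Below is a small sketch for building that bitmap. The choice of CPUs 1-3 (leaving CPU0 to the hard IRQs) and the rx-0 queue are only assumptions to test, and the function name is made up here.

```shell
#!/bin/sh
# cpu_mask CPU...: print the hex bitmap with one bit set per listed CPU number.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Example (needs root, so left as a comment rather than executed):
#   cpu_mask 1 2 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus
# List /sys/class/net/eth0/queues/ first to see which rx queues the driver exposes.
```

For CPUs 1, 2 and 3 this prints `e`; writing `0` to rps_cpus turns RPS back off for that queue.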

The netdev_budget defaults are shown here:

root@rpi4b:/home/das# sysctl -a | grep netdev_budget
net.core.netdev_budget = 300
net.core.netdev_budget_usecs = 8000
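
Before raising the budget it may be worth checking whether it is actually being exhausted. In /proc/net/softnet_stat each row is one CPU and the values are hexadecimal. The third column (time_squeeze) counts how often the softirq loop stopped with work still pending because netdev_budget or netdev_budget_usecs ran out. Here is a minimal sketch that sums it. The function name is illustrative, and the values in the trailing comment (600 and 16000) are arbitrary starting points, not recommendations.

```shell
#!/bin/sh
# time_squeeze_total [FILE]: sum column 3 (time_squeeze, hex) over all CPUs.
# FILE defaults to /proc/net/softnet_stat.
time_squeeze_total() {
    total=0
    while read -r _processed _dropped squeezed _rest; do
        # Each field is bare hex without a prefix, so add "0x" for the shell.
        total=$(( total + 0x$squeezed ))
    done < "${1:-/proc/net/softnet_stat}"
    echo "$total"
}

# If the total grows during a test run, one experiment (as root) would be:
#   sysctl -w net.core.netdev_budget=600 net.core.netdev_budget_usecs=16000
```

If the counter does not move during an iperf run, a larger budget is probably not the bottleneck.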

"Armbian's kernel isn't a particularly high performance kernel build."

Happy to discuss any recommended tuning.  Armbian is very easy to install on a microSD card.  ( Actually, I have the LicheePi 4A RISC-V, but can't find an easy image to just load on a microSD card. )


Over the weekend, I reconfigured the testing setup using a lot more VLANs.  Now each device has ALL the different qdiscs configured on different VLANs and IPs, allowing the iperf/flent tests to be run one after the other with no need to change the qdiscs between tests.  I'm currently repeating every combination of test, before adding the netem 20/40ms latency as DaveT suggested.  ( Tests take a while: 8 devices * 6 qdiscs = 48 tests, at 10 minutes per test = 480 minutes = 8 hours )
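
The arithmetic in the parenthesis, written out so it can be re-run when devices, qdiscs or test lengths change. The function name is made up; the numbers in the example call are the ones quoted above.

```shell
#!/bin/sh
# test_plan DEVICES QDISCS MINUTES_PER_TEST: estimate one sequential pass.
# Hours are integer-divided, so a partial hour is rounded down.
test_plan() {
    tests=$(( $1 * $2 ))
    minutes=$(( tests * $3 ))
    echo "$tests tests, $minutes minutes, $(( minutes / 60 )) hours"
}

# test_plan 8 6 10
```

Each extra pass (for example one per netem delay value) costs the same again, assuming the test length stays at 10 minutes.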

Roughly the plan is:
1. Retest all combinations.  This is to confirm the starting position. <--- running now
2. Add netem latency of 20ms and 40ms, and retest all combinations.  I'm hoping Pi4 cake performance will be closer to 900 Mb/s or better
3. Apply some tuning options, and retest all combinations
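
Step 2 could look roughly like the following. netem's delay option is standard tc syntax. What is assumed here is that the delay is added on a separate interface or box in the path (eth1 is a placeholder), not on the interface carrying the qdisc under test, since a device has only one root qdisc. The helper just prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# netem_cmds DEV DELAY...: print one tc command per delay value.
# "replace" creates the root qdisc if it is absent and swaps it otherwise.
netem_cmds() {
    dev=$1
    shift
    for delay in "$@"; do
        echo "tc qdisc replace dev $dev root netem delay $delay"
    done
}

# netem_cmds eth1 20ms 40ms
```

One thing to check: netem's queue limit defaults to 1000 packets, and 40 ms at ~1 Gbit/s is about 5 MB in flight, or roughly 3,300 full-size packets, so a larger `limit` value may be needed to stop netem itself from dropping.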

Kind regards,
Dave Seddon

On Sun, Sep 17, 2023 at 6:05 PM Dave Taht <dave.taht@gmail.com> wrote:

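> A huge thanks to dave seddon for buckling down and doing some
> comprehensive testing of a variety of arm64 gear!
>
> https://docs.google.com/document/d/1HxIU_TEBI6xG9jRHlr8rzyyxFEN43zMcJXUFlRuhiUI/edit#heading=h.bpvv3vr500nw
>
> --
> Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
> Dave Täht CSO, LibreQos
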
--
Regards,
Dave Seddon
+1 415 857 5102