<div dir="ltr"><div>G'day Mr David Reed,</div><div><br></div><div>Thanks for the comments.</div><div><br></div><div>I definitely agree with your sentiments, and the tests do NOT simply represent Intel versus ARM.<br></div><div><br></div><div>Perhaps I should have been clearer about the objectives of the testing:<br></div><div><br></div><div>I'm curious to understand the performance of these lower end SoC devices, because these are the types of devices that act as home gateway routers, as access points, and such. There are many millions of these devices out there and I don't know how well understood their performance is:</div><div>e.g. How bad is my Spectrum Internet cable modem?<br></div><div>e.g. I have a Unifi security gateway and its "smart queue" performance is pretty poor ( <200 Mb/s ). Why is it so poor?<br></div><div><br></div><div>Obviously, with real servers ( and even virtual AWS ones ) which
have real NICs, you get things like multi-queues with RSS, and a lot more
tuning knobs, and so they can go a lot faster.</div><div><br></div><div>In the tests so far, the Asus CN60 device with the r8169 performs pretty well, and the NIC is likely contributing positively. The default configuration has a bunch of offloads enabled:<br></div><div><br></div><div>root@asus-cn60-2:/home/das# ethtool --show-features enp1s0 | grep ": on"<br>rx-checksumming: on<br>tx-checksumming: on<br> tx-checksum-ipv4: on<br> tx-checksum-ipv6: on<br>generic-receive-offload: on<br>rx-vlan-offload: on<br>tx-vlan-offload: on<br>highdma: on [fixed]<br></div><div><br></div><div>However, based on these initial tests, which are not complete, it's certainly curious that the Pi4 is doing ~923 Mbit/s with pfifo_fast and then doing significantly less ( ~621 Mbit/s ) with cake. I'd like to understand this in more detail, and DaveT has recommended adding 20ms or 40ms of latency. The cake tests so far had rtt 1ms and rtt 3ms, which might be too low. ( If it is too low, then maybe it would make sense to remove the "rtt lan = rtt 1ms" option, as it's a misleading configuration option? )<br></div><div><br></div><div>Definitely, during the testing these little devices have the
NIC IRQs all going through core 0, so I want to explore tuning options. <br></div><div><br></div><div>root@rpi4b:/home/das# cat /proc/interrupts | grep -E '(CPU0|eth0)'<br> CPU0 CPU1 CPU2 CPU3 <br> 30: 38651749 0 0 0 GICv2 189 Level eth0 <--- IRQs only going to CPU0<br> 31: 20418643 0 0 0 GICv2 190 Level eth0<br></div><div><br></div><div>Some ideas include:</div><div>- Moving most processes off core0. e.g. Configure all the systemd slices NOT to use core0, so core0 is essentially freed to only service the IRQs<br></div><div>- RPS ( <a href="https://www.kernel.org/doc/html/latest/networking/scaling.html#rps-receive-packet-steering">https://www.kernel.org/doc/html/latest/networking/scaling.html#rps-receive-packet-steering</a> ). e.g. Can the other cores get more involved?<br></div><div>- Tuning ideas from here: <a href="https://github.com/leandromoreira/linux-network-performance-parameters" target="_blank">https://github.com/leandromoreira/linux-network-performance-parameters</a>. Specifically, I was wondering about increasing the netdev_budget sysctls.</div><div><br></div><div>The defaults are shown here:</div><div><br></div><div>root@rpi4b:/home/das# sysctl -a | grep netdev_budget<br>net.core.netdev_budget = 300<br>net.core.netdev_budget_usecs = 8000<br></div><div><br></div><div>"Armbian's kernel isn't a particularly high performance kernel build."</div><div><br></div><div>Happy to discuss any recommended tuning. Armbian is very easy to install on the microSD card. ( Actually, I have the LicheePi 4A RISC-V, but can't find an easy image to just load on a microSD card. )<br></div><div><br></div><div><br></div><div>Over the weekend, I reconfigured the testing setup using a lot more VLANs. Now each device has ALL the different qdiscs configured on different VLANs and IPs, allowing the iperf/flent tests to be run one after the other with no need to change the qdiscs between tests. 
I'm currently repeating every combination of tests, before adding the netem 20/40ms latency as DaveT suggested. ( Tests take a while: 8 devices * 6 qdiscs = 48 tests, at 10 minutes each = 480 minutes = 8 hours )<br></div><div><br></div><div>Roughly the plan is:</div><div>1. Retest all combinations. This is to confirm the starting position. <--- running now<br></div><div>2. Add netem latency 20 and 40ms, and retest all combinations. I'm hoping Pi4 cake performance will be closer to > 900 Mb/s<br></div><div>3. Apply some tuning options, and retest all combinations</div><div><br></div><div>Kind regards,</div><div>Dave Seddon<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Sep 17, 2023 at 6:05 PM Dave Taht <<a href="mailto:dave.taht@gmail.com">dave.taht@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div>A huge thanks to dave seddon for buckling down and doing some comprehensive testing of a variety of arm64 gear!<div><br clear="all"><div><a href="https://docs.google.com/document/d/1HxIU_TEBI6xG9jRHlr8rzyyxFEN43zMcJXUFlRuhiUI/edit#heading=h.bpvv3vr500nw" target="_blank">https://docs.google.com/document/d/1HxIU_TEBI6xG9jRHlr8rzyyxFEN43zMcJXUFlRuhiUI/edit#heading=h.bpvv3vr500nw</a><br></div><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Oct 30: <a href="https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html" target="_blank">https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html</a></div><div>Dave Täht CSO, LibreQos<br></div></div></div></div></div>
</blockquote></div><br clear="all"><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>Regards,<br></div>Dave Seddon<br>+1 415 857 5102<br></div></div></div></div></div></div>