Subject: Re: [Rpm] in case anyone here digs podcasts
From: rjmcmahon
To: Tim Chown
Cc: Dave Taht, Rpm
Date: Mon, 20 Feb 2023 13:13:38 -0800

Hi Tim,

I can respond to iperf 2 questions around bufferbloat. My apologies to all for the longish email. (Feel free to contact me about a webex or zoom discussion if that would be helpful to any of your engineering teams.)

There are basically three things: metrics, mechanisms and methodologies. I'll touch on each a bit here, but it won't be comprehensive in any of these domains.

First, we believe bufferbloat is a design flaw and, in that context, it needs to be detected and rooted out well ahead of shipping products. (It also needs to be monitored for regression breakages.)

Second, bloat displacement units are either memory in bytes for TCP or packets for UDP. They're really not units of time (nor its inverse), though time deltas can be used to calculate a sampled value via Little's law: "the average number of items in a stationary queuing system, based on the average waiting time of an item within a system and the average number of items arriving at the system per unit of time." (Iperf 2 calls this metric inP.)

Next, we feel conflating bloat and low latency is misguided. And users don't care about low latency either. What people want is the fastest distributed causality system possible. Network I/O delays can, and many times do, drive that speed of causality. This speed is not throughput, nor phy rate, nor channel capacity, etc., though each may or may not have an effect.

Iperf 2 assumes two things (ignoring the need for advanced test case design done by skilled engineers) to root out bloat. First, the network under test is highly controlled and all traffic is basically synthetic. Second, iperf 2 assumes the test designer has synchronized the client & server clocks via something like ptp4l, ptpd2, GPS disciplined oscillators, etc. Iperf 2 only knows that clocks are sync'd when the user sets --trip-times on the client. No --trip-times, no one way delay (OWD) related metrics. (And --trip-times without sync'd clocks == garbage.)
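(As a rough sketch of the clock-sync prerequisite, not a recipe: the interface name below is just the NIC from this setup, linuxptp is only one of the options listed above, and details like which host ends up as PTP master are glossed over. The idea is to run something along these lines on both hosts before asserting --trip-times on the client:

  ptp4l -i enp2s0 -m &                         # discipline the NIC's PTP hardware clock
  phc2sys -s enp2s0 -c CLOCK_REALTIME -w -m &  # steer the system clock from that PHC
  iperf -c 192.168.1.58 -i 1 --trip-times      # client side; OWD metrics then show up on the server

GPS disciplined oscillators or ptpd2 work just as well; all iperf 2 cares about is that the two system clocks actually agree.)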
There are multiple metric options, e.g. --histograms, which provides the full distributions w/o the central limit theorem (CLT) averaging. One can also rate limit a traffic thread's writes or reads with -b, or play with the --near-congestion option on the client to weight delays on the sampled RTT. -P can be used as well to create concurrent traffic threads, as can --full-duplex (iperf 2 is a multi-threaded design). The new --working-loads options may be useful too.

There are also python scripts in the flows directory that can be used to code up different testing methodologies. The dependency there is ssh support on the devices under test and python 3.10+ (for its asyncio) as the controller. Visualizations are specifically deferred to users (iperf 2 only presents raw data). None of these are further discussed here, but more information can be found on the man page: https://iperf2.sourceforge.io/iperf-manpage.html (Sadly, writing documentation for iperf 2 is almost always last on my todo list.)

Also, with all the attention on mitigating congested queues, there is now a lot of sw trying to do "fancy queueing." Sadly, we find bloat mitigations to have intermittent failures, so a single, up-front design test around bloat is not sufficient across time, device and network. That's part of the reason why I'm skeptical about user-level tests too. They might pass one moment and fail the next. We find sw & devices have to be constantly tested, as human engineers break things at times w/o knowing it. Life as a T&M engineer, so to speak, is an always-on type of job.

The simplest way to measure bloat is to use an old TCP CCA, e.g. reno/cubic. First, see what's available and allowed, and what the default is:

[rjmcmahon@ryzen3950 iperf2-code]$ sysctl -a | grep congestion_control | grep tcp
net.ipv4.tcp_allowed_congestion_control = reno cubic bbr
net.ipv4.tcp_available_congestion_control = reno cubic bbr
net.ipv4.tcp_congestion_control = cubic

Let's first try default cubic with a single TCP flow using one-second sampling over two 10G NICs on the same 10G switch. Server output first, since it has the inP metric (which is about 1.26 MBytes). Note: TCP_NOTSENT_LOWAT is now set by default when --trip-times is used. One can use --tcp-write-prefetch to affect this, including disabling it. The event based writes message on the client shows this.
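(As an illustration only, with the exact syntax left to the man page: the client runs below show the 16384-byte watermark as "prefetch=16384" and "pending queue watermark at 16384 bytes", while something like

  iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times --tcp-write-prefetch 65536

would request a larger TCP_NOTSENT_LOWAT value instead.)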
[root@rjm-nas rjmcmahon]# iperf -s -i 1 -e -B 0.0.0.0%enp2s0
------------------------------------------------------------
Server listening on TCP port 5001 with pid 23097
Binding to local address 0.0.0.0 and iface enp2s0
Read buffer size: 128 KByte (Dist bin width=16.0 KByte)
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.58%enp2s0 port 5001 connected with 192.168.1.69 port 52850 (trip-times) (sock=4) (peer 2.1.9-rc2) (icwnd/mss/irtt=14/1448/194) on 2023-02-20 12:38:18.348 (PST)
[ ID] Interval  Transfer  Bandwidth  Burst Latency avg/min/max/stdev (cnt/size)  inP  NetPwr  Reads=Dist
[ 1] 0.00-1.00 sec 1.09 GBytes 9.35 Gbits/sec 1.131/0.536/6.016/0.241 ms (8916/131072) 1.26 MByte 1033105 23339=3542:3704:3647:3622:7106:1357:238:123
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec 1.128/0.941/1.224/0.034 ms (8978/131076) 1.27 MByte 1042945 22282=3204:3388:3467:3311:5251:2611:887:163
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec 1.120/0.948/1.323/0.036 ms (8979/131068) 1.26 MByte 1050406 22334=3261:3324:3449:3391:5772:1945:1005:187
[ 1] 3.00-4.00 sec 1.10 GBytes 9.41 Gbits/sec 1.121/0.962/1.217/0.034 ms (8978/131070) 1.26 MByte 1049555 23457=3554:3669:3684:3645:7198:1284:344:79
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec 1.116/0.942/1.246/0.034 ms (8978/131079) 1.25 MByte 1054966 23884=3641:3810:3857:3779:8029:449:292:27
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec 1.115/0.957/1.227/0.035 ms (8979/131064) 1.25 MByte 1055858 22756=3361:3476:3544:3446:6247:1724:812:146
[ 1] 6.00-7.00 sec 1.10 GBytes 9.41 Gbits/sec 1.119/0.967/1.213/0.033 ms (8978/131074) 1.26 MByte 1051938 23580=3620:3683:3724:3648:6672:2048:163:22
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec 1.116/0.962/1.225/0.033 ms (8978/131081) 1.25 MByte 1054253 23710=3645:3703:3760:3732:7402:1178:243:47
[ 1] 8.00-9.00 sec 1.10 GBytes 9.41 Gbits/sec 1.117/0.951/1.229/0.034 ms (8979/131061) 1.25 MByte 1053809 22917=3464:3467:3521:3551:6154:2069:633:58
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec 1.111/0.934/1.296/0.033 ms (8978/131078) 1.25 MByte 1059127 22703=3336:3477:3499:3490:5961:2084:759:97
[ 1] 0.00-10.00 sec 11.0 GBytes 9.41 Gbits/sec 1.119/0.536/6.016/0.083 ms (89734/131072) 1.26 MByte 1050568 230995=34633:35707:36156:35621:65803:16750:5376:

[rjmcmahon@ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times
------------------------------------------------------------
Client connecting to 192.168.1.58, TCP port 5001 with pid 1063336 via enp4s0 (1 flows)
Write buffer size: 131072 Byte
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
Event based writes (pending queue watermark at 16384 bytes)
------------------------------------------------------------
[ 1] local 192.168.1.69%enp4s0 port 52850 connected with 192.168.1.58 port 5001 (prefetch=16384) (trip-times) (sock=3) (icwnd/mss/irtt=14/1448/304) (ct=0.35 ms) on 2023-02-20 12:38:18.347 (PST)
[ ID] Interval  Transfer  Bandwidth  Write/Err  Rtry  Cwnd/RTT(var)  NetPwr
[ 1] 0.00-1.00 sec 1.09 GBytes 9.36 Gbits/sec 8928/0 25 1406K/1068(27) us 1095703
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 3 1406K/1092(42) us 1077623
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 1406K/1077(26) us 1092632
[ 1] 3.00-4.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0 1406K/1064(28) us 1106105
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 1406K/1065(23) us 1104943
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 1406K/1072(35) us 1097728
[ 1] 6.00-7.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0 1406K/1065(28) us 1105066
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 1406K/1057(30) us 1113306
[ 1] 8.00-9.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 11 1406K/1077(22) us 1092753
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 1406K/1052(23) us 1118597
[ 1] 0.00-10.01 sec 11.0 GBytes 9.40 Gbits/sec 89734/0 39 1406K/1057(25) us 1111889
[rjmcmahon@ryzen3950 iperf2-code]$
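(As a quick sanity check of inP against Little's law: the cubic summary above shows 9.41 Gbits/sec, i.e. about 1.18 GBytes/sec arriving, and an average burst latency of 1.119 ms, so the average amount in the network should be roughly 1.18 GBytes/sec x 0.001119 sec, or about 1.3 MBytes, which lines up with the reported inP of 1.26 MByte.)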
Next with BBR (use -Z, or --linux-congestion algo, to set the TCP congestion control algorithm; Linux only), the inP drops to under 600 KBytes.

[root@rjm-nas rjmcmahon]# iperf -s -i 1 -e -B 0.0.0.0%enp2s0
------------------------------------------------------------
Server listening on TCP port 5001 with pid 23515
Binding to local address 0.0.0.0 and iface enp2s0
Read buffer size: 128 KByte (Dist bin width=16.0 KByte)
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.58%enp2s0 port 5001 connected with 192.168.1.69 port 32972 (trip-times) (sock=4) (peer 2.1.9-rc2) (icwnd/mss/irtt=14/1448/191) on 2023-02-20 12:51:12.258 (PST)
[ ID] Interval  Transfer  Bandwidth  Burst Latency avg/min/max/stdev (cnt/size)  inP  NetPwr  Reads=Dist
[ 1] 0.00-1.00 sec 1.09 GBytes 9.37 Gbits/sec 0.520/0.312/5.681/0.321 ms (8939/131074) 596 KByte 2251265 22481=3223:3446:3546:3432:5649:2147:914:124
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec 0.488/0.322/0.630/0.036 ms (8978/131074) 561 KByte 2409868 23288=3487:3672:3610:3610:6525:1980:392:12
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec 0.488/0.210/1.114/0.043 ms (8972/131071) 560 KByte 2409679 23538=3567:3744:3653:3740:7173:1167:431:63
[ 1] 3.00-4.00 sec 1.10 GBytes 9.41 Gbits/sec 0.497/0.339/0.617/0.038 ms (8978/131077) 572 KByte 2365971 22509=3238:3455:3400:3479:5652:2315:927:43
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec 0.496/0.326/0.642/0.040 ms (8979/131066) 570 KByte 2371488 22116=3154:3348:3428:3239:5002:2704:1099:142
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec 0.489/0.318/0.689/0.039 ms (8978/131071) 562 KByte 2405955 22742=3400:3438:3470:3472:6117:2103:709:33
[ 1] 6.00-7.00 sec 1.10 GBytes 9.41 Gbits/sec 0.483/0.320/0.601/0.035 ms (8978/131073) 555 KByte 2437891 23678=3641:3721:3671:3680:7201:1752:10:2
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec 0.490/0.329/0.643/0.039 ms (8979/131067) 563 KByte 2402744 23006=3428:3584:3533:3603:6417:1527:794:120
[ 1] 8.00-9.00 sec 1.10 GBytes 9.41 Gbits/sec 0.488/0.250/2.262/0.085 ms (8977/131085) 561 KByte 2412134 23646=3621:3774:3694:3686:6832:1813:137:89
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec 0.485/0.250/0.743/0.037 ms (8979/131057) 557 KByte 2427710 23415=3546:3645:3638:3669:7168:1374:362:13
[ 1] 0.00-10.00 sec 11.0 GBytes 9.41 Gbits/sec 0.492/0.210/5.681/0.111 ms (89744/131072) 566 KByte 2388488 230437=34307:35831:35645:35613:63743:18882:5775:641

[rjmcmahon@ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times -Z bbr
------------------------------------------------------------
Client connecting to 192.168.1.58, TCP port 5001 with pid 1064072 via enp4s0 (1 flows)
Write buffer size: 131072 Byte
TCP congestion control set to bbr
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
Event based writes (pending queue watermark at 16384 bytes)
------------------------------------------------------------
[ 1] local 192.168.1.69%enp4s0 port 32972 connected with 192.168.1.58 port 5001 (prefetch=16384) (trip-times) (sock=3) (icwnd/mss/irtt=14/1448/265) (ct=0.32 ms) on 2023-02-20 12:51:12.257 (PST)
[ ID] Interval  Transfer  Bandwidth  Write/Err  Rtry  Cwnd/RTT(var)  NetPwr
[ 1] 0.00-1.00 sec 1.09 GBytes 9.38 Gbits/sec 8945/0 35 540K/390(18) us 3006254
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 528K/409(25) us 2877175
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec 8972/0 19 554K/465(35) us 2528985
[ 1] 3.00-4.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0 562K/473(27) us 2488151
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 10 540K/467(41) us 2519838
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 531K/416(21) us 2828761
[ 1] 6.00-7.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0 554K/426(22) us 2762665
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 543K/405(21) us 2905591
[ 1] 8.00-9.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 526K/433(22) us 2717701
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0 537K/446(20) us 2638485
[ 1] 10.00-11.01 sec 128 KBytes 1.04 Mbits/sec 1/0 0 537K/439(20) us 296
[ 1] 0.00-11.01 sec 11.0 GBytes 8.55 Gbits/sec 89744/0 64 537K/439(20) us 2434241

I hope this helps a bit. There are a lot more possibilities, so engineers can play with knobs to affect things and see how their network performs.
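(To give a flavor of the knobs mentioned earlier, and only as a sketch with the exact option placement and syntax left to the man page: one could rerun the same test with latency histograms enabled where the latency is measured, here the server, and with several concurrent client threads, e.g. something like

  server: iperf -s -i 1 -e -B 0.0.0.0%enp2s0 --histograms
  client: iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times -P 4

and then compare the per-thread burst-latency distributions against the single-flow runs above.)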
Bob

> Hi,
>
>> On 19 Feb 2023, at 23:49, Dave Taht via Rpm wrote:
>>
>> https://packetpushers.net/podcast/heavy-networking-666-improving-quality-of-experience-with-libreqos/
>
> I’m a bit lurgy-ridden today so had a listen as it’s nice passive
> content. I found it good and informative, though somewhat in the weeds
> (for me) after about half way through, but I looked up a few things
> that were brought up and learnt a few useful details, so overall well
> worth the time, thanks.
>
>> came out yesterday. You'd have to put up with about 8 minutes of my
>> usual rants before we get into where we are today with the project and
>> the problems we are facing. (trying to scale past 80Gbit now) We have
>> at this point validated the behavior of several benchmarks, and are
>> moving towards more fully emulating various RTTs. See
>> https://payne.taht.net and click on run bandwidth test to see how we
>> are moving along. It is so nice to see sawtooths in real time!
>
> I tried the link and clicked the start test. I feel I should be able
> to click a stop test button too, but again interesting to see :)
>
>> Bufferbloat is indeed, the number of the beast.
>
> I’m in a different world to the residential ISP one that was the focus
> of what you presented, specifically the R&E networks where most users
> are connected via local Ethernet campus networks. But there will be a
> lot of WiFi of course.
>
> It would be interesting to gauge to what extent bufferbloat is a
> problem for typical campus users vs typical residential network
> users. Is there data on that? We’re very interested in the new rpm
> (well, rps!) draft and the iperf2 implementation, which we’ve run from
> both home network and campus systems to an iperf2 server on our NREN
> backbone. I think my next question on the iperf2 tool would be the
> methodology to ramp up the testing to see at what point bufferbloat
> is experienced (noting some of your per-hit comments in the podcast).
>
> Regarding the speeds, we are interested in high speed, large scale file
> transfers, e.g. for the CERN community, so might (say) typically see
> iperf3 test up to 20-25 Gbps single flow, or iperf2 (which is much
> better multi-flow) filling a high RTT 100G link with around half a
> dozen flows. In practice though the CERN transfers are hundreds or
> thousands of flows, each of a few hundred Mbps or a small number of
> Gbps, and the site access networks for the larger facilities are
> 100G-400G.
>
> On the longest prefix match topic, are there people looking at that
> with white box platforms, open NOSes and P4 type solutions?
>
> Tim
> _______________________________________________
> Rpm mailing list
> Rpm@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/rpm