[Rpm] in case anyone here digs podcasts

rjmcmahon rjmcmahon at rjmcmahon.com
Mon Feb 20 16:13:38 EST 2023


Hi Tim,

I can respond to iperf 2 questions around bufferbloat. My apologies to 
all for the longish email. (Feel free to contact me about a webex or 
zoom discussion if that would be helpful to any of your engineering 
teams.)

There are basically three things: metrics, mechanisms, and methodologies. 
I'll touch on each a bit here, but it won't be comprehensive in any of 
these domains.

First, we believe bufferbloat is a design flaw and, in that context, it 
needs to be detected and rooted out well ahead of shipping products. (It 
also needs to be monitored for regression breakages.) Second, bloat 
displacement is measured in units of memory: bytes for TCP or packets 
for UDP. It's really not a unit of time (nor its inverse), though time 
deltas can be used to calculate a sampled value via Little's law: "the 
average number of items in a stationary queuing system, based on the 
average waiting time of an item within a system and the average number 
of items arriving at the system per unit of time." Iperf 2 calls this 
metric inP.
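As a concrete sketch of that calculation (my arithmetic, not iperf 2's 
actual source): Little's law says L = lambda * W, so a sampled inP falls 
out of the measured throughput (arrival rate in bytes/sec) and the mean 
one-way burst latency. The numbers below are taken from the cubic run 
shown later in this mail.

```python
# Little's law: inP (bytes in flight/queued) = arrival rate * average delay.
# Values are from the cubic run later in this mail; this is an illustrative
# sketch, not iperf 2's internal code.

def inp_bytes(throughput_bits_per_sec: float, avg_delay_sec: float) -> float:
    """Sampled in-progress bytes via Little's law (L = lambda * W)."""
    arrival_rate_bytes = throughput_bits_per_sec / 8.0  # lambda, bytes/sec
    return arrival_rate_bytes * avg_delay_sec           # W, mean delay in sec

inp = inp_bytes(9.41e9, 1.119e-3)   # 9.41 Gbit/s, 1.119 ms mean burst latency
print(f"inP ~ {inp / (1 << 20):.2f} MByte")  # close to the reported 1.26 MByte
```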

Next, we feel conflating bloat and low latency is misguided. And users 
don't care about low latency either. What people want is the fastest 
distributed causality system possible. Network I/O delays can, and many 
times do, drive that speed of causality. This speed is not throughput, 
nor PHY rate, nor channel capacity, etc., though each may or may not 
have an effect.

Iperf 2 assumes two things (ignoring the need for advanced test case 
design done by skilled engineers) to root out bloat. First, the network 
under test is highly controlled and all traffic is basically synthetic. 
Second, iperf2 does assume the test designer has synchronized the client 
& server clocks via something like ptp4l, ptpd2, GPS disciplined 
oscillators, etc. Iperf 2 only knows that clocks are sync'd when the 
user sets --trip-times on the client. No --trip-times, no one way delay 
(OWD) related metrics. (--trip-times without sync'd clocks == garbage)
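To make the "garbage" point concrete (a toy model, not iperf 2 code): a 
one-way delay sample is just the receiver's timestamp minus the 
sender's, so any clock offset between the hosts lands directly in the 
measurement.

```python
# Toy model: measured OWD = (rx clock) - (tx clock). Any offset between the
# two clocks adds straight into the sample, which is why --trip-times
# requires synchronized clocks (ptp4l, GPS-disciplined oscillators, etc.).

def measured_owd_ms(true_owd_ms: float, clock_offset_ms: float) -> float:
    return true_owd_ms + clock_offset_ms

true_owd = 1.1                          # ms, a plausible LAN one-way delay
print(measured_owd_ms(true_owd, 0.0))   # synced clocks: correct
print(measured_owd_ms(true_owd, -5.0))  # 5 ms offset: a negative, garbage OWD
```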

There are multiple metric options, e.g. --histograms, which provides the 
full distributions without central limit theorem (CLT) averaging. One 
can also rate-limit write or read traffic per thread with -b, or play 
with the --near-congestion option on the client to weight delays by the 
sampled RTT. -P can be used as well to create concurrent traffic 
threads, as can --full-duplex (iperf 2 is a multi-threaded design). 
Also, the new --working-loads options may be useful. There are python 
scripts in the flows directory that can be used to code up different 
testing methodologies; the dependencies there are ssh support on the 
devices under test and python 3.10+ (for its asyncio) as the controller. 
Visualizations are specifically deferred to users (iperf 2 only presents 
raw data). None of these is discussed further here, but more information 
can be found in the man page: 
https://iperf2.sourceforge.io/iperf-manpage.html (Sadly, writing 
documentation for iperf 2 is almost always last on my todo list.)
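A rough sketch of the controller pattern those flows scripts rely on 
(hypothetical code, not the actual flows library): asyncio drives 
commands on several devices under test concurrently. Here local echo 
commands stand in for real "ssh &lt;dut&gt; iperf ..." invocations.

```python
import asyncio

# Hypothetical sketch of an asyncio test controller in the style of the
# scripts in iperf 2's flows directory. Local echo commands stand in for
# real "ssh <dut> iperf ..." invocations against the devices under test.

async def run_on_dut(cmd: str) -> str:
    proc = await asyncio.create_subprocess_shell(
        cmd, stdout=asyncio.subprocess.PIPE)
    out, _ = await proc.communicate()
    return out.decode().strip()

async def main() -> list:
    # Launch the per-DUT commands concurrently and gather their output.
    cmds = [f"echo dut{i}: iperf run done" for i in range(3)]
    return await asyncio.gather(*(run_on_dut(c) for c in cmds))

results = asyncio.run(main())
for line in results:
    print(line)
```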

Also, with all the attention on mitigating congested queues, there is 
now a lot of software trying to do "fancy queueing." Sadly, we find 
bloat mitigations to have intermittent failures, so a single, up-front 
design test around bloat is not sufficient across time, device and 
network. That's part of the reason why I'm skeptical about user-level 
tests too. They might pass one moment and fail the next. We find 
software and devices have to be constantly tested, as human engineers 
break things at times without knowing it. Life as a T&M engineer, so to 
speak, is an always-on type of job.

The simplest way to measure bloat is to use an old TCP CCA, e.g. 
reno/cubic. First, see what's available and allowed, and what the 
default is:

[rjmcmahon at ryzen3950 iperf2-code]$ sysctl -a | grep congestion_control | 
grep tcp
net.ipv4.tcp_allowed_congestion_control = reno cubic bbr
net.ipv4.tcp_available_congestion_control = reno cubic bbr
net.ipv4.tcp_congestion_control = cubic
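The same values can be read programmatically from /proc, which is handy 
inside a test controller (a small convenience sketch; the paths are the 
standard Linux sysctl locations, with a fallback for non-Linux hosts):

```python
import os

# Read the TCP congestion-control sysctls straight from /proc. These are
# the standard Linux paths; on hosts without procfs we return "unknown".

def read_sysctl(name: str) -> str:
    path = "/proc/sys/" + name.replace(".", "/")
    if not os.path.exists(path):
        return "unknown"
    with open(path) as f:
        return f.read().strip()

for key in ("net.ipv4.tcp_congestion_control",
            "net.ipv4.tcp_available_congestion_control"):
    print(key, "=", read_sysctl(key))
```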

Let's first try the default cubic with a single TCP flow using 
one-second sampling over two 10G NICs on the same 10G switch. Server 
output is shown first since it has the inP metric (here about 1.26 
MBytes). Note: TCP_NOTSENT_LOWAT is now set by default when --trip-times 
is used. One can use --tcp-write-prefetch to affect this, including 
disabling it. The "Event based writes" message on the client shows this.

[root at rjm-nas rjmcmahon]# iperf -s -i 1 -e -B 0.0.0.0%enp2s0
------------------------------------------------------------
Server listening on TCP port 5001 with pid 23097
Binding to local address 0.0.0.0 and iface enp2s0
Read buffer size:  128 KByte (Dist bin width=16.0 KByte)
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.58%enp2s0 port 5001 connected with 192.168.1.69 
port 52850 (trip-times) (sock=4) (peer 2.1.9-rc2) 
(icwnd/mss/irtt=14/1448/194) on 2023-02-20 12:38:18.348 (PST)
[ ID] Interval        Transfer    Bandwidth    Burst Latency 
avg/min/max/stdev (cnt/size) inP NetPwr  Reads=Dist
[  1] 0.00-1.00 sec  1.09 GBytes  9.35 Gbits/sec  
1.131/0.536/6.016/0.241 ms (8916/131072) 1.26 MByte 1033105  
23339=3542:3704:3647:3622:7106:1357:238:123
[  1] 1.00-2.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.128/0.941/1.224/0.034 ms (8978/131076) 1.27 MByte 1042945  
22282=3204:3388:3467:3311:5251:2611:887:163
[  1] 2.00-3.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.120/0.948/1.323/0.036 ms (8979/131068) 1.26 MByte 1050406  
22334=3261:3324:3449:3391:5772:1945:1005:187
[  1] 3.00-4.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.121/0.962/1.217/0.034 ms (8978/131070) 1.26 MByte 1049555  
23457=3554:3669:3684:3645:7198:1284:344:79
[  1] 4.00-5.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.116/0.942/1.246/0.034 ms (8978/131079) 1.25 MByte 1054966  
23884=3641:3810:3857:3779:8029:449:292:27
[  1] 5.00-6.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.115/0.957/1.227/0.035 ms (8979/131064) 1.25 MByte 1055858  
22756=3361:3476:3544:3446:6247:1724:812:146
[  1] 6.00-7.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.119/0.967/1.213/0.033 ms (8978/131074) 1.26 MByte 1051938  
23580=3620:3683:3724:3648:6672:2048:163:22
[  1] 7.00-8.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.116/0.962/1.225/0.033 ms (8978/131081) 1.25 MByte 1054253  
23710=3645:3703:3760:3732:7402:1178:243:47
[  1] 8.00-9.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.117/0.951/1.229/0.034 ms (8979/131061) 1.25 MByte 1053809  
22917=3464:3467:3521:3551:6154:2069:633:58
[  1] 9.00-10.00 sec  1.10 GBytes  9.41 Gbits/sec  
1.111/0.934/1.296/0.033 ms (8978/131078) 1.25 MByte 1059127  
22703=3336:3477:3499:3490:5961:2084:759:97
[  1] 0.00-10.00 sec  11.0 GBytes  9.41 Gbits/sec  
1.119/0.536/6.016/0.083 ms (89734/131072) 1.26 MByte 1050568  
230995=34633:35707:36156:35621:65803:16750:5376:

[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 
--trip-times
------------------------------------------------------------
Client connecting to 192.168.1.58, TCP port 5001 with pid 1063336 via 
enp4s0 (1 flows)
Write buffer size: 131072 Byte
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
Event based writes (pending queue watermark at 16384 bytes)
------------------------------------------------------------
[  1] local 192.168.1.69%enp4s0 port 52850 connected with 192.168.1.58 
port 5001 (prefetch=16384) (trip-times) (sock=3) 
(icwnd/mss/irtt=14/1448/304) (ct=0.35 ms) on 2023-02-20 12:38:18.347 
(PST)
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry     
Cwnd/RTT(var)        NetPwr
[  1] 0.00-1.00 sec  1.09 GBytes  9.36 Gbits/sec  8928/0        25     
1406K/1068(27) us  1095703
[  1] 1.00-2.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         3     
1406K/1092(42) us  1077623
[  1] 2.00-3.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0     
1406K/1077(26) us  1092632
[  1] 3.00-4.00 sec  1.10 GBytes  9.42 Gbits/sec  8979/0         0     
1406K/1064(28) us  1106105
[  1] 4.00-5.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0     
1406K/1065(23) us  1104943
[  1] 5.00-6.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0     
1406K/1072(35) us  1097728
[  1] 6.00-7.00 sec  1.10 GBytes  9.42 Gbits/sec  8979/0         0     
1406K/1065(28) us  1105066
[  1] 7.00-8.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0     
1406K/1057(30) us  1113306
[  1] 8.00-9.00 sec  1.10 GBytes  9.42 Gbits/sec  8979/0        11     
1406K/1077(22) us  1092753
[  1] 9.00-10.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0     
1406K/1052(23) us  1118597
[  1] 0.00-10.01 sec  11.0 GBytes  9.40 Gbits/sec  89734/0        39     
1406K/1057(25) us  1111889
[rjmcmahon at ryzen3950 iperf2-code]$
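As an aside on reading the NetPwr column: it appears to track throughput 
divided by delay, a "network power" style metric. The quick cross-check 
below against the client totals above is my own reverse-engineered 
reading of the units (bytes/sec over RTT in microseconds), not iperf 2's 
source code.

```python
# Sanity check of the NetPwr column against the client totals above.
# Assumption (reverse-engineered, not from the source): NetPwr is
# throughput in bytes/sec divided by the RTT in microseconds.

def net_pwr(throughput_bits_per_sec: float, rtt_us: float) -> float:
    return (throughput_bits_per_sec / 8.0) / rtt_us

print(round(net_pwr(9.40e9, 1057)))  # close to the reported 1111889
```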

Next, with BBR (use -Z or --linux-congestion <algo> to set the TCP 
congestion control algorithm; Linux only), the inP drops to under 600 
KBytes.

[root at rjm-nas rjmcmahon]# iperf -s -i 1 -e -B 0.0.0.0%enp2s0
------------------------------------------------------------
Server listening on TCP port 5001 with pid 23515
Binding to local address 0.0.0.0 and iface enp2s0
Read buffer size:  128 KByte (Dist bin width=16.0 KByte)
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.58%enp2s0 port 5001 connected with 192.168.1.69 
port 32972 (trip-times) (sock=4) (peer 2.1.9-rc2) 
(icwnd/mss/irtt=14/1448/191) on 2023-02-20 12:51:12.258 (PST)
[ ID] Interval        Transfer    Bandwidth    Burst Latency 
avg/min/max/stdev (cnt/size) inP NetPwr  Reads=Dist
[  1] 0.00-1.00 sec  1.09 GBytes  9.37 Gbits/sec  
0.520/0.312/5.681/0.321 ms (8939/131074)  596 KByte 2251265  
22481=3223:3446:3546:3432:5649:2147:914:124
[  1] 1.00-2.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.488/0.322/0.630/0.036 ms (8978/131074)  561 KByte 2409868  
23288=3487:3672:3610:3610:6525:1980:392:12
[  1] 2.00-3.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.488/0.210/1.114/0.043 ms (8972/131071)  560 KByte 2409679  
23538=3567:3744:3653:3740:7173:1167:431:63
[  1] 3.00-4.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.497/0.339/0.617/0.038 ms (8978/131077)  572 KByte 2365971  
22509=3238:3455:3400:3479:5652:2315:927:43
[  1] 4.00-5.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.496/0.326/0.642/0.040 ms (8979/131066)  570 KByte 2371488  
22116=3154:3348:3428:3239:5002:2704:1099:142
[  1] 5.00-6.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.489/0.318/0.689/0.039 ms (8978/131071)  562 KByte 2405955  
22742=3400:3438:3470:3472:6117:2103:709:33
[  1] 6.00-7.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.483/0.320/0.601/0.035 ms (8978/131073)  555 KByte 2437891  
23678=3641:3721:3671:3680:7201:1752:10:2
[  1] 7.00-8.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.490/0.329/0.643/0.039 ms (8979/131067)  563 KByte 2402744  
23006=3428:3584:3533:3603:6417:1527:794:120
[  1] 8.00-9.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.488/0.250/2.262/0.085 ms (8977/131085)  561 KByte 2412134  
23646=3621:3774:3694:3686:6832:1813:137:89
[  1] 9.00-10.00 sec  1.10 GBytes  9.41 Gbits/sec  
0.485/0.250/0.743/0.037 ms (8979/131057)  557 KByte 2427710  
23415=3546:3645:3638:3669:7168:1374:362:13
[  1] 0.00-10.00 sec  11.0 GBytes  9.41 Gbits/sec  
0.492/0.210/5.681/0.111 ms (89744/131072)  566 KByte 2388488  
230437=34307:35831:35645:35613:63743:18882:5775:641

[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 
--trip-times -Z bbr
------------------------------------------------------------
Client connecting to 192.168.1.58, TCP port 5001 with pid 1064072 via 
enp4s0 (1 flows)
Write buffer size: 131072 Byte
TCP congestion control set to bbr
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
Event based writes (pending queue watermark at 16384 bytes)
------------------------------------------------------------
[  1] local 192.168.1.69%enp4s0 port 32972 connected with 192.168.1.58 
port 5001 (prefetch=16384) (trip-times) (sock=3) 
(icwnd/mss/irtt=14/1448/265) (ct=0.32 ms) on 2023-02-20 12:51:12.257 
(PST)
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry     
Cwnd/RTT(var)        NetPwr
[  1] 0.00-1.00 sec  1.09 GBytes  9.38 Gbits/sec  8945/0        35      
540K/390(18) us  3006254
[  1] 1.00-2.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0      
528K/409(25) us  2877175
[  1] 2.00-3.00 sec  1.10 GBytes  9.41 Gbits/sec  8972/0        19      
554K/465(35) us  2528985
[  1] 3.00-4.00 sec  1.10 GBytes  9.42 Gbits/sec  8979/0         0      
562K/473(27) us  2488151
[  1] 4.00-5.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0        10      
540K/467(41) us  2519838
[  1] 5.00-6.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0      
531K/416(21) us  2828761
[  1] 6.00-7.00 sec  1.10 GBytes  9.42 Gbits/sec  8979/0         0      
554K/426(22) us  2762665
[  1] 7.00-8.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0      
543K/405(21) us  2905591
[  1] 8.00-9.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0      
526K/433(22) us  2717701
[  1] 9.00-10.00 sec  1.10 GBytes  9.41 Gbits/sec  8978/0         0      
537K/446(20) us  2638485
[  1] 10.00-11.01 sec   128 KBytes  1.04 Mbits/sec  1/0         0      
537K/439(20) us  296
[  1] 0.00-11.01 sec  11.0 GBytes  8.55 Gbits/sec  89744/0        64     
  537K/439(20) us  2434241
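Quantifying the difference between the two runs (simple arithmetic on 
the 0.00-10.00 sec totals reported above):

```python
# Compare queue occupancy (inP) and burst latency across the cubic and
# BBR runs above, using the server's 0.00-10.00 sec totals.

cubic_inp_kbyte, cubic_delay_ms = 1.26 * 1024, 1.119   # 1.26 MByte, 1.119 ms
bbr_inp_kbyte, bbr_delay_ms = 566.0, 0.492             # 566 KByte, 0.492 ms

print(f"inP ratio (cubic/bbr):   {cubic_inp_kbyte / bbr_inp_kbyte:.2f}x")
print(f"delay ratio (cubic/bbr): {cubic_delay_ms / bbr_delay_ms:.2f}x")
```

At essentially the same 9.41 Gbit/s throughput, BBR holds roughly 2.3x 
less data in queue and cuts the mean burst latency by a similar factor, 
which is exactly the bloat signal this method is looking for.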

I hope this helps a bit. There are a lot more possibilities so engineers 
can play with knobs to affect things and see how their network performs.

Bob

> Hi,
> 
>> On 19 Feb 2023, at 23:49, Dave Taht via Rpm 
>> <rpm at lists.bufferbloat.net> wrote:
>> 
>> https://packetpushers.net/podcast/heavy-networking-666-improving-quality-of-experience-with-libreqos/
> 
> I’m a bit lurgy-ridden today so had a listen as it’s nice passive
> content.  I found it good and informative, though somewhat in the weeds
> (for me) after about half way through, but I looked up a few things
> that were brought up and learnt a few useful details, so overall well
> worth the time, thanks.
> 
>> came out yesterday. You'd have to put up with about 8 minutes of my
>> usual rants before we get into where we are today with the project and
>> the problems we are facing. (trying to scale past 80Gbit now) We have
>> at this point validated the behavior of several benchmarks, and are
>> moving towards more fully emulating various RTTs. See
>> https://payne.taht.net and click on run bandwidth test to see how we
>> are moving along. It is so nice to see sawtooths in real time!
> 
> I tried the link and clicked the start test.  I feel I should be able
> to click a stop test button too, but again interesting to see :)
> 
>> Bufferbloat is indeed, the number of the beast.
> 
> I’m in a different world to the residential ISP one that was the focus
> of what you presented, specifically the R&E networks where most users
> are connected via local Ethernet campus networks.  But there will be a
> lot of WiFi of course.
> 
> It would be interesting to gauge to what extent bufferbloat is a
> problem for typical campus users, vs typical residential network
> users.  Is there data on that?  We’re very interested in the new rpm
> (well, rps!) draft and the iperf2 implementation, which we’ve run from
> both home network and campus systems to an iperf2 server on our NREN
> backbone. I think my next question on the iperf2 tool would be the
> methodology to ramp up the testing to see at what point bufferbloat
> is experienced (noting some of your per-hit comments in the podcast).
> 
> Regarding the speeds, we are interested in high speed, large-scale file
> transfers, e.g. for the CERN community, so might (say) typically see
> iperf3 test up to 20-25Gbps single flow or iperf2 (which is much
> better multi-flow) filling a high RTT 100G link with around half a
> dozen flows.  In practice though the CERN transfers are hundreds or
> thousands of flows, each of a few hundred Mbps or a small number of
> Gbps, and the site access networks for the larger facilities are
> 100G-400G.
> 
> In the longest prefix match topic, are there people looking at that
> with white box platforms, open NOSes and P4 type solutions?
> 
> Tim
> _______________________________________________
> Rpm mailing list
> Rpm at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/rpm
