[Rpm] in case anyone here digs podcasts
rjmcmahon
rjmcmahon at rjmcmahon.com
Mon Feb 20 16:13:38 EST 2023
Hi Tim,
I can respond to iperf 2 questions around bufferbloat. My apologies to
all for the longish email. (Feel free to contact me about a webex or
zoom discussion if that would be helpful to any of your engineering
teams.)
There are basically three things: metrics, mechanisms and
methodologies. I'll touch on each a bit here, but it won't be
comprehensive in any of these domains.
First, we believe bufferbloat is a design flaw and, in that context, it
needs to be detected and rooted out way ahead of shipping products. (It
also needs to be monitored for regressions.) Second, bloat is measured
in units of displaced memory: bytes for TCP or packets for UDP. It's
really not a unit of time (nor its inverse), though time deltas can be
used to calculate a sampled value via Little's law: "the average number
of items in a stationary queuing system, based on the average waiting
time of an item within a system and the average number of items
arriving at the system per unit of time." (Iperf 2 calls this metric inP.)
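For example (illustrative numbers only, roughly matching the cubic run
shown further below): at an average arrival rate of about 9.4 Gbit/s
with an average one-way delay of about 1.1 ms, Little's law gives
inP ≈ 9.4e9 bit/s × 1.1e-3 s ≈ 1.0e7 bits, i.e. about 1.3 MBytes of
data standing in the end-to-end queues.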
Next, we feel that conflating bloat and low latency is misguided. Users
don't really care about low latency either. What people want is the
fastest distributed causality system possible. Network I/O delays can,
and many times do, drive that speed of causality. This speed is not
throughput, nor PHY rate, nor channel capacity, etc., though each of
these may or may not have an effect.
Iperf 2 assumes two things (ignoring the need for advanced test case
design done by skilled engineers) to root out bloat. First, the network
under test is highly controlled and all traffic is basically synthetic.
Second, iperf 2 assumes the test designer has synchronized the client
and server clocks via something like ptp4l, ptpd2, GPS-disciplined
oscillators, etc. Iperf 2 only knows that the clocks are sync'd when the
user sets --trip-times on the client. No --trip-times, no one-way delay
(OWD) related metrics. (--trip-times without sync'd clocks == garbage.)
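For example, with linuxptp one might run something along these lines on
each host, assuming hardware timestamping on the NIC and a PTP
grandmaster reachable on the segment; this is only a sketch, and phc2sys
here is just one possible way to steer the system clock:

ptp4l -i enp4s0 -m &                         # discipline the NIC's PTP hardware clock (enp2s0 on the server)
phc2sys -s enp4s0 -c CLOCK_REALTIME -w -m &  # steer the system clock from that hardware clock

The exact setup will vary with the hardware and the timing source; the
point is that both ends must share a common clock before --trip-times
means anything.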
There are multiple metric options, e.g. --histograms, which provides
the full distributions without the central limit theorem (CLT)
averaging. One can also rate-limit a traffic thread's writes or reads
with -b, or play with the --near-congestion option on the client to
weight delays based on the sampled RTT. The -P option can be used as
well, which will create concurrent traffic threads, as can
--full-duplex (iperf 2 is a multi-threaded design.) Also, the new
--working-loads options may be useful. There are python scripts in the
flows directory that can be used to code up different testing
methodologies; the dependencies there are ssh support on the devices
under test and Python 3.10+ (for its asyncio) as the controller.
Visualizations are specifically deferred to users (iperf 2 only
presents raw data.) None of these are discussed further here, but more
information can be found on the man page:
https://iperf2.sourceforge.io/iperf-manpage.html (Sadly, writing
documentation for iperf 2 is almost always last on my todo list.)
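As a rough sketch, a few illustrative invocations of these options
(addresses and interfaces follow the runs shown below; which side an
option belongs on, and any optional arguments such as the weight for
--near-congestion, are covered on the man page):

[root at rjm-nas rjmcmahon]# iperf -s -i 1 -e --histograms
[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times -b 2g
[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times --near-congestion
[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times -P 4
[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times --full-duplex

The server-side --histograms output pairs with --trip-times set on the
client.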
Also, with all the attention on mitigating congested queues, there is
now a lot of software trying to do "fancy queueing." Sadly, we find
bloat mitigations to have intermittent failures, so a single, up-front
design test around bloat is not sufficient across time, device and
network. That's part of the reason why I'm skeptical about user-level
tests too. They might pass one moment and fail the next. We find
software and devices have to be constantly tested, as human engineers
break things at times without knowing it. Life as a T&M (test and
measurement) engineer, so to speak, is an always-on type of job.
The simplest way to measure bloat is to use an old TCP CCA, e.g.
reno or cubic. First, see what's available and allowed, and what the
default is:
[rjmcmahon at ryzen3950 iperf2-code]$ sysctl -a | grep congestion_control |
grep tcp
net.ipv4.tcp_allowed_congestion_control = reno cubic bbr
net.ipv4.tcp_available_congestion_control = reno cubic bbr
net.ipv4.tcp_congestion_control = cubic
Let's first try default cubic with a single TCP flow using one-second
sampling over two 10G NICs on the same 10G switch. Server output is
shown first since it has the inP metric (which is about 1.26 MBytes).
Note: TCP_NOTSENT_LOWAT is now set by default when --trip-times is
used. One can use --tcp-write-prefetch to adjust the watermark,
including disabling it. The 'Event based writes' message in the client
output shows this.
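As a sketch only, an invocation like the following would raise the
watermark above the 16384 bytes shown in the client output below (the
accepted values, and the syntax for disabling the prefetch, are on the
man page):

[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1 --trip-times --tcp-write-prefetch 131072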
[root at rjm-nas rjmcmahon]# iperf -s -i 1 -e -B 0.0.0.0%enp2s0
------------------------------------------------------------
Server listening on TCP port 5001 with pid 23097
Binding to local address 0.0.0.0 and iface enp2s0
Read buffer size: 128 KByte (Dist bin width=16.0 KByte)
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.58%enp2s0 port 5001 connected with 192.168.1.69
port 52850 (trip-times) (sock=4) (peer 2.1.9-rc2)
(icwnd/mss/irtt=14/1448/194) on 2023-02-20 12:38:18.348 (PST)
[ ID] Interval Transfer Bandwidth Burst Latency
avg/min/max/stdev (cnt/size) inP NetPwr Reads=Dist
[ 1] 0.00-1.00 sec 1.09 GBytes 9.35 Gbits/sec
1.131/0.536/6.016/0.241 ms (8916/131072) 1.26 MByte 1033105
23339=3542:3704:3647:3622:7106:1357:238:123
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec
1.128/0.941/1.224/0.034 ms (8978/131076) 1.27 MByte 1042945
22282=3204:3388:3467:3311:5251:2611:887:163
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec
1.120/0.948/1.323/0.036 ms (8979/131068) 1.26 MByte 1050406
22334=3261:3324:3449:3391:5772:1945:1005:187
[ 1] 3.00-4.00 sec 1.10 GBytes 9.41 Gbits/sec
1.121/0.962/1.217/0.034 ms (8978/131070) 1.26 MByte 1049555
23457=3554:3669:3684:3645:7198:1284:344:79
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec
1.116/0.942/1.246/0.034 ms (8978/131079) 1.25 MByte 1054966
23884=3641:3810:3857:3779:8029:449:292:27
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec
1.115/0.957/1.227/0.035 ms (8979/131064) 1.25 MByte 1055858
22756=3361:3476:3544:3446:6247:1724:812:146
[ 1] 6.00-7.00 sec 1.10 GBytes 9.41 Gbits/sec
1.119/0.967/1.213/0.033 ms (8978/131074) 1.26 MByte 1051938
23580=3620:3683:3724:3648:6672:2048:163:22
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec
1.116/0.962/1.225/0.033 ms (8978/131081) 1.25 MByte 1054253
23710=3645:3703:3760:3732:7402:1178:243:47
[ 1] 8.00-9.00 sec 1.10 GBytes 9.41 Gbits/sec
1.117/0.951/1.229/0.034 ms (8979/131061) 1.25 MByte 1053809
22917=3464:3467:3521:3551:6154:2069:633:58
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec
1.111/0.934/1.296/0.033 ms (8978/131078) 1.25 MByte 1059127
22703=3336:3477:3499:3490:5961:2084:759:97
[ 1] 0.00-10.00 sec 11.0 GBytes 9.41 Gbits/sec
1.119/0.536/6.016/0.083 ms (89734/131072) 1.26 MByte 1050568
230995=34633:35707:36156:35621:65803:16750:5376:
[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1
--trip-times
------------------------------------------------------------
Client connecting to 192.168.1.58, TCP port 5001 with pid 1063336 via
enp4s0 (1 flows)
Write buffer size: 131072 Byte
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
Event based writes (pending queue watermark at 16384 bytes)
------------------------------------------------------------
[ 1] local 192.168.1.69%enp4s0 port 52850 connected with 192.168.1.58
port 5001 (prefetch=16384) (trip-times) (sock=3)
(icwnd/mss/irtt=14/1448/304) (ct=0.35 ms) on 2023-02-20 12:38:18.347
(PST)
[ ID] Interval Transfer Bandwidth Write/Err Rtry
Cwnd/RTT(var) NetPwr
[ 1] 0.00-1.00 sec 1.09 GBytes 9.36 Gbits/sec 8928/0 25
1406K/1068(27) us 1095703
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 3
1406K/1092(42) us 1077623
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
1406K/1077(26) us 1092632
[ 1] 3.00-4.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0
1406K/1064(28) us 1106105
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
1406K/1065(23) us 1104943
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
1406K/1072(35) us 1097728
[ 1] 6.00-7.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0
1406K/1065(28) us 1105066
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
1406K/1057(30) us 1113306
[ 1] 8.00-9.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 11
1406K/1077(22) us 1092753
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
1406K/1052(23) us 1118597
[ 1] 0.00-10.01 sec 11.0 GBytes 9.40 Gbits/sec 89734/0 39
1406K/1057(25) us 1111889
[rjmcmahon at ryzen3950 iperf2-code]$
Next, with BBR (use -Z <algo> or --linux-congestion <algo> to set the
TCP congestion control algorithm; Linux only), the inP drops to under
600 KBytes.
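If one prefers to change the system default rather than select the CCA
per run with -Z, something like this should also work, given bbr
appears in the allowed list shown earlier:

[rjmcmahon at ryzen3950 iperf2-code]$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr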
[root at rjm-nas rjmcmahon]# iperf -s -i 1 -e -B 0.0.0.0%enp2s0
------------------------------------------------------------
Server listening on TCP port 5001 with pid 23515
Binding to local address 0.0.0.0 and iface enp2s0
Read buffer size: 128 KByte (Dist bin width=16.0 KByte)
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.58%enp2s0 port 5001 connected with 192.168.1.69
port 32972 (trip-times) (sock=4) (peer 2.1.9-rc2)
(icwnd/mss/irtt=14/1448/191) on 2023-02-20 12:51:12.258 (PST)
[ ID] Interval Transfer Bandwidth Burst Latency
avg/min/max/stdev (cnt/size) inP NetPwr Reads=Dist
[ 1] 0.00-1.00 sec 1.09 GBytes 9.37 Gbits/sec
0.520/0.312/5.681/0.321 ms (8939/131074) 596 KByte 2251265
22481=3223:3446:3546:3432:5649:2147:914:124
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec
0.488/0.322/0.630/0.036 ms (8978/131074) 561 KByte 2409868
23288=3487:3672:3610:3610:6525:1980:392:12
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec
0.488/0.210/1.114/0.043 ms (8972/131071) 560 KByte 2409679
23538=3567:3744:3653:3740:7173:1167:431:63
[ 1] 3.00-4.00 sec 1.10 GBytes 9.41 Gbits/sec
0.497/0.339/0.617/0.038 ms (8978/131077) 572 KByte 2365971
22509=3238:3455:3400:3479:5652:2315:927:43
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec
0.496/0.326/0.642/0.040 ms (8979/131066) 570 KByte 2371488
22116=3154:3348:3428:3239:5002:2704:1099:142
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec
0.489/0.318/0.689/0.039 ms (8978/131071) 562 KByte 2405955
22742=3400:3438:3470:3472:6117:2103:709:33
[ 1] 6.00-7.00 sec 1.10 GBytes 9.41 Gbits/sec
0.483/0.320/0.601/0.035 ms (8978/131073) 555 KByte 2437891
23678=3641:3721:3671:3680:7201:1752:10:2
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec
0.490/0.329/0.643/0.039 ms (8979/131067) 563 KByte 2402744
23006=3428:3584:3533:3603:6417:1527:794:120
[ 1] 8.00-9.00 sec 1.10 GBytes 9.41 Gbits/sec
0.488/0.250/2.262/0.085 ms (8977/131085) 561 KByte 2412134
23646=3621:3774:3694:3686:6832:1813:137:89
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec
0.485/0.250/0.743/0.037 ms (8979/131057) 557 KByte 2427710
23415=3546:3645:3638:3669:7168:1374:362:13
[ 1] 0.00-10.00 sec 11.0 GBytes 9.41 Gbits/sec
0.492/0.210/5.681/0.111 ms (89744/131072) 566 KByte 2388488
230437=34307:35831:35645:35613:63743:18882:5775:641
[rjmcmahon at ryzen3950 iperf2-code]$ iperf -c 192.168.1.58%enp4s0 -i 1
--trip-times -Z bbr
------------------------------------------------------------
Client connecting to 192.168.1.58, TCP port 5001 with pid 1064072 via
enp4s0 (1 flows)
Write buffer size: 131072 Byte
TCP congestion control set to bbr
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
Event based writes (pending queue watermark at 16384 bytes)
------------------------------------------------------------
[ 1] local 192.168.1.69%enp4s0 port 32972 connected with 192.168.1.58
port 5001 (prefetch=16384) (trip-times) (sock=3)
(icwnd/mss/irtt=14/1448/265) (ct=0.32 ms) on 2023-02-20 12:51:12.257
(PST)
[ ID] Interval Transfer Bandwidth Write/Err Rtry
Cwnd/RTT(var) NetPwr
[ 1] 0.00-1.00 sec 1.09 GBytes 9.38 Gbits/sec 8945/0 35
540K/390(18) us 3006254
[ 1] 1.00-2.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
528K/409(25) us 2877175
[ 1] 2.00-3.00 sec 1.10 GBytes 9.41 Gbits/sec 8972/0 19
554K/465(35) us 2528985
[ 1] 3.00-4.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0
562K/473(27) us 2488151
[ 1] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 10
540K/467(41) us 2519838
[ 1] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
531K/416(21) us 2828761
[ 1] 6.00-7.00 sec 1.10 GBytes 9.42 Gbits/sec 8979/0 0
554K/426(22) us 2762665
[ 1] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
543K/405(21) us 2905591
[ 1] 8.00-9.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
526K/433(22) us 2717701
[ 1] 9.00-10.00 sec 1.10 GBytes 9.41 Gbits/sec 8978/0 0
537K/446(20) us 2638485
[ 1] 10.00-11.01 sec 128 KBytes 1.04 Mbits/sec 1/0 0
537K/439(20) us 296
[ 1] 0.00-11.01 sec 11.0 GBytes 8.55 Gbits/sec 89744/0 64
537K/439(20) us 2434241
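As a sanity check against Little's law: 9.41 Gbit/s × 0.49 ms is about
0.58 MBytes, consistent with the ~566 KByte average inP the server
reports, and less than half the standing queue of the cubic run above.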
I hope this helps a bit. There are a lot more possibilities, so
engineers can play with the knobs to affect things and see how their
networks perform.
Bob
> Hi,
>
>> On 19 Feb 2023, at 23:49, Dave Taht via Rpm
>> <rpm at lists.bufferbloat.net> wrote:
>>
>> https://packetpushers.net/podcast/heavy-networking-666-improving-quality-of-experience-with-libreqos/
>
> I’m a bit lurgy-ridden today so had a listen as it’s nice passive
> content. I found it good and informative, though somewhat in the weeds
> (for me) after about halfway through, but I looked up a few things
> that were brought up and learnt a few useful details, so overall well
> worth the time, thanks.
>
>> came out yesterday. You'd have to put up with about 8 minutes of my
>> usual rants before we get into where we are today with the project and
>> the problems we are facing. (trying to scale past 80Gbit now) We have
>> at this point validated the behavior of several benchmarks, and are
>> moving towards more fully emulating various RTTs. See
>> https://payne.taht.net and click on run bandwidth test to see how we
>> are moving along. It is so nice to see sawtooths in real time!
>
> I tried the link and clicked the start test. I feel I should be able
> to click a stop test button too, but again interesting to see :)
>
>> Bufferbloat is indeed, the number of the beast.
>
> I’m in a different world to the residential ISP one that was the focus
> of what you presented, specifically he R&E networks where most users
> are connected via local Ethernet campus networks. But there will be a
> lot of WiFi of course.
>
> It would be interesting to gauge to what extent bufferbloat is a
> problem for typical campus users, vs typical residential network
> users. Is there data on that? We’re very interested in the new rpm
> (well, rps!) draft and the iperf2 implementation, which we’ve run from
> both home network and campus systems to an iperf2 server on our NREN
> backbone. I think my next question on the iperf2 tool would be the
> methodology to ramp up the testing to see at what point bufferbloat
> is experienced (noting some of your per-hit comments in the podcast).
>
> Regarding the speeds, we are interested in high-speed, large-scale
> file transfers, e.g. for the CERN community, so might (say) typically see
> iperf3 test up to 20-25Gbps single flow or iperf2 (which is much
> better multi-flow) filling a high RTT 100G link with around half a
> dozen flows. In practice though the CERN transfers are hundreds or
> thousands of flows, each of a few hundred Mbps or a small number of
> Gbps, and the site access networks for the larger facilities are
> 100G-400G.
>
> On the longest-prefix-match topic, are there people looking at that
> with white box platforms, open NOSes and P4 type solutions?
>
> Tim