[Make-wifi-fast] [Bloat] TCP_NOTSENT_LOWAT applied to e2e TCP msg latency

Bob McMahon bob.mcmahon at broadcom.com
Tue Oct 26 19:23:35 EDT 2021


I'm confused. I don't see any blocking nor partial writes per the write()
at the app level with TCP_NOTSENT_LOWAT set at 4 bytes. The burst is 40K,
the write size is 4K and the watermark is 4 bytes. There are ten writes per
burst.

The S8 histograms are the times waiting on the select().  The first value
is the bin number (multiplied by 100usec bin width) and second the bin
count. The worst case time is at the end and is timestamped per unix epoch.

The second run is over a controlled WiFi link where a 99.7% point of 4-8ms
for a WiFi TX op arbitration win is in the ballpark. The first is 1G wired
and is in the 600 usec range. (No media arbitration there.)

 [root at localhost iperf2-code]# src/iperf -c 10.19.87.9 --trip-times -i 1 -e
--tcp-write-prefetch 4 -l 4K --burst-size=40K --histograms
WARN: option of --burst-size without --burst-period defaults --burst-period
to 1 second
------------------------------------------------------------
Client connecting to 10.19.87.9, TCP port 5001 with pid 2124 (1 flows)
Write buffer size: 4096 Byte
Bursting: 40.0 KByte every 1.00 seconds
TCP window size: 85.0 KByte (default)
Event based writes (pending queue watermark at 4 bytes)
Enabled select histograms bin-width=0.100 ms, bins=10000
------------------------------------------------------------
[  1] local 10.19.87.10%eth0 port 33166 connected with 10.19.87.9 port 5001
(MSS=1448) (prefetch=4) (trip-times) (sock=3) (ct=0.54 ms) on 2021-10-26
16:07:33 (PDT)
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry
Cwnd/RTT        NetPwr
[  1] 0.00-1.00 sec  40.1 KBytes   329 Kbits/sec  11/0          0
14K/5368 us  8
[  1] 0.00-1.00 sec S8-PDF: bin(w=100us):cnt(10)=1:1,2:5,3:2,4:1,11:1
(5.00/95.00/99.7%=1/11/11,Outliers=0,obl/obu=0/0) (1.089
ms/1635289653.928360)
[  1] 1.00-2.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/569 us  72
[  1] 1.00-2.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,2:1,3:4,4:1,7:1,8:1
(5.00/95.00/99.7%=1/8/8,Outliers=0,obl/obu=0/0) (0.736 ms/1635289654.928088)
[  1] 2.00-3.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/312 us  131
[  1] 2.00-3.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,2:2,3:2,5:2,6:1
(5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.548 ms/1635289655.927776)
[  1] 3.00-4.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/302 us  136
[  1] 3.00-4.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,2:2,3:5,6:1
(5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.584 ms/1635289656.927814)
[  1] 4.00-5.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/316 us  130
[  1] 4.00-5.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,3:2,4:2,5:2,6:1
(5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.572 ms/1635289657.927810)
[  1] 5.00-6.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/253 us  162
[  1] 5.00-6.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,2:2,3:4,5:1
(5.00/95.00/99.7%=1/5/5,Outliers=0,obl/obu=0/0) (0.417 ms/1635289658.927630)
[  1] 6.00-7.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/290 us  141
[  1] 6.00-7.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,3:3,4:3,6:1
(5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.573 ms/1635289659.927771)
[  1] 7.00-8.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/359 us  114
[  1] 7.00-8.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,3:4,4:3,6:1
(5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.570 ms/1635289660.927753)
[  1] 8.00-9.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/349 us  117
[  1] 8.00-9.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,3:5,4:1,7:1
(5.00/95.00/99.7%=1/7/7,Outliers=0,obl/obu=0/0) (0.608 ms/1635289661.927843)
[  1] 9.00-10.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/347 us  118
[  1] 9.00-10.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,2:1,3:5,8:1
(5.00/95.00/99.7%=1/8/8,Outliers=0,obl/obu=0/0) (0.725 ms/1635289662.927861)
[  1] 0.00-10.01 sec   400 KBytes   327 Kbits/sec  102/0          0
14K/1519 us  27
[  1] 0.00-10.01 sec S8(f)-PDF:
bin(w=100us):cnt(100)=1:25,2:13,3:36,4:11,5:5,6:5,7:2,8:2,11:1
(5.00/95.00/99.7%=1/7/11,Outliers=0,obl/obu=0/0) (1.089
ms/1635289653.928360)

[root at localhost iperf2-code]# src/iperf -c 192.168.1.1 --trip-times -i 1 -e
--tcp-write-prefetch 4 -l 4K --burst-size=40K --histograms
WARN: option of --burst-size without --burst-period defaults --burst-period
to 1 second
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001 with pid 2131 (1 flows)
Write buffer size: 4096 Byte
Bursting: 40.0 KByte every 1.00 seconds
TCP window size: 85.0 KByte (default)
Event based writes (pending queue watermark at 4 bytes)
Enabled select histograms bin-width=0.100 ms, bins=10000
------------------------------------------------------------
[  1] local 192.168.1.4%eth1 port 45518 connected with 192.168.1.1 port
5001 (MSS=1448) (prefetch=4) (trip-times) (sock=3) (ct=5.48 ms) on
2021-10-26 16:07:56 (PDT)
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry
Cwnd/RTT        NetPwr
[  1] 0.00-1.00 sec  40.1 KBytes   329 Kbits/sec  11/0          0
14K/10339 us  4
[  1] 0.00-1.00 sec S8-PDF:
bin(w=100us):cnt(10)=1:1,40:1,47:1,49:2,50:3,51:1,60:1
(5.00/95.00/99.7%=1/60/60,Outliers=0,obl/obu=0/0) (5.990
ms/1635289676.802143)
[  1] 1.00-2.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/4853 us  8
[  1] 1.00-2.00 sec S8-PDF:
bin(w=100us):cnt(10)=1:2,38:1,39:1,44:1,45:1,49:1,51:1,52:1,60:1
(5.00/95.00/99.7%=1/60/60,Outliers=0,obl/obu=0/0) (5.937
ms/1635289677.802274)
[  1] 2.00-3.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/4991 us  8
[  1] 2.00-3.00 sec S8-PDF:
bin(w=100us):cnt(10)=1:2,48:1,49:2,50:2,51:1,60:1,64:1
(5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.307
ms/1635289678.794326)
[  1] 3.00-4.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/4610 us  9
[  1] 3.00-4.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,49:3,50:3,56:1,64:1
(5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.362
ms/1635289679.794335)
[  1] 4.00-5.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/5028 us  8
[  1] 4.00-5.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,49:6,59:1,64:1
(5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.367
ms/1635289680.794399)
[  1] 5.00-6.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/5113 us  8
[  1] 5.00-6.00 sec S8-PDF:
bin(w=100us):cnt(10)=1:2,49:3,50:2,58:1,60:1,65:1
(5.00/95.00/99.7%=1/65/65,Outliers=0,obl/obu=0/0) (6.442
ms/1635289681.794392)
[  1] 6.00-7.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/5054 us  8
[  1] 6.00-7.00 sec S8-PDF:
bin(w=100us):cnt(10)=1:2,39:1,49:3,51:1,60:2,64:1
(5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.374
ms/1635289682.794335)
[  1] 7.00-8.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/5138 us  8
[  1] 7.00-8.00 sec S8-PDF:
bin(w=100us):cnt(10)=1:2,39:2,40:1,49:2,50:1,60:1,64:1
(5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.396
ms/1635289683.794338)
[  1] 8.00-9.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/5329 us  8
[  1] 8.00-9.00 sec S8-PDF:
bin(w=100us):cnt(10)=1:2,38:1,45:2,49:1,50:3,63:1
(5.00/95.00/99.7%=1/63/63,Outliers=0,obl/obu=0/0) (6.292
ms/1635289684.794262)
[  1] 9.00-10.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0
14K/5329 us  8
[  1] 9.00-10.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,39:1,49:3,50:3,84:1
(5.00/95.00/99.7%=1/84/84,Outliers=0,obl/obu=0/0) (8.306
ms/1635289685.796315)
[  1] 0.00-10.01 sec   400 KBytes   327 Kbits/sec  102/0          0
14K/6331 us  6
[  1] 0.00-10.01 sec S8(f)-PDF:
bin(w=100us):cnt(100)=1:19,38:2,39:5,40:2,44:1,45:3,47:1,48:1,49:26,50:17,51:4,52:1,56:1,58:1,59:1,60:7,63:1,64:5,65:1,84:1
(5.00/95.00/99.7%=1/64/84,Outliers=0,obl/obu=0/0) (8.306
ms/1635289685.796315)

Bob

On Tue, Oct 26, 2021 at 11:45 AM Christoph Paasch <cpaasch at apple.com> wrote:

> Hello,
>
> > On Oct 25, 2021, at 9:24 PM, Eric Dumazet <eric.dumazet at gmail.com>
> wrote:
> >
> >
> >
> > On 10/25/21 8:11 PM, Stuart Cheshire via Bloat wrote:
> >> On 21 Oct 2021, at 17:51, Bob McMahon via Make-wifi-fast <
> make-wifi-fast at lists.bufferbloat.net> wrote:
> >>
> >>> Hi All,
> >>>
> >>> Sorry for the spam. I'm trying to support a meaningful TCP message
> latency w/iperf 2 from the sender side w/o requiring e2e clock
> synchronization. I thought I'd try to use the TCP_NOTSENT_LOWAT event to
> help with this. It seems that this event goes off when the bytes are in
> flight vs have reached the destination network stack. If that's the case,
> then iperf 2 client (sender) may be able to produce the message latency by
> adding the drain time (write start to TCP_NOTSENT_LOWAT) and the sampled
> RTT.
> >>>
> >>> Does this seem reasonable?
> >>
> >> I’m not 100% sure what you’re asking, but I will try to help.
> >>
> >> When you set TCP_NOTSENT_LOWAT, the TCP implementation won’t report
> your endpoint as writable (e.g., via kqueue or epoll) until less than that
> threshold of data remains unsent. It won’t stop you writing more bytes if
> you want to, up to the socket send buffer size, but it won’t *ask* you for
> more data until the TCP_NOTSENT_LOWAT threshold is reached.
> >
> >
> > When I implemented TCP_NOTSENT_LOWAT back in 2013 [1], I made sure that
> sendmsg() would actually
> > stop feeding more bytes in TCP transmit queue if the current amount of
> unsent bytes
> > was above the threshold.
> >
> > So it looks like Apple implementation is different, based on your
> description ?
>
> Yes, TCP_NOTSENT_LOWAT only impacts the wakeup on iOS/macOS/...
>
> An app can still fill the send-buffer if it does a sendmsg() with a large
> buffer or does repeated calls to sendmsg().
>
> Fur Apple, the goal of TCP_NOTSENT_LOWAT was to allow an app to quickly
> change the data it "scheduled" to send. And thus allow the app to write the
> smallest "logical unit" it has. If that unit is 512KB large, the app is
> allowed to send that.
> For example, in case of video-streaming one may want to skip ahead in the
> video. In that case the app still needs to transmit the remaining parts of
> the previous frame anyways, before it can send the new video frame.
> That's the reason why the Apple implementation allows one to write more
> than just the lowat threshold.
>
>
> That being said, I do think that Linux's way allows for an easier API
> because the app does not need to be careful at how much data it sends after
> an epoll/kqueue wakeup. So, the latency-benefits will be easier to get.
>
>
> Christoph
>
>
>
> > [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36
> >
> > netperf does not use epoll(), but rather a loop over sendmsg().
> >
> > One of the point of TCP_NOTSENT_LOWAT for Google was to be able to
> considerably increase
> > max number of bytes in transmit queues (3rd column of
> /proc/sys/net/ipv4/tcp_wmem)
> > by 10x, allowing for autotune to increase BDP for big RTT flows, this
> without
> > increasing memory needs for flows with small RTT.
> >
> > In other words, the TCP implementation attempts to keep BDP bytes in
> flight + TCP_NOTSENT_LOWAT bytes buffered and ready to go. The BDP of bytes
> in flight is necessary to fill the network pipe and get good throughput.
> The TCP_NOTSENT_LOWAT of bytes buffered and ready to go is provided to give
> the source software some advance notice that the TCP implementation will
> soon be looking for more bytes to send, so that the buffer doesn’t run dry,
> thereby lowering throughput. (The old SO_SNDBUF option conflates both
> “bytes in flight” and “bytes buffered and ready to go” into the same
> number.)
> >>
> >> If you wait for the TCP_NOTSENT_LOWAT notification, write a chunk of n
> bytes of data, and then wait for the next TCP_NOTSENT_LOWAT notification,
> that will tell you roughly how long it took n bytes to depart the machine.
> You won’t know why, though. The bytes could depart the machine in response
> for acks indicating that the same number of bytes have been accepted at the
> receiver. But the bytes can also depart the machine because CWND is
> growing. Of course, both of those things are usually happening at the same
> time.
> >>
> >> How to use TCP_NOTSENT_LOWAT is explained in this video:
> >>
> >> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2199>
> >>
> >> Later in the same video is a two-minute demo (time offset 42:00 to time
> offset 44:00) showing a “before and after” demo illustrating the dramatic
> difference this makes for screen sharing responsiveness.
> >>
> >> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2520>
> >>
> >> Stuart Cheshire
> >> _______________________________________________
> >> Bloat mailing list
> >> Bloat at lists.bufferbloat.net
> >> https://lists.bufferbloat.net/listinfo/bloat
> >>
> > _______________________________________________
> > Bloat mailing list
> > Bloat at lists.bufferbloat.net
> > https://lists.bufferbloat.net/listinfo/bloat
>
>

-- 
This electronic communication and the information and any files transmitted 
with it, or attached to it, are confidential and are intended solely for 
the use of the individual or entity to whom it is addressed and may contain 
information that is confidential, legally privileged, protected by privacy 
laws, or otherwise restricted from disclosure to anyone else. If you are 
not the intended recipient or the person responsible for delivering the 
e-mail to the intended recipient, you are hereby notified that any use, 
copying, distributing, dissemination, forwarding, printing, or copying of 
this e-mail is strictly prohibited. If you received this e-mail in error, 
please return the e-mail to the sender, delete it from your computer, and 
destroy any printed copy of it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.bufferbloat.net/pipermail/make-wifi-fast/attachments/20211026/6d7c2c9f/attachment-0001.html>


More information about the Make-wifi-fast mailing list