[Bloat] using tcp_notsent_lowat in various apps?

Fri Jun 19 03:10:11 EDT 2015

On Fri, 2015-06-19 at 07:07 +0300, Jonathan Morton wrote:
> > On 19 Jun, 2015, at 05:47, Juliusz Chroboczek
> > <jch at pps.univ-paris-diderot.fr> wrote:
> > 
> >> I am curious if anyone has tried this new socket option in
> >> appropriate apps,
> > 
> > I'm probably confused, but I don't see how this is different from
> > setting SO_SNDBUF.  I realise that's lower in the stack, but it should
> > have a similar effect, shouldn't it?
> 
> What I understand of it is:
> 
> Reducing SO_SNDBUF causes send() to block until all of the data can be
> accommodated in the smaller buffer.  But select() will return the
> socket as soon as there is *any* space in that buffer to stuff data
> into.
> 
> TCP_NOTSENT_LOWAT causes select() to not return the socket until the
> data in the buffer falls below the mark, which may (and should) be a
> mere fraction of the total buffer size.
> 
> It’s a subtle difference, but worth noting.  The two options
> effectively apply to completely different system calls.
> 
> You could use both in the same program, but generally SO_SNDBUF would
> be set to a higher value than the low water mark.  This allows a
> complete chunk of data to be stuffed into the buffer, and the
> application can then spend more time waiting in select() - where it is
> in a better position to make control decisions which are likely to be
> latency sensitive, and it can service other sockets which might be
> draining or filling at a different rate.
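
As a rough illustration of the pattern described above, here is a minimal
Python sketch in which the application waits in select() and only writes
once the socket is reported writable. The helper name send_when_writable
and its timeout are made up for this example. With TCP_NOTSENT_LOWAT set on
a TCP socket, the kernel would report writability only once the unsent data
falls below the mark; the demo at the bottom uses a local socketpair just
so it runs anywhere.

```python
import select
import socket


def send_when_writable(sock, chunks, timeout=1.0):
    """Send each chunk only after select() reports the socket writable.

    Returns the total number of bytes handed to the kernel.
    """
    sock.setblocking(False)
    total = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:
            # Wait here; this is where the app can also service other sockets.
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError("socket did not become writable")
            try:
                sent = sock.send(view)
            except BlockingIOError:
                continue  # buffer filled up in the meantime; wait again
            total += sent
            view = view[sent:]
    return total


if __name__ == "__main__":
    a, b = socket.socketpair()
    try:
        print(send_when_writable(a, [b"hello ", b"world"]))
    finally:
        a.close()
        b.close()
```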

SO_SNDBUF needs to be large enough to cope with losses/repairs.

If the flow has no losses, SNDBUF needs to be at least the BDP :
 ( cwnd * MSS ), i.e. one rtt worth of data at rate ( cwnd * MSS / rtt )

If a packet can be lost once, then SNDBUF needs to be :
2 * (cwnd * MSS)

If a packet can be lost twice, then we need
3 * (cwnd * MSS)

... etc ...
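
The progression above can be sketched as a small calculation. This is only
an illustration, assuming the BDP is approximated by the bytes in flight
(cwnd segments times MSS); the function name sndbuf_needed and the example
numbers (100 segments, MSS of 1448 bytes) are invented for the sketch.

```python
def sndbuf_needed(cwnd_segments, mss_bytes, max_losses=0):
    """Send buffer size (bytes) if a packet may be lost up to max_losses times.

    Assumes one window of data (cwnd * MSS) is in flight per RTT and that
    each tolerated loss requires room for one more window.
    """
    if cwnd_segments < 0 or mss_bytes < 0 or max_losses < 0:
        raise ValueError("arguments must be non-negative")
    bdp_bytes = cwnd_segments * mss_bytes  # data in flight over one RTT
    return (max_losses + 1) * bdp_bytes


if __name__ == "__main__":
    for losses in range(3):
        print(losses, sndbuf_needed(100, 1448, losses))
```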

But really the TCP write queue is logically split into 2 different
parts :

[1] Already sent data, waiting for ACK. This one can be arbitrarily big,
depending on network conditions.

[2] Not sent data.

Part [1] is hard to size, because it depends on losses, which cannot be
predicted.

Part [2] is easy to size, if we have some reasonable way to schedule
the application to provide additional data (write()/send()) when it is
empty.

SO_SNDBUF sizes the overall TCP write queue ([1] + [2])

NOTSENT_LOWAT, in contrast, restricts [2] only, which avoids filling the
write queue when/if no drops are actually seen.
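
For illustration, a minimal Python sketch of setting both limits on one
socket. It assumes Linux; the helper name configure_send_limits and the
sizes used are arbitrary choices for the example, and the fallback value 25
is the Linux definition of TCP_NOTSENT_LOWAT, used only if the Python build
does not export the constant.

```python
import socket

# Assumption: Linux. 25 is TCP_NOTSENT_LOWAT in <linux/tcp.h>, used only
# when this Python build does not export the constant.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def configure_send_limits(sock, sndbuf_bytes, notsent_lowat_bytes):
    """Cap the whole write queue and, separately, its not-yet-sent part.

    Returns the (SO_SNDBUF, TCP_NOTSENT_LOWAT) values the kernel reports.
    """
    # [1] + [2]: overall write queue, sent-but-unacked plus unsent data.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    # [2] only: writable is reported once unsent bytes drop below this mark.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, notsent_lowat_bytes)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT),
    )


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Illustrative values only: a roomy buffer, a small low-water mark.
        print(configure_send_limits(s, 256 * 1024, 16 * 1024))
    finally:
        s.close()
```

On Linux the SO_SNDBUF value read back is normally larger than the one
requested, because the kernel doubles it to leave room for bookkeeping
overhead.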