[Bloat] Measuring latency-under-load consistently

Rick Jones rick.jones2 at hp.com
Fri Mar 11 20:09:57 EST 2011


On Sat, 2011-03-12 at 02:45 +0200, Jonathan Morton wrote:
> On 12 Mar, 2011, at 2:13 am, Rick Jones wrote:
> 
> > On Sat, 2011-03-12 at 02:00 +0200, Jonathan Morton wrote:
> >> I'm currently resurrecting my socket-programming skills (last used
> >> almost 10 years ago when IPv6 really *was* experimental) in the hope
> >> of making a usable latency-under-load tester.  This could be run in
> >> server-mode on one host, and then as a client on another host could be
> >> pointed at the server, followed by several minutes of churning and
> >> some nice round numbers.
> >> 
> >> It would need to make multiple TCP connections simultaneously, one of
> >> which would be used to measure latency (using NODELAY marked sockets),
> >> and one or more others used to load the network and measure goodput. 
> >> It would automatically determine how long to run in order to get a
> >> reliable result that can't easily be challenged by (eg.) an ISP.
> > 
> > Why would it require multiple TCP connections?  Only if none of the
> > connections have data flowing in the other direction, and your latency
> > measuring one would need that anyway. 
> 
> Because the latency within a bulk flow is not as interesting as the
>  latency experienced by interactive or realtime flows sharing the same
>  link as a bulk flow (or three).  In the presence of a re-ordering AQM
>  scheme (trivially, SFQ) the two are not the same.

Good point, something I'd not considered.
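
In case it is useful to anyone following along, the connection split
described earlier - one small-message probe connection with TCP_NODELAY
plus one or more bulk connections left alone - is simple enough at the
sockets level.  A rough, untested sketch (IPv4 only, most error
handling omitted):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Connect a TCP socket to "addr"; optionally disable Nagle so the
     * small probe messages go out immediately. */
    static int connect_tcp(const struct sockaddr_in *addr, int nodelay)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0)
            return -1;
        if (nodelay) {
            int one = 1;
            setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        }
        if (connect(s, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
            close(s);
            return -1;
        }
        return s;
    }

The probe connection would be connect_tcp(&addr, 1) and each bulk
connection connect_tcp(&addr, 0).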

> Suppose for example that you're downloading the latest Ubuntu DVD and
>  you suddenly think of something to look up on Wikipedia.  With the
>  30-second latencies I have personally experienced on some non-AQM
>  links under load, that is intolerably slow.  With something as simple
>  as SFQ on that same queue it would be considerably better because the
>  new packets could bypass the queue associated with the old flow, but
>  measuring only the old flow wouldn't show that.
> 
> Note that the occasional packet losses on a plain SFQ drop-tail queue
>  would still show extremely long maximum inter-arrival delays on the
>  bulk flow, and this is captured by the third metric (flow smoothness).
> 
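
Capturing that on the receive side of the bulk connection is
straightforward; one possibility (an untested sketch - exactly what to
report as "smoothness" is of course your call) is simply to track the
largest gap between successive recv() completions:

    #include <sys/socket.h>
    #include <time.h>

    static double elapsed_s(const struct timespec *a,
                            const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
    }

    /* Drain the bulk connection and return the worst inter-arrival
     * gap, in seconds, seen between successive recv() completions. */
    static double max_recv_gap(int fd)
    {
        char buf[65536];
        struct timespec prev, now;
        double worst = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &prev);
        while (recv(fd, buf, sizeof(buf), 0) > 0) {
            clock_gettime(CLOCK_MONOTONIC, &now);
            double gap = elapsed_s(&prev, &now);
            if (gap > worst)
                worst = gap;
            prev = now;
        }
        return worst;
    }
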
> > Also, while NODELAY (TCP_NODELAY I presume) might be interesting with
> > something that tried to have multiple, sub-MSS transactions in flight at
> > one time, it won't change anything about how the packets flow on the
> > network - TCP_NODELAY has no effect beyond the TCP code running the
> > connection associated with the socket for that connection.
> 
> I'm essentially going to be running a back-and-forth ping inside a TCP
>  session.  Nagle's algorithm can, if it glitches, add hundreds of
>  milliseconds to that, which can be very material - eg. when measuring
>  a LAN or Wifi.  I wouldn't set NODELAY on the bulk flows.
> 
> Why ping inside a TCP session, rather than ICMP?  Because I want to
>  bypass any specific optimisations for ICMP and measure what
>  applications can actually use.
> 
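
For what it's worth, that back-and-forth over TCP is essentially what a
single netperf TCP_RR transaction does.  A rough, untested sketch of
one timed transaction on the probe connection (assuming the server
echoes the same number of bytes back and TCP_NODELAY is already set):

    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>

    /* Send "len" bytes and wait for "len" bytes to come back; return
     * the round-trip time in seconds, or -1.0 on error. */
    static double tcp_ping_once(int fd, size_t len)
    {
        char buf[64];
        struct timespec t0, t1;
        size_t got = 0;

        if (len > sizeof(buf))
            return -1.0;
        memset(buf, 0, len);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (send(fd, buf, len, 0) != (ssize_t)len)
            return -1.0;
        while (got < len) {
            ssize_t n = recv(fd, buf, len - got, 0);
            if (n <= 0)
                return -1.0;
            got += (size_t)n;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
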
> > You may be able to get most of what you want with a top-of-trunk netperf
> > "burst mode" TCP_RR test. It isn't quite an exact match though.
> 
> I don't really see how that would get the measurement I want.

Then run one TCP_STREAM (or TCP_MAERTS) for the bulk transfer alongside
one TCP_RR with all the RTT stats and the histogram enabled :)
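
Roughly (from memory - check the manual for the details), with a
netperf built with --enable-histogram that would look something like:

    netperf -H <server> -t TCP_STREAM -l 300 &
    netperf -H <server> -t TCP_RR -l 300 -v 2 -- -r 1,1

The -v 2 is what gets the round-trip-time histogram printed for the
TCP_RR test, and the "burst mode" mentioned earlier needs a netperf
configured with --enable-burst plus the test-specific -b option.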

rick jones