"Eggert, Lars" writes: > we tried this too. The TCP timestamps are too coarse-grained for > datacenter latency measurements, I think under at least Linux and > FreeBSD they get rounded up to 1ms or something. (Midori, do you > remember the exact value?) Right. Well now that you mention it, I do seem to recall having read that Linux uses the clock ticks (related to the kernel hz value; i.e. between 250 and 1000 hz depending on configuration) as timestamp units. I suppose FreeBSD is similar. > No, but the sender and receiver can agree to embed them every X bytes > in the stream. Yeah, sometimes that timestamp may be transmitted in > two segments, but I guess that should be OK? Right, so a protocol might be something like this (I'm still envisioning this in the context of the netperf TCP_STREAM / TCP_MAERTS tests): 1. Insert a sufficiently accurate timestamp into the TCP bandwidth measurement stream every X bytes (or maybe every X milliseconds?). 2. On the receiver side, look for these timestamps and each time one is received, calculate the delay (also in a sufficiently accurate, i.e. sub-millisecond, unit). Echo this calculated delay back to the sender, probably with a fresh timestamp attached. 3. The sender receives the delay measurements and either just outputs it straight away, or holds on to them until the end of the test and normalises them to be deltas against the minimum observed delay. Now, some possible issues with this: - Are we measuring the right thing? This will measure the time it takes a message to get from the application level on one side to the application level on another. There are a lot of things that could impact this apart from queueing latency; the most obvious one is packet loss and retransmissions which will give some spurious results I suppose (?). Doing the measurement with UDP packets would alleviate this, but then we're back to not being in-stream... - As for point 3, not normalising the result and just outputting the computed delay as-is means that the numbers will be meaningless without very accurately synchronised clocks. On the other hand, not processing the numbers before outputting them will allow people who *do* have synchronised clocks to do something useful with them. Perhaps a --assume-sync-clocks parameter? - Echoing back the delay measurements causes traffic which may or may not be significant; I'm thinking mostly in terms of running bidirectional measurements. Is that significant? A solution could be for the receiver to hold on to all the measurements until the end of the test and then send them back on the control connection. - Is clock drift something to worry about over the timescales of these tests? https://www.usenix.org/legacy/events/iptps10/tech/slides/cohen.pdf seems to suggest it shouldn't be, as long as the tests only run for at most a few minutes. > http://e2epi.internet2.edu/thrulay/ is the original. There are several > variants, but I think they also have been abandoned: Thanks. From what I can tell, the measurement here basically works by something akin to the above: for TCP, the timestamp is just echoed back by the receiver, so roundtrip time is measured. For UDP, the receiver calculates the delay, so presumably clock synchronisation is a prerequisite. So anyway, thoughts? Is the above something worth pursuing? -Toke