Preliminary results of using GPS to look for clock skew

Eric Raymond esr at thyrsus.com
Wed Sep 21 22:11:38 EDT 2011


Dave Taht <dave.taht at gmail.com>:
> It is comforting to know that ntp is working well in your case, and, using
> GPS, we have a verifiable means with decent error bars of checking against
> ntp's algos independently!

Yup.  It'll get better as I refine my profiling and gain more insight
into the numbers.  My next task is to compute a lower bound for RS-232
transmission time and subtract that from E-S so we know how much of the
dominant component in fix latency is processing time.
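(For the record, the arithmetic for that lower bound is trivial; here's a
throwaway sketch of it.  The 8N1 framing at 4800 baud is an assumption --
the usual NMEA default -- not a measured value, so plug in whatever the
actual device uses.)

    /* Back-of-the-envelope lower bound on RS-232 wire time.
     * Assumes 8N1 framing (1 start + 8 data + 1 stop = 10 bits/char)
     * and 4800 baud, the common NMEA default. */
    #include <stdio.h>

    int main(void)
    {
        const double baud = 4800.0;        /* assumed line rate, bits/sec */
        const double bits_per_char = 10.0; /* 8N1: start + 8 data + stop */
        const int sentence_len = 82;       /* NMEA max sentence length, chars */

        double secs = sentence_len * bits_per_char / baud;
        printf("wire-time lower bound: %.1f ms\n", secs * 1000.0);
        return 0;
    }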

Er, for other bloat-dev members: I should have said up front that I've
volunteered to be the bufferbloat project's go-to guy on reliable time
sources for network performance profiling.  This is a completely
natural extension of the work I've been doing on GPSD since 2005.  GPS
gives us atomic-clock time with $40 hardware (provided we're below 60
degrees N or S latitude and can string an antenna somewhere with a
decent skyview).  I know almost everything there is to know about
extracting data from these sensors, and what I don't know my two senior
lieutenants on the GPSD project *do* know.

> Two ideas here:
> 
> 1) Run the router WITHOUT ntp enabled at all
>    (and/or testing against CLOCK_REALTIME)
>    It would be good to know how much the base clock drift is, without
> correction.

One of the things I don't know, and need to understand, is what the
relationships are among the different realtime clocks. The clock_gettime(3)
manual page is not hugely helpful.  It says:

       CLOCK_REALTIME
              System-wide real-time clock.  Setting this clock requires appro‐
              priate privileges.

       CLOCK_MONOTONIC
              Clock  that  cannot  be  set and represents monotonic time since
              some unspecified starting point.

       CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
              Similar to CLOCK_MONOTONIC, but provides access to a  raw  hard‐
              ware-based time that is not subject to NTP adjustments.

       CLOCK_PROCESS_CPUTIME_ID
              High-resolution per-process timer from the CPU.

       CLOCK_THREAD_CPUTIME_ID
              Thread-specific CPU-time clock.

Er, so what exactly is the relationship between the CLOCK_REALTIME clock
and the time(2) clock?  Are they the same?  If they're different, how 
are they different?

It says the CLOCK_MONOTONIC clock isn't settable, but the CLOCK_MONOTONIC_RAW
text implies that the former may get NTP adjustments.  And it doesn't specify
whether the per-process timers are NTP-corrected...I'd guess not, but who's
to know from the above.

Can anyone point me to better documentation on these facilities?
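(While I wait for a better reference, the experiment I'm planning to run
looks about like this -- just sample the clocks side by side and watch
which ones NTP slews.  Linux-specific; older glibc wants -lrt at link
time.  This is exploration, not an answer to the question above.)

    /* Minimal sketch: read each clock in quick succession so their
     * relationships can be eyeballed.  CLOCK_REALTIME's seconds field
     * should track time(2); the two monotonic clocks should diverge
     * only if NTP is adjusting the former. */
    #include <stdio.h>
    #include <time.h>

    static void show(const char *name, clockid_t id)
    {
        struct timespec ts;
        if (clock_gettime(id, &ts) == 0)
            printf("%-20s %ld.%09ld\n", name, (long)ts.tv_sec, ts.tv_nsec);
        else
            perror(name);
    }

    int main(void)
    {
        show("CLOCK_REALTIME", CLOCK_REALTIME);
        show("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
    #ifdef CLOCK_MONOTONIC_RAW
        show("CLOCK_MONOTONIC_RAW", CLOCK_MONOTONIC_RAW);
    #endif
        printf("%-20s %ld\n", "time(2)", (long)time(NULL));
        return 0;
    }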

> 2) ReRun all tests under load (example: netperf -l 3600 -H the_router)

I'll do this, for completeness, but I predict it's not going to make
any measurable difference.  The indications so far are that neither of
the means of time delivery I have available to check is compute-bound
or disk-I/O bound at any point in its delivery chain.

So I think they're just going to shrug off any load short of
machine-thrashing-its-guts-out.  But part of the point of what I'm
doing is that soon we'll have the test tools to know for *sure* that's
true.

> The followon test to this is to actually start collecting and parsing
> ntp rawstats statistics, which can be easily turned on and collected
> on the router. It's getting-those-stats-somewhere coupled with the
> need to periodically delete these statistic files that's a problem at
> the moment, and really only the former...
> 
> The bufferbloat signal (if it exists) is in the noise that ntp is
> currently (successfully in your case) rejecting.
> 
> There are a couple rawstats parsers floating about, I have part of
> one, hal has another. I committed a major overdesign
> sin in mine by wanting to put it all into a postgres db, ran into
> major data representation problems (time on postgres is different than
> time inside of ntp), and put the work aside (it's on github in the
> same pieces I left it in)
> 
> To enable rawstats collection on the router, modify /etc/ntp.conf to contain:
> 
> statsdir /tmp
> statistics rawstats
> filegen rawstats file rawstats type day enable
> 
> and restart ntp
> 
> on a system protected by apparmor - like ubuntu - it's mildly trickier
> as you need to add a
> 
> whereverthelogdiris/rawstats* rwl
> 
> to the /etc/apparmor.d/usr.sbin.ntpd
> 
> The final bit of the cbbd is to actually collect port numbers - so
> stuff on ephemeral ports is known to be from
> natted devices and stuff on 123
> 
> but I'm getting way ahead of myself here.

Agreed.  I also think you're complicating life unnecessarily. 

If we need rawstats in a form for real-time monitoring, why not modify
NTP to optionally multicast them and avoid all this going to disk?  I
have good relations with the NTP guys, and even if we didn't they
wouldn't be likely to resist a feature request with a
network-health-monitoring use case.  Let's *use* that zorch for
something, rather than
fielding a fragile pile of hacks.
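(To make the shape of that idea concrete: this is a sketch of something
that does not exist yet, with an invented multicast group and port, but
the collector end could be about this small.  The ntpd side would just
sendto() each rawstats line to the same group as it writes it.)

    /* Hypothetical collector for multicast rawstats; error checking
     * pared to the bone.  Group address and port are made up. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        struct ip_mreq mreq;
        char buf[512];

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(12345);                  /* invented port */
        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        mreq.imr_multiaddr.s_addr = inet_addr("239.255.0.1"); /* invented group */
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        for (;;) {
            ssize_t n = recvfrom(sock, buf, sizeof(buf) - 1, 0, NULL, NULL);
            if (n <= 0)
                break;
            buf[n] = '\0';
            fputs(buf, stdout);        /* one rawstats line per datagram */
        }
        close(sock);
        return 0;
    }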


-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


