[Bloat] Graph of bloat

Hal Murray hmurray at megapathdsl.net
Thu Jul 9 06:07:23 EDT 2015


There are several parts to this discussion.

Leap seconds are ugly.  The basic problem is that POSIX pretends they don't 
exist.  That's a carryover from the early days when computer timekeeping 
didn't have to worry about them.  Leap seconds weren't introduced until 1972.  
There should be a second labeled 23:59:60, but most systems just set the clock 
back a second and repeat 23:59:59, and all sorts of systems get in trouble 
when time goes backwards.

They don't impact daily life like leap years do, so we don't teach kids about them when they learn about leap years.  Most people don't even know they exist, and that includes most programmers.  An additional complication is that they are unpredictable, so you can't wire simple conversions into a chunk of code that gets copied around.

Google decided that it was simpler to "smear" their clocks rather than chase down and fix the bugs in all their code.
  Time, technology and leaping seconds 
  http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html
The downside is that all their clocks are off by up to 1/2 second.  If you don't need accurate time for legal reasons like stock market trading, their approach is probably a good one.  Their internal clocks will all agree with each other, but they won't agree with outside systems that aren't playing the smearing game.

The blog above describes the smear using cosine - no sharp corners.  The graph shows a linear smear.


> Does ntp adjust system time backward based on getting nearly all it's
> samples with well over a 1/2 second of induced delay? 

The idea with smearing is to avoid having to set the clock back.  The reference time on that graph is UTC.  If your server was using only Google's NTP servers, it would follow that ramp, inserting the leap second over 20 hours rather than all at once by setting the clock back.  That's the whole point of the smear.  You lie to all your NTP clients and they all follow the same lie.
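To make the difference between the two ramps concrete, here is a minimal sketch of a linear smear next to a cosine one.  The 20 hour window comes from the graph; the function names and the exact cosine form are my assumptions for illustration, not a copy of what Google's servers run.  Each function returns how much of the leap second has been applied so far.

```python
import math

LEAP = 1.0             # seconds to absorb
WINDOW = 20 * 3600.0   # smear window in seconds (20 hours, as on the graph)

def linear_smear(t, window=WINDOW, leap=LEAP):
    """Part of the leap second applied t seconds into the window, straight ramp."""
    t = min(max(t, 0.0), window)   # clamp to the window
    return leap * t / window

def cosine_smear(t, window=WINDOW, leap=LEAP):
    """Same total, but the slope is zero at both ends (no sharp corners)."""
    t = min(max(t, 0.0), window)
    return leap * (1.0 - math.cos(math.pi * t / window)) / 2.0

for hours in (0, 5, 10, 15, 20):
    t = hours * 3600.0
    print(f"{hours:2d} h  linear {linear_smear(t):.3f} s  cosine {cosine_smear(t):.3f} s")
```

Both curves start at 0 and end at 1 second.  The linear one changes the clock rate abruptly at the start and end of the window; the cosine one eases in and out.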

All that has nothing to do with bloat.  It's just background for why I was making the graph.

--------

Now for NTP...

After the typical NTP client-server exchange, the client has four timestamps: the send and receive times for the packet in each direction.  If you look at things in the right way, you have N equations and N+1 unknowns.  You need one more equation to sort things out.

If you assume that the clocks on both ends are accurate, you can compute the network transit times in both directions.

NTP makes the assumption that the network delays are symmetric.  Without bloat, that's generally reasonable.  It does screw up on long links with asymmetric routing.  If you watch NTP servers over a long distance, you can see steps when the routing changes.  On the scale of bloat, those errors are minor.  If you had a fast link rather than my slow DSL link, they would be significant.
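Here is a small sketch of the standard offset and delay arithmetic on the four timestamps, plus a toy simulation to show what asymmetry does.  The simulate() helper and the delay numbers are made up for illustration.

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """t1: client send, t2: server receive, t3: server send, t4: client receive.
    t1 and t4 are read from the client clock, t2 and t3 from the server clock."""
    delay = (t4 - t1) - (t3 - t2)            # round trip minus time spent in the server
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # server minus client, ASSUMING symmetric paths
    return offset, delay

def simulate(true_offset, out_delay, back_delay, server_hold=0.001, t1=100.0):
    """Hypothetical helper: build the four timestamps for a client clock that is
    true_offset seconds behind the server."""
    t2 = t1 + out_delay + true_offset
    t3 = t2 + server_hold
    t4 = t3 - true_offset + back_delay
    return t1, t2, t3, t4

# Symmetric path, 20 ms each way: the offset comes out right.
print(ntp_offset_delay(*simulate(0.050, 0.020, 0.020)))

# Return path bloated to 520 ms: the offset is wrong by (out - back) / 2.
print(ntp_offset_delay(*simulate(0.050, 0.020, 0.520)))
```

The error is half the difference between the two one-way delays, so half a second of queueing in one direction looks like a quarter second of clock error.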

ntpd remembers the last 8 samples from each server.  It only uses the one with the lowest round trip time, assuming that the others hit some sort of queueing delay.  That filters out occasional bursts of interference or even bloat.  It doesn't work for sustained bloat.
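The idea can be sketched in a few lines.  This is a simplified illustration of "keep 8, use the one with the lowest delay", not ntpd's actual clock filter, which does more bookkeeping.  The class name and the sample values are invented.

```python
from collections import deque

class MinDelayFilter:
    """Keep the last `size` (offset, delay) samples and pick the lowest-delay one."""

    def __init__(self, size=8):
        self.samples = deque(maxlen=size)   # old samples fall off automatically

    def add(self, offset, delay):
        self.samples.append((delay, offset))

    def best(self):
        if not self.samples:
            return None
        delay, offset = min(self.samples)   # smallest delay wins
        return offset, delay

f = MinDelayFilter()
f.add(0.010, 0.040)     # clean sample
f.add(-0.240, 0.540)    # one bloated sample: ignored
f.add(0.012, 0.045)
print(f.best())

# Sustained bloat: once all 8 slots hold bloated samples, the best one is still bad.
for _ in range(8):
    f.add(-0.240, 0.540)
print(f.best())
```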

The huff-n-puff filter can be used for sustained bloat - better to coast than get confused.  But there needs to be some limit on how long to wait before assuming the current timings are valid because the network has been reconfigured.  If your bloat lasts long enough, ntpd will get confused.


In addition to getting the time correct, ntpd is also trying to calibrate the clock frequency so the future time will be more accurate (if the current time is good).  That's the "drift".  Without that correction, the clock will drift farther from the true time the longer you wait.

Ballpark numbers for the errors in crystals are tens of PPM (parts per million).  One PPM is roughly a second over 2 weeks, so an uncorrected clock is likely to drift seconds per day.  I have one system that's off by 138 PPM.  (The drift can also correct for minor errors in software.)
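The arithmetic is easy to check.  A quick sketch (the function name is just for illustration):

```python
DAY = 86400  # seconds

def drift_seconds(ppm, elapsed_seconds):
    """Accumulated clock error for a constant frequency error given in PPM."""
    return ppm * 1e-6 * elapsed_seconds

print(f"1 PPM over 2 weeks: {drift_seconds(1, 14 * DAY):.4f} s")
print(f"30 PPM over 1 day:  {drift_seconds(30, DAY):.4f} s")
print(f"138 PPM over 1 day: {drift_seconds(138, DAY):.4f} s")
```

One PPM works out to about 1.2 seconds in two weeks, and the 138 PPM system would be off by nearly 12 seconds per day without correction.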

Normally, ntpd is just making minor corrections.  It does that by slewing the clock, that is by fudging the clock frequency so the clock will "drift" in the desired direction.  That takes a long time to make large corrections.  ntpd will normally step the clock if the correction is over 128 ms.
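A rough sketch of the slew-or-step decision follows.  The 128 ms threshold is the one mentioned above.  The 500 PPM maximum slew rate is my assumption for the example, and the time estimate is only a lower bound: the real ntpd runs a feedback loop and takes longer than flat-out slewing would.

```python
STEP_THRESHOLD = 0.128   # seconds, the default mentioned in the text
MAX_SLEW_PPM = 500.0     # ASSUMED maximum slew rate, for illustration only

def plan_correction(offset):
    """Return ("step", 0.0) or ("slew", minimum seconds needed at the maximum rate)."""
    if abs(offset) > STEP_THRESHOLD:
        return ("step", 0.0)
    return ("slew", abs(offset) / (MAX_SLEW_PPM * 1e-6))

for offset in (0.010, 0.100, -0.259747):
    action, seconds = plan_correction(offset)
    print(f"offset {offset:+.6f} s -> {action}, at least {seconds:.0f} s of slewing")
```

Even at the assumed maximum rate, removing 100 ms takes a few minutes, which is why large corrections get stepped instead.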

But stepping the clock backwards is what causes most of the problems.  ntpd has command line switches to don't-do-that, and another to allow one step at startup time...  There are no simple answers.

--------

> Judging from that graphic... I don't think huff and puff was designed for
> the bufferbloated era! so the question remains, in hal's tests, did ntp
> adjust the clock backwards? 

No.  The system that collected that data was getting time from a good local GPS clock.  It helps to have a place to stand if you want to collect time data.

Here is a typical pattern from a system using the pool without any huff-n-puff while I did a big download.
 8 Jul 22:02:17 ntpd[26705]: 0.0.0.0 061c 0c clock_step -0.259747 s
 8 Jul 23:06:24 ntpd[26705]: 0.0.0.0 061c 0c clock_step +0.274448 s
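If you want to pull the steps out of a log like that, a short script will do.  This is a sketch that assumes the log lines look exactly like the two above; the regular expression and function name are mine.

```python
import re

LOG = """\
 8 Jul 22:02:17 ntpd[26705]: 0.0.0.0 061c 0c clock_step -0.259747 s
 8 Jul 23:06:24 ntpd[26705]: 0.0.0.0 061c 0c clock_step +0.274448 s
"""

# Matches the signed step size that follows "clock_step".
STEP_RE = re.compile(r"clock_step ([+-]\d+\.\d+) s")

def clock_steps(text):
    """Return the step sizes (in seconds) found in the log text, in order."""
    return [float(m.group(1)) for m in STEP_RE.finditer(text)]

steps = clock_steps(LOG)
print("steps:", steps)
print(f"net change: {sum(steps):+.6f} s")
```

The two steps nearly cancel: the net change is about 15 ms.  That is what you would expect if the first step was chasing the bloat rather than a real clock error, and the second one undid it after the download finished.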


-- 
These are my opinions.  I hate spam.





