[Cerowrt-devel] uptime?

Dave Taht dave.taht at gmail.com
Fri Feb 6 15:47:46 EST 2015


On Sat, Feb 7, 2015 at 8:42 AM,  <Valdis.Kletnieks at vt.edu> wrote:
> On Fri, 06 Feb 2015 15:27:32 +1300, Dave Taht said:
>> so, how's everybody's uptime?
>
> Sitting at 27 days due to a power blip.

I do strongly feel that home routers should have a battery or supercap
with at least 30 seconds of lifetime. In Nica, the power flickered 6
times a day, with half-day-long outages every couple of weeks (rolling
blackouts during one phase being much worse). It was a glorious PITA
to have to wait for everything in the tin cans and string connecting
everything to reassociate and return to connectivity.

In SF, I've seen it flicker, oh, a couple times, in the last 6 months,
3 times long enough to force a reboot of everything.

Batteries have got cheap, as has power conversion.

I guess the bigger philosophical questions I'm having re "uptime" are:

A) "how long is long enough" before natural factors like power
failures start to dominate the uptime statistics?

And there are multiple modes of failure - "up but not working right"
is much worse than "reboot due to some self diagnostic saying we're
hosed somehow" or "reboot at 4AM because we installed an update and
nobody was actively using the system".

Personally - I'd like a systems-critical computing device to stay up
for its entire lifetime - 10 years - but with the current
architectures of systems I don't see a way to get there, short of
stepping back to the 90s-style microkernels where the device drivers
were abstracted out of the OS, or forward into the containerized
future, which has at least some of the same benefits as microkernels
did.

And in any case, having a suite of inherent tests for continued
"correct" behavior introduces both "automated fixes" and the potential
for induced bugs in the monitoring tools themselves. Take the DNS
failure path for example - it is hard to tell whether DNS is failing
upstream if your local instance is the thing that is messed up.
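
Something like this is the shape of check I have in mind - a rough
sketch, stdlib python only, and the resolver addresses and probe name
are just placeholders, not anything cero actually ships:

#!/usr/bin/env python3
# Hedged sketch: probe the local dnsmasq instance and an external resolver
# with the same query, to tell "my local DNS is wedged" apart from
# "upstream DNS is down". Pure stdlib; the server addresses and the probe
# name are assumptions, not anything CeroWrt actually does today.
import socket, struct, random

def dns_query_ok(server, name="example.com", timeout=3.0):
    """Send a minimal A query and return True if any reply comes back."""
    txid = random.randint(0, 0xFFFF)
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)  # RD=1, 1 question
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)                # QTYPE=A, QCLASS=IN
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(header + question, (server, 53))
        reply = s.recv(512)
        return len(reply) >= 12 and reply[0:2] == header[0:2]  # same txid came back
    except OSError:
        return False
    finally:
        s.close()

local_ok = dns_query_ok("127.0.0.1")    # dnsmasq on the router itself
upstream_ok = dns_query_ok("8.8.8.8")   # some external reference resolver
if not local_ok and upstream_ok:
    print("local dnsmasq looks wedged; restart it, or at least raise an alarm")
elif not upstream_ok:
    print("upstream DNS/connectivity problem, not a local dnsmasq failure")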

It is hard to test whether DHCP is failing, for any reason, aside from
actually having an external device periodically attempt to register
and report the result.
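
A sketch of that external-probe idea, assuming scapy is available on
some LAN client and the probe runs as root (the interface name is a
placeholder; nothing like this ships in cero):

#!/usr/bin/env python3
# Hedged sketch of the "external device registers periodically" idea:
# broadcast a DHCP DISCOVER from a LAN client and see whether any OFFER
# comes back. Assumes scapy and root; the interface name is a placeholder.
import random
from scapy.all import Ether, IP, UDP, BOOTP, DHCP, srp, get_if_hwaddr, conf

IFACE = "eth0"                          # assumption: the probe host's LAN interface
conf.checkIPaddr = False                # DHCP replies won't match the broadcast dst
mac = get_if_hwaddr(IFACE)
chaddr = bytes.fromhex(mac.replace(":", "")) + b"\x00" * 10

discover = (Ether(src=mac, dst="ff:ff:ff:ff:ff:ff") /
            IP(src="0.0.0.0", dst="255.255.255.255") /
            UDP(sport=68, dport=67) /
            BOOTP(chaddr=chaddr, xid=random.randint(0, 0xFFFFFFFF), flags=0x8000) /
            DHCP(options=[("message-type", "discover"), "end"]))

answered, _ = srp(discover, iface=IFACE, timeout=5, verbose=False)
if answered:
    print("got a DHCP OFFER; dnsmasq's DHCP side looks alive")
else:
    print("no OFFER within 5s; DHCP looks dead from this client's point of view")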

One of the things I have seen happen is that wifi multicast would keep
working on some versions of cero, but the unicast path would fail - so
babel would continue to route over the link because it only tested the
multicast path.
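
A dumb unicast-path sanity check would go a long way there. Rough
sketch below; where the neighbor list comes from is an assumption (fed
in by hand here - parsing it out of babeld is left as an exercise),
and it assumes an iputils-style ping with -6/-W:

#!/usr/bin/env python3
# Hedged sketch of a unicast-path check for the "multicast works, unicast
# doesn't" failure: ping each babel neighbor's link-local address over the
# wifi interface. Neighbor list and interface name are placeholders.
import subprocess

IFACE = "wlan0"                                  # assumption
NEIGHBORS = ["fe80::ca6c:87ff:fe44:1234"]        # placeholder addresses

def unicast_alive(addr, iface, count=3, timeout=2):
    """Return True if at least one unicast ping to the neighbor succeeds."""
    cmd = ["ping", "-6", "-c", str(count), "-W", str(timeout), f"{addr}%{iface}"]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

for n in NEIGHBORS:
    if not unicast_alive(n, IFACE):
        print(f"unicast path to {n} looks dead even if babel still sees hellos")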

Sustained 100% CPU from daemons that shouldn't be grabbing that much
would be a good indicator of failure....
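
Checking for that is cheap - sample each process's cumulative CPU time
from /proc twice, a minute apart, and flag anything that burned close
to a full core in between. The watch list below is my assumption about
which daemons should mostly be idle; the /proc/<pid>/stat field layout
is standard Linux:

#!/usr/bin/env python3
# Hedged sketch of a "daemon stuck at 100% CPU" indicator.
import os, time

CLK_TCK = os.sysconf("SC_CLK_TCK")
WATCH = {"dnsmasq", "babeld", "hostapd"}     # assumption: normally near-idle daemons

def cpu_ticks():
    """Map pid -> (comm, utime+stime in clock ticks)."""
    out = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                head, rest = f.read().rsplit(")", 1)  # comm may contain spaces
                comm = head.split("(", 1)[1]
                fields = rest.split()
                out[pid] = (comm, int(fields[11]) + int(fields[12]))  # utime + stime
        except (OSError, IndexError, ValueError):
            pass                                      # process vanished mid-read
    return out

INTERVAL = 60
before = cpu_ticks()
time.sleep(INTERVAL)
after = cpu_ticks()
for pid, (comm, ticks) in after.items():
    if comm in WATCH and pid in before:
        used = (ticks - before[pid][1]) / CLK_TCK
        if used > 0.9 * INTERVAL:                     # ~a whole core, the whole minute
            print(f"{comm} (pid {pid}) used {used:.0f}s CPU in {INTERVAL}s - suspicious")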

On the observational side from this group - we know at this point that
crashing or misbehavior every few hours or days is unacceptable, every
few weeks is the best we can hope for from many other shipped
"commercial" home-oriented systems, a few months is a joy, and 259
days is our record. And there are so many potential causes of problems
that the best we can do is usually to continually update and revise
the software in light of new data - *and* make
absolutely-damned-sure the hardware itself is engineered to the finest
quality standards possible (I'm a real bastard about requiring -40 to
70C in everything I ever buy that is going to get wedged behind a TV).

So another way to put this is: how many 9s of reliability can we aim
for, and how do we get there? Most of the literature on MTBF is
oriented towards making better mechanical hardware (with moving parts)
rather than computer hardware or software (links to better stuff,
anyone?). I'll argue, however, that a planned reboot at 4AM is a
zillion times less customer-affecting than one in the middle of a
business-critical Skype call.

How many bugs merely happen due to entropy? (we seem to be doing a
great job of not running out of memory at this point, or writing to
flash too much). Or cosmic events? Are we at the point where ECC RAM
would make a difference?

Certainly relying on a wonderful set of core early adopters is
something that often works better than a formal QA dept. (:) )

If you consider "uptime" as the sole goal, and reboots as a way of
solving major problems, well, speeding up boot times would be a way to
make progress on getting down to, say, 6 minutes of downtime per
year (that's about 3 reboots/year at present boot times).
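
To make the back-of-envelope arithmetic concrete (my assumed numbers,
not measurements - roughly 2 minutes to boot today):

# Back-of-envelope arithmetic behind "6 minutes of downtime per year".
MIN_PER_YEAR = 365 * 24 * 60           # 525,600 minutes
boot_time_min = 2                      # assumption: ~2 minute boot today
reboots_per_year = 3
downtime = reboots_per_year * boot_time_min            # 6 minutes
availability = 1 - downtime / MIN_PER_YEAR
print(f"{downtime} min/year down = {availability:.6%} available")
# -> roughly 99.9989%, i.e. just shy of "five nines" (which allows ~5.26 min/year)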

If we keep drilling down harder on the known problems over the product
lifetime, and have a way to do continual updates,
eventually an acceptable goal for uptime and reliability could be
met. This unfortunately requires either a new/continued sale rate to
cover the ongoing costs of continual development (ponzi schemes *do
work* for a while, usually), or an ongoing revenue stream dedicated to
maintaining and improving the product. A metric ton of people rent
their cable modems for 7 dollars a month, for example, but very little
if any of that revenue flows back into the org that maintains and
updates the firmware itself. I could see offering automated updates
and news via email (people are now quite used to that from their
phones), but I imagine very few would actually sign up for it and pay a
buck a month into the box on the wall that they are already paying
their ISP for.

And I don't really know what is acceptable for failure rates that are
"less than never". Paul McKenney pointed out recently that he was
working on finding and fixing bugs that only happened once in a really
big number of runs - but often enough that at least 3 people on the
planet were hitting them every day.

If we had 10k or more users, collecting data about problems more
centrally and picking through it with a fine-toothed comb would help.
Certainly my polling this group for stats is not quite as useful as
actually getting y'all to opt in to at least some stats collection
(and figuring out which stats are actually useful to collect is a
problem in itself).
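
As a strawman for what an opt-in beacon might gather - a handful of
fields readable without extra privileges, and a collection URL that is
purely hypothetical (nothing like this exists in cero today):

#!/usr/bin/env python3
# Hedged sketch of an opt-in stats beacon. The endpoint is a placeholder.
import json, time, urllib.request

def read_first_line(path):
    with open(path) as f:
        return f.readline().strip()

stats = {
    "ts": int(time.time()),
    "uptime_s": float(read_first_line("/proc/uptime").split()[0]),
    "loadavg": read_first_line("/proc/loadavg").split()[:3],
    "mem_free_kb": next(int(l.split()[1]) for l in open("/proc/meminfo")
                        if l.startswith("MemFree:")),
}
payload = json.dumps(stats).encode()

# Hypothetical collection endpoint - placeholder only.
req = urllib.request.Request("https://stats.example.invalid/submit",
                             data=payload,
                             headers={"Content-Type": "application/json"})
try:
    urllib.request.urlopen(req, timeout=10)
except OSError:
    pass   # best effort; never let the beacon itself become a problem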

> One issue I've had is that dnsmasq has gone out to lunch a few times,
> resulting in devices losing their IPv4 address when they renew their
> DHCP lease.  Restarting dnsmasq makes it work again. I'll have to do
> more diagnosis the next time it happens...

While testing for the dnssec bug over ipv6 I was able to crash dnsmasq
quite frequently with the new test dnsmasq I'd distributed (did anyone
else try it?), but I got buried by the rush of things to do before I
left for NZ and haven't got back to it. And (at the moment, anyway) I
can't seem to get back into that network from here, so it seems to
have crashed again for some reason or another.

-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
