[Cerowrt-devel] uptime?

David Lang david at lang.hm
Fri Feb 6 16:10:24 EST 2015


As a sysadmin, I will say that uptime (time since last boot) as a metric is 
EVIL: it leads to people avoiding important updates, and it drives people into 
implementing horribly complex and dangerous things just to avoid resetting the 
uptime number.

Downtime is a much better metric, as it can be qualified to exclude deliberate 
updates and external factors like power outages. As Dave says, a router 
rebooting at 3am is not going to bother very many people, but a hang that 
requires a manual reboot at 2pm is going to be 1000x worse.

Measuring downtime can be hard: do you count the system as down if anything is 
broken, or only if everything is broken? Neither is really the right answer, 
but how do you quantify the impact of being partially down?
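
One rough, purely illustrative way to attack the quantification problem is to 
weight each outage by the fraction of the service it affected; all the numbers 
and weights below are made up, but they show the shape of the metric:

    # Hypothetical weighted-downtime sketch (all numbers illustrative).
    # Each outage is (duration_in_minutes, fraction_of_service_affected).
    outages = [
        (30, 1.0),    # total outage: everything down for 30 minutes
        (120, 0.2),   # partial outage: e.g. dnsmasq wedged, routing still up
    ]

    weighted_downtime = sum(minutes * impact for minutes, impact in outages)
    print(weighted_downtime)  # 30*1.0 + 120*0.2 = 54 "effective" minutes down

How to assign the impact weights is of course exactly the hard part.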

Improving boot time is always a nice thing to do, but there is a point of 
diminishing returns. When you start, you fix some real issues and usually find 
things that improve ongoing operation as well. But after a while you get to 
much harder problems, and the result matters far less. If it takes 10 minutes 
to boot, you boot and go do something else; cutting that by 50% is a huge win. 
But if it takes 10 seconds to boot, cutting it to 5 seconds won't make a 
noticeable difference (especially if that doesn't count the 20 seconds the 
BIOS takes before it hands control off to our software :-)

So, some tools to measure downtime, and some ability to flag downtime in a small 
handful of categories would be good. I see the useful categories as

1. external outage (power, ISP, etc)
2. planned reboots for upgrades
3. admin action (could be combined with external outages)
4. everything else

and then we should concentrate on the last group (see the strawman sketch below).
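
As a strawman for the bookkeeping itself, appending one line per outage, tagged 
with one of those categories, would go a long way. The log path, format, and 
function below are all invented for illustration, and in Python rather than 
whatever would actually run on the router:

    # Strawman outage log; the categories mirror the list above, everything
    # else (path, record format) is hypothetical.
    import time

    CATEGORIES = ("external", "planned-upgrade", "admin-action", "other")

    def log_outage(category, start, end, note="", path="/tmp/outages.log"):
        """Append one record: category, start time, duration in seconds, note."""
        assert category in CATEGORIES
        with open(path, "a") as f:
            f.write("%s %d %d %s\n" % (category, int(start), int(end - start), note))

    # Example: a planned 3am reboot that took 90 seconds.
    now = time.time()
    log_outage("planned-upgrade", now - 90, now, "nightly firmware update")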

It may be useful to modify LuCI so that the login page has a 'report outage' 
link that lets someone either add a reason to an existing outage or report a 
partial outage (like the dnsmasq or dhcp issues), along with a few categories 
of severity (dead, severe impact, minor impact), and then an 'are you willing 
to have this information reported' toggle to have it phone home with this 
data.
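
For the phone-home side, the report itself could be tiny. The field names, 
severity levels, and everything else below are invented for the sake of 
argument (the real thing would presumably live in LuCI, not Python):

    # Hypothetical outage report payload; nothing here is a real LuCI API.
    import json

    report = {
        "severity": "severe",    # one of: dead, severe, minor
        "category": "other",     # same categories as the list above
        "reason": "dnsmasq stopped answering DHCP, had to restart it",
        "start": 1423256400,     # unix timestamps bounding the outage
        "end": 1423257000,
        "opt_in": True,          # the 'willing to have this reported' toggle
    }

    print(json.dumps(report, indent=2))  # what would get phoned home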

thoughts?

David Lang

On Sat, 7 Feb 2015, Dave Taht wrote:

> On Sat, Feb 7, 2015 at 8:42 AM,  <Valdis.Kletnieks at vt.edu> wrote:
>> On Fri, 06 Feb 2015 15:27:32 +1300, Dave Taht said:
>>> so, how's everybody's uptime?
>>
>> Sitting at 27 days due to a power blip.
>
> I do strongly feel that home routers should have a battery or supercap
> with at least 30 seconds of lifetime. In Nica, the power flickered 6
> times a day, with half-day-long outages every couple of weeks (rolling
> blackouts during one phase being much worse). It was a glorious PITA
> to have to wait for everything in the tin cans and string connecting
> everything to reassociate and return to connectivity.
>
> In SF, I've seen it flicker, oh, a couple of times in the last 6 months,
> 3 of them long enough to force a reboot of everything.
>
> Batteries have got cheap, as has power conversion.
>
> I guess the bigger philosophical questions I'm having re "uptime" are
>
> A) "how long is long enough" before natural factors like power
> failures start to dominate the uptime statistics?
>
> And there are multiple modes of failure - "up but not working right"
> is much worse than "reboot due to some self diagnostic saying we're
> hosed somehow" or "reboot at 4AM because we installed an update and
> nobody was actively using the system".
>
> Personally - I'd like a systems-critical computing device to stay up
> for its entire lifetime - 10 years - but with the current
> architectures of systems I don't see a way to get there, short of
> stepping back to the 90s-style microkernels where the device drivers
> were abstracted out of the OS, or forward into the containerized
> future, which has at least some of the same benefits as microkernels
> did.
>
> And in any case, having a suite of inherent tests for continued
> "correct" behavior both enables "automated fixes" and introduces
> potential bugs in the monitoring tools themselves. Take the DNS
> failure path for example - it is hard to tell whether DNS is failing
> if your local instance is messed up.
>
> It is hard to test whether DHCP is failing, for any reason, aside from
> actually having an external device attempt to register periodically
> and report the result.
>
> One of the things that I have seen happen is that wifi multicast would
> keep working on some versions of cero, but the unicast path would fail
> - so babel would continue to route because it only tested the
> multicast path.
>
> 100% CPU for long, continued periods from daemons that shouldn't be
> grabbing that much would be a good indicator of failure....
>
> On the observational side from this group - we know at this point that
> crashing or misbehavior every few hours or days is unacceptable, that
> every few weeks is the best we can hope for from many other shipped
> "commercial" home-oriented systems, that a few months is a joy, and
> that 259 days is our record. And there are so many potential causes of
> problems that the best we can do is usually to continually update and
> revise the software in light of new data - *and* make absolutely
> damned sure the hardware itself is engineered to the finest quality
> standards possible (I'm a real bastard about requiring -40 to 70C in
> everything I ever buy that is going to get wedged behind a TV).
>
> So another way to put this is: how many 9s of reliability can be aimed
> for, and how do we get there? Most of the literature on MTBF is
> oriented towards making better hardware (with moving parts) rather
> than computer hardware or software (links to better stuff, anyone?).
> I'll argue, however, that a planned reboot at 4AM is a zillion times
> less customer-affecting than one in the middle of a business-critical
> Skype call.
>
> How many bugs merely happen due to entropy? (We seem to be doing a
> great job of not running out of memory at this point, or of writing to
> flash too much.) Or cosmic events? Are we at the point where ECC RAM
> would make a difference?
>
> Certainly relying on a wonderful set of core early adopters is
> something that often works better than a formal QA dept. (:) )
>
> If you consider "uptime" as the sole goal, and reboots as a way of
> solving major problems, well, speeding up boot times would be a way to
> make progress on getting down to, say, 6 minutes of downtime per
> year (that's about 3 reboots/year at present boot times).
>
> If we keep drilling down harder on the known problems over the product
> lifetime, and have a way to do continual updates, eventually an
> acceptable goal for uptime and reliability could be met. This
> unfortunately requires either a new/continued sale rate to cover the
> ongoing costs of continual development (Ponzi schemes *do work* for a
> while, usually), or an ongoing revenue stream dedicated to maintaining
> and improving the product. A metric ton of people rent their cable
> modems for 7 dollars a month, for example, but very little if any of
> that revenue flows back into the org that maintains and updates the
> firmware itself. I could see offering automated updates and news via
> email (people are now quite used to that from their phones), but I
> imagine very few would actually sign up for it and pay a buck a month
> for the box on the wall that they are already paying their ISP for.
>
> And I don't really know what is acceptable for failure rates that are
> "less than never". Paul McKenney pointed out recently that he was
> working on finding and fixing bugs that only happened once in a really
> big number of runs - but often enough that at least 3 people on the
> planet were experiencing them every day.
>
> If we had 10k or more users, collecting data on problems more
> centrally and picking through it with a fine-toothed comb would help.
> Certainly my polling this group for stats is not quite as useful as
> actually getting y'all to opt in for at least some stats collection
> (and figuring out which stats are useful to collect is a problem).
>
>> One issue I've had is that dnsmasq has gone out to lunch a few times,
>> resulting in devices losing their IPv4 address when they renew their
>> DHCP lease.  Restarting dnsmasq makes it work again. I'll have to do
>> more diagnosis the next time it happens...
>
> While testing for the DNSSEC bug over IPv6 I was able to crash dnsmasq
> quite frequently with the new test dnsmasq I'd distributed (did anyone
> else try it?), but I got buried by the rush of things to do before I
> left for NZ and haven't gotten back to it. And (at the moment, anyway)
> I can't seem to get back into that network from here, so it seems to
> have crashed again for some reason or another.
>
> -- 
> Dave Täht
>
> http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel

