[Cerowrt-devel] uptime?

Fri Feb 6 16:01:43 EST 2015

On Fri, Feb 6, 2015 at 12:47 PM, Dave Taht <dave.taht at gmail.com> wrote:

> On Sat, Feb 7, 2015 at 8:42 AM,  <Valdis.Kletnieks at vt.edu> wrote:
> > On Fri, 06 Feb 2015 15:27:32 +1300, Dave Taht said:
> >> so, how's everybody's uptime?
> >
> > Sitting at 27 days due to a power blip.
>
> I do strongly feel that home routers should have a battery or supercap
> with at least 30 seconds lifetime. In Nica, the power flickered 6
> times a day, with half-day long outages every couple weeks (rolling
> blackouts during one phase being much worse). It was a glorious PITA
> to have to wait for everything in the tin cans and string connecting
> everything to reassociate and return to connectivity.
>
> In SF, I've seen it flicker, oh, a couple times, in the last 6 months,
> 3 times long enough to force a reboot of everything.
>
> Batteries have got cheap, as has power conversion.
>

Yet the market this hardware comes from is actively racing itself to the
bottom.  Sure, there are higher-end, more expensive devices, but they're
not what most people buy (and tech trickles down from those layers in odd
ways).

> I guess the bigger philosophical questions I'm having re "uptime" are:
>
> A) "how long is long enough" before natural factors like power
> failures start to dominate the uptime statistics?
>

"long enough"?  I'd say a year of uptime, minimum, certainly to the point
that firmware updates to deal with security patches/features/optimizations are
the only _real_ source of reboot, aside from the odd cosmic particle
bit-flip.  At my day-job, our platform (based on router SoCs and running
modified OpenWRT distributions) has gone for a year without issue, but
usually they get upgraded by the management backend long before that time
arrives.

Also, it feels like once you get out past a month or two, you're either
"good forever", or nearly so.

> And there are multiple modes of failure - "up but not working right"
> is much worse than "reboot due to some self diagnostic saying we're
> hosed somehow" or "reboot at 4AM because we installed an update and
> nobody was actively using the system".
>

I've found that app layer functional watchdogs are a great way to build a
safety net.  They don't solve the bugs, but they buy you visibility into
the existence of the bugs (so long as you log and report on the triggering
watchdogs), so that you know where to start looking.
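
As a rough illustration of the idea, here is a minimal Python sketch. The
class name, the failure threshold, and the example checks are all
hypothetical, made up for this example rather than taken from any real
platform. Each check is a function that returns True when the feature it
covers is actually working; every failure is logged, and a recovery action
fires once a check has failed several times in a row.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("watchdog")


class FunctionalWatchdog:
    """Runs named health checks, logs every failure, and calls a recovery
    action once a check has failed `threshold` times in a row.
    (Hypothetical helper for illustration only.)"""

    def __init__(self, threshold: int, on_trip: Callable[[str], None]) -> None:
        self.threshold = threshold
        self.on_trip = on_trip
        self.checks: Dict[str, Callable[[], bool]] = {}
        self.failures: Dict[str, int] = {}

    def register(self, name: str, check: Callable[[], bool]) -> None:
        """Add a check; it should return True when the function works."""
        self.checks[name] = check
        self.failures[name] = 0

    def run_once(self) -> List[str]:
        """Run every check once and return the names that tripped."""
        tripped: List[str] = []
        for name, check in self.checks.items():
            try:
                ok = bool(check())
            except Exception:
                # A check that raises counts as a failure, and gets logged.
                log.exception("check %s raised", name)
                ok = False
            if ok:
                self.failures[name] = 0
                continue
            self.failures[name] += 1
            log.warning("check %s failed (%d in a row)",
                        name, self.failures[name])
            if self.failures[name] >= self.threshold:
                tripped.append(name)
                self.on_trip(name)       # e.g. restart a service or reboot
                self.failures[name] = 0  # start counting again afterwards
        return tripped
```

In use, you would register checks such as "can I resolve a name" or "does
the WAN interface pass traffic" and call run_once() from a timer. The
logged failures are the part that buys the visibility; the recovery action
is only the safety net.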

-Aaron