<div dir="ltr">On Fri, Feb 6, 2015 at 12:47 PM, Dave Taht <span dir="ltr"><<a href="mailto:dave.taht@gmail.com" target="_blank">dave.taht@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Sat, Feb 7, 2015 at 8:42 AM,  <<a href="mailto:Valdis.Kletnieks@vt.edu">Valdis.Kletnieks@vt.edu</a>> wrote:<br>

> On Fri, 06 Feb 2015 15:27:32 +1300, Dave Taht said:<br>

>> so, how's everybody's uptime?<br>

><br>

> Sitting at 27 days due to a power blip.<br>

<br>

</span>I do strongly feel that home routers should have a battery or supercap<br>

with at least 30 seconds lifetime. In Nica, the power flickered 6<br>

times a day, with half-day long outages every couple weeks (rolling<br>

blackouts during one phase being much worse). It was a glorious PITA<br>

to have to wait for everything in the tin cans and string connecting<br>

everything to reassociate and return to connectivity.<br>

<br>

In SF, I've seen it flicker, oh, a couple times, in the last 6 months,<br>

3 times long enough to force a reboot of everything.<br>

<br>

Batteries have got cheap, as has power conversion.<br></blockquote><div><br></div><div>Yet the market this hardware comes from is in a race to the bottom.  Sure, there are higher-end, more expensive devices, but they're not what most people buy (and tech trickles down from those tiers in odd ways).</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I guess the bigger philosophical questions I'm having re "uptime" are<br>

<br>

A) "how long is long enough" before natural factors like power<br>

failures start to dominate the uptime statistics?<br></blockquote><div><br></div><div>"long enough"?  I'd say a year of uptime, minimum: certainly long enough that firmware updates for security patches, features, or optimizations are the only _real_ source of reboots, aside from the odd cosmic-ray bit flip.  At my day job, our platform (based on router SoCs and running modified OpenWRT distributions) has gone a year without issue, though units usually get upgraded by the management backend long before that.</div><div><br></div><div>Also, it feels like once you get past a month or two of uptime, you're either "good forever" or nearly so.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">And there are multiple modes of failure: "up but not working right"<br>

is much worse than "reboot due to some self-diagnostic saying we're<br>

hosed somehow" or "reboot at 4AM because we installed an update and<br>

nobody was actively using the system".<br></blockquote><div><br></div><div>I've found that app-layer functional watchdogs make a great safety net.  They don't fix the bugs, but they give you visibility into the fact that the bugs exist (as long as you log and report every time a watchdog trips), so you know where to start looking.</div><div><br></div><div>-Aaron</div></div></div></div>
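<div>A minimal sketch of an app-layer functional watchdog, in Python and assuming nothing about the actual platform. Each named check is a hypothetical end-to-end probe (say, a DNS lookup through the router's own resolver) that returns True when healthy; a probe that raises counts as a failure. When a check fails <code>max_failures</code> times in a row, the watchdog logs the trip and appends it to <code>trips</code> before calling the recovery hook, so the bug stays visible even when recovery papers over it. <code>FunctionalWatchdog</code>, <code>run_once</code>, <code>recover</code>, and <code>trips</code> are illustrative names, not an existing API.</div>

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

log = logging.getLogger("watchdog")


@dataclass
class FunctionalWatchdog:
    """Hypothetical app-layer watchdog: runs end-to-end health checks and
    records every trip, so "up but not working right" becomes visible."""
    checks: Dict[str, Callable[[], bool]]  # name -> probe, True means healthy
    recover: Callable[[str], None]         # hypothetical recovery hook (restart service, reboot, ...)
    max_failures: int = 3                  # consecutive failures before tripping
    failures: Dict[str, int] = field(default_factory=dict)
    trips: List[str] = field(default_factory=list)  # history to report upstream

    def run_once(self) -> None:
        for name, probe in self.checks.items():
            try:
                healthy = bool(probe())
            except Exception:
                # A crashing probe is itself a symptom: count it as a failure.
                log.exception("check %s raised", name)
                healthy = False
            if healthy:
                self.failures[name] = 0
                continue
            self.failures[name] = self.failures.get(name, 0) + 1
            if self.failures[name] >= self.max_failures:
                # Record first: the log is the visibility, recovery is only the safety net.
                log.warning("watchdog tripped: %s failed %d times in a row",
                            name, self.failures[name])
                self.trips.append(name)
                self.failures[name] = 0
                self.recover(name)

    def run_forever(self, interval_s: float = 10.0) -> None:
        """Long-running loop; run_once() is the unit that tests can drive."""
        while True:
            self.run_once()
            time.sleep(interval_s)
```

<div>Requiring several consecutive failures before tripping is a design choice that keeps a single transient blip from triggering recovery; in a real deployment, <code>trips</code> (or the log lines) is what you would ship to a management backend.</div>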