From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ie0-f181.google.com (mail-ie0-f181.google.com [209.85.223.181])
 (using TLSv1 with cipher RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK))
 by huchra.bufferbloat.net (Postfix) with ESMTPS id 89A6821F1FE
 for ; Fri, 6 Feb 2015 13:01:43 -0800 (PST)
Received: by iecar1 with SMTP id ar1so4368593iec.13
 for ; Fri, 06 Feb 2015 13:01:43 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to
 :cc:content-type;
 bh=Qvontt7m0cpNNogfcA8y6OQBj3HvIdauTwvOOcdMOEs=;
 b=DCycEHq//iwLb70wtRdq7vTXShrFO/npdu+BV4fajMEzlsn07sE3mB4PRCtnmynQJa
 izMoj388NFdYD2Ua+LQGJeD/Mcv5fMxFRVknP72p/z5rXsI5Jsd1hbPx3UHfIkAroiSt
 6T0ulBjXdN+6jpoSa9I0hEHQMmbBUB+b1FvdhbU3+vTNMUy7C3ltJ6H7WknyiyJ9wCdm
 +x2/mV4jMYbr9aGR8ctoVBS/p/HPxLuS/E+7DXRx3XehkO6otah2yxgfAwGNYQaQzbOO
 SEwl7T5tyw5jDRQJBP82Zdz40kZF1kMRfB2Zd8MkiAer3tKzOmvGF5j45m66Wuz8jYdi
 G2UA==
MIME-Version: 1.0
X-Received: by 10.50.97.41 with SMTP id dx9mr4118920igb.1.1423256503275;
 Fri, 06 Feb 2015 13:01:43 -0800 (PST)
Received: by 10.64.142.42 with HTTP; Fri, 6 Feb 2015 13:01:43 -0800 (PST)
In-Reply-To:
References: <15132.1423251758@turing-police.cc.vt.edu>
Date: Fri, 6 Feb 2015 13:01:43 -0800
Message-ID:
From: Aaron Wood
To: Dave Taht
Content-Type: multipart/alternative; boundary=047d7b10cd534d5d43050e71bad3
Cc: "cerowrt-devel@lists.bufferbloat.net" , Jed Laundry
Subject: Re: [Cerowrt-devel] uptime?
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: Development issues regarding the cerowrt test router project
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 06 Feb 2015 21:02:12 -0000

--047d7b10cd534d5d43050e71bad3
Content-Type: text/plain; charset=UTF-8

On Fri, Feb 6, 2015 at 12:47 PM, Dave Taht <dave.taht@gmail.com> wrote:

> On Sat, Feb 7, 2015 at 8:42 AM, <Valdis.Kletnieks@vt.edu> wrote:
> > On Fri, 06 Feb 2015 15:27:32 +1300, Dave Taht said:
> >> so, how's everybody's uptime?
> >
> > Sitting at 27 days due to a power blip.
>
> I do strongly feel that home routers should have a battery or supercap
> with at least 30 seconds lifetime. In Nica, the power flickered 6
> times a day, with half-day long outages every couple weeks (rolling
> blackouts during one phase being much worse). It was a glorious PITA
> to have to wait for everything in the tin cans and string connecting
> everything to reassociate and return to connectivity,
>
> In SF, I've seen it flicker, oh, a couple times, in the last 6 months,
> 3 times long enough to force a reboot of everything.
>
> Batteries have got cheap, as has power conversion.
>

Yet the market this hardware comes from is actively chasing itself to the
bottom. Sure, there are higher-end, more expensive devices, but they're not
what typically gets purchased (and tech trickles down from those tiers in
odd ways).

> I guess a bigger philosophical questions I'm having re "uptime" are
>
> A) "how long is long enough" before natural factors like power
> failures start to dominate the uptime statistics?
>

"Long enough"? I'd say a year of uptime, minimum: certainly long enough
that firmware updates (security patches, features, optimizations) are the
only _real_ source of reboots, aside from the odd cosmic particle bit-flip.
At my day job, our platform (based on router SoCs and running modified
OpenWRT distributions) has gone for a year without issue, but units usually
get upgraded by the management backend long before that point.

Also, it feels like once you get past a month or two of uptime, you're
"good forever", or nearly so.

> And there are multiple modes of failure - "up but not working right"
> is much worse than "reboot due to some self diagnostic saying we're
> hosed somehow" or "reboot at 4AM because we installed an update and
> nobody was actively using the system".
>

I've found that app-layer functional watchdogs are a great way to build a
safety net. They don't fix the bugs, but they give you visibility into
their existence (so long as you log and report which watchdogs trigger), so
that you know where to start looking.

-Aaron

--047d7b10cd534d5d43050e71bad3
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--047d7b10cd534d5d43050e71bad3--
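
The app-layer functional watchdog idea from the reply can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the platform described in the message: the names `FunctionalCheck`, `Watchdog`, `probe`, `recover` and `report`, and the three-consecutive-failure threshold, are all hypothetical. The point it tries to show is that a triggering watchdog gets logged and reported before any recovery action runs, so the evidence survives the restart.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, List

log = logging.getLogger("functional_watchdog")


@dataclass
class FunctionalCheck:
    """One end-to-end probe of a user-visible function (hypothetical shape)."""
    name: str
    probe: Callable[[], bool]      # returns True when the function still works
    recover: Callable[[], None]    # assumed recovery hook, e.g. restart a service
    max_failures: int = 3          # consecutive failures tolerated (assumption)
    failures: int = 0


@dataclass
class Watchdog:
    checks: List[FunctionalCheck]
    report: Callable[[str, int], None]   # assumed hook to a management backend
    triggered: List[str] = field(default_factory=list)

    def run_once(self) -> None:
        """Run every probe once and escalate checks that keep failing."""
        for check in self.checks:
            try:
                ok = check.probe()
            except Exception:
                # A crashing probe counts as a failed check.
                log.exception("probe %s raised", check.name)
                ok = False
            if ok:
                check.failures = 0
                continue
            check.failures += 1
            log.warning("check %s failed (%d/%d)",
                        check.name, check.failures, check.max_failures)
            if check.failures >= check.max_failures:
                # Record and report first, then recover.
                self.triggered.append(check.name)
                self.report(check.name, check.failures)
                check.recover()
                check.failures = 0

    def run_forever(self, interval_s: float = 30.0) -> None:
        """Poll all checks at a fixed interval (30 s is an arbitrary choice)."""
        while True:
            self.run_once()
            time.sleep(interval_s)
```

A probe here would be something functional rather than "is the process alive", for example resolving a name through the router's own DNS forwarder, which is what lets it catch the "up but not working right" case.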