From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <woody77@gmail.com>
Received: from mail-ie0-f181.google.com (mail-ie0-f181.google.com
	[209.85.223.181]) (using TLSv1 with cipher RC4-SHA (128/128 bits))
	(Client CN "smtp.gmail.com",
	Issuer "Google Internet Authority G2" (verified OK))
	by huchra.bufferbloat.net (Postfix) with ESMTPS id 89A6821F1FE
	for <cerowrt-devel@lists.bufferbloat.net>;
	Fri,  6 Feb 2015 13:01:43 -0800 (PST)
Received: by iecar1 with SMTP id ar1so4368593iec.13
	for <cerowrt-devel@lists.bufferbloat.net>;
	Fri, 06 Feb 2015 13:01:43 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=Qvontt7m0cpNNogfcA8y6OQBj3HvIdauTwvOOcdMOEs=;
	b=DCycEHq//iwLb70wtRdq7vTXShrFO/npdu+BV4fajMEzlsn07sE3mB4PRCtnmynQJa
	izMoj388NFdYD2Ua+LQGJeD/Mcv5fMxFRVknP72p/z5rXsI5Jsd1hbPx3UHfIkAroiSt
	6T0ulBjXdN+6jpoSa9I0hEHQMmbBUB+b1FvdhbU3+vTNMUy7C3ltJ6H7WknyiyJ9wCdm
	+x2/mV4jMYbr9aGR8ctoVBS/p/HPxLuS/E+7DXRx3XehkO6otah2yxgfAwGNYQaQzbOO
	SEwl7T5tyw5jDRQJBP82Zdz40kZF1kMRfB2Zd8MkiAer3tKzOmvGF5j45m66Wuz8jYdi
	G2UA==
MIME-Version: 1.0
X-Received: by 10.50.97.41 with SMTP id dx9mr4118920igb.1.1423256503275; Fri,
	06 Feb 2015 13:01:43 -0800 (PST)
Received: by 10.64.142.42 with HTTP; Fri, 6 Feb 2015 13:01:43 -0800 (PST)
In-Reply-To: <CAA93jw5qh8wQi+jMbfVbBd1-bfgHiBjwswoBaTvtz9s2qMwVig@mail.gmail.com>
References: <CAA93jw46UL9=Qoo=FgKROxWwwjvcpc=wjdXS8fSA7GHGj1sFng@mail.gmail.com>
	<15132.1423251758@turing-police.cc.vt.edu>
	<CAA93jw5qh8wQi+jMbfVbBd1-bfgHiBjwswoBaTvtz9s2qMwVig@mail.gmail.com>
Date: Fri, 6 Feb 2015 13:01:43 -0800
Message-ID: <CALQXh-MSHpD1iDv_Ae95AA-k6O9k+G7S=izknS-oTgJvfo8b+A@mail.gmail.com>
From: Aaron Wood <woody77@gmail.com>
To: Dave Taht <dave.taht@gmail.com>
Content-Type: multipart/alternative; boundary=047d7b10cd534d5d43050e71bad3
Cc: "cerowrt-devel@lists.bufferbloat.net"
	<cerowrt-devel@lists.bufferbloat.net>, Jed Laundry <jlaundry@jlaundry.com>
Subject: Re: [Cerowrt-devel] uptime?
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: Development issues regarding the cerowrt test router project
	<cerowrt-devel.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/cerowrt-devel>,
	<mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/cerowrt-devel>
List-Post: <mailto:cerowrt-devel@lists.bufferbloat.net>
List-Help: <mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/cerowrt-devel>,
	<mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Fri, 06 Feb 2015 21:02:12 -0000

--047d7b10cd534d5d43050e71bad3
Content-Type: text/plain; charset=UTF-8

On Fri, Feb 6, 2015 at 12:47 PM, Dave Taht <dave.taht@gmail.com> wrote:

> On Sat, Feb 7, 2015 at 8:42 AM,  <Valdis.Kletnieks@vt.edu> wrote:
> > On Fri, 06 Feb 2015 15:27:32 +1300, Dave Taht said:
> >> so, how's everybody's uptime?
> >
> > Sitting at 27 days due to a power blip.
>
> I do strongly feel that home routers should have a battery or supercap
> with at least 30 seconds lifetime. In Nica, the power flickered 6
> times a day, with half-day long outages every couple weeks (rolling
> blackouts during one phase being much worse). It was a glorious PITA
> to have to wait for everything in the tin cans and string connecting
> everything to reassociate and return to connectivity,
>
> In SF, I've seen it flicker, oh, a couple times, in the last 6 months,
> 3 times long enough to force a reboot of everything.
>
> Batteries have got cheap, as has power conversion.
>

Yet the market this hardware comes from is busy racing itself to the
bottom.  Sure, there are higher-end, more expensive devices, but they're
not what most people buy (and tech trickles down from those tiers in odd
ways).


> I guess a bigger philosophical questions I'm having re "uptime" are
>
> A) "how long is long enough" before natural factors like power
> failures start to dominate the uptime statistics?
>

"long enough"?  I'd say a year of uptime, minimum, certainly to the point
that fw updates to deal with security patches/features/optimizations are
the only _real_ source of reboot, aside from the odd cosmic particle
bit-flip.  At my day job, devices on our platform (based on router SoCs
and running modified OpenWRT distributions) have gone a year without
issue, but usually the management backend upgrades them long before they
get that far.

Also, it feels like once you get out past a month or two, you're either
"good forever", or nearly so.


> And there are multiple modes of failure - "up but not working right"
> is much worse than "reboot due to some self diagnostic saying we're
> hosed somehow" or "reboot at 4AM because we installed an update and
> nobody was actively using the system".
>

I've found that app-layer functional watchdogs are a great way to build a
safety net.  They don't fix the bugs, but they give you visibility into
the existence of the bugs (so long as you log and report which watchdogs
trigger), so that you know where to start looking.
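
As a rough illustration only, an app-layer functional watchdog could look
something like the Python sketch below.  Every name in it
(FunctionalWatchdog, register, poll, max_failures) is hypothetical and
made up for this example; it is not taken from any real product or
library.  The idea is just: run named functional checks, log each failure,
and report which checks have failed several times in a row.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("watchdog")


class FunctionalWatchdog:
    """Runs named health checks and tracks consecutive failures.

    Hypothetical illustration, not a real API.
    """

    def __init__(self, max_failures: int = 3) -> None:
        # Number of consecutive failures before a check counts as tripped.
        self.max_failures = max_failures
        self.checks: Dict[str, Callable[[], bool]] = {}
        self.failures: Dict[str, int] = {}

    def register(self, name: str, check: Callable[[], bool]) -> None:
        """Add a check; it should return True when the function works."""
        self.checks[name] = check
        self.failures[name] = 0

    def poll(self) -> List[str]:
        """Run every check once and return the names that have tripped."""
        tripped: List[str] = []
        for name, check in self.checks.items():
            try:
                ok = bool(check())
            except Exception:
                # A check that raises is treated as a failed check.
                log.exception("watchdog check %r raised", name)
                ok = False
            if ok:
                self.failures[name] = 0
                continue
            self.failures[name] += 1
            # Logging every failure is what buys the visibility.
            log.warning("watchdog check %r failed (%d/%d)",
                        name, self.failures[name], self.max_failures)
            if self.failures[name] >= self.max_failures:
                tripped.append(name)
        return tripped
```

A caller would presumably invoke poll() on a timer, report the returned
names to whatever backend collects logs, and only then decide whether to
restart a service or reboot.  Keeping the recovery action outside the
class is a design choice of this sketch, so the reporting happens whatever
the recovery policy is.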

-Aaron

--047d7b10cd534d5d43050e71bad3--