From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp191.iad.emailsrvr.com (smtp191.iad.emailsrvr.com [207.97.245.191])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by huchra.bufferbloat.net (Postfix) with ESMTPS id 0CBBF20061E
	for ; Sun, 8 Apr 2012 18:57:23 -0700 (PDT)
Received: from localhost (localhost.localdomain [127.0.0.1])
	by smtp39.relay.iad1a.emailsrvr.com (SMTP Server) with ESMTP id BC36D98A32;
	Sun, 8 Apr 2012 21:57:21 -0400 (EDT)
X-Virus-Scanned: OK
Received: from legacy15.wa-web.iad1a (legacy15.wa-web.iad1a.rsapps.net [192.168.4.105])
	by smtp39.relay.iad1a.emailsrvr.com (SMTP Server) with ESMTP id 9CBF698A22;
	Sun, 8 Apr 2012 21:57:21 -0400 (EDT)
Received: from reed.com (localhost.localdomain [127.0.0.1])
	by legacy15.wa-web.iad1a (Postfix) with ESMTP id 8F614408001;
	Sun, 8 Apr 2012 21:57:21 -0400 (EDT)
Received: by apps.rackspace.com
	(Authenticated sender: dpreed@reed.com, from: dpreed@reed.com)
	with HTTP; Sun, 8 Apr 2012 21:57:21 -0400 (EDT)
Date: Sun, 8 Apr 2012 21:57:21 -0400 (EDT)
From: dpreed@reed.com
To: "Dave Taht"
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_20120408215721000000_58253"
Importance: Normal
X-Priority: 3 (Normal)
X-Type: html
In-Reply-To:
References: <1333679627.997611294@apps.rackspace.com>
	<1333685372.501325169@apps.rackspace.com>
Message-ID: <1333936641.5869838@apps.rackspace.com>
X-Mailer: webmail7.0
X-Mailman-Approved-At: Sun, 08 Apr 2012 19:20:29 -0700
Cc: cerowrt-devel@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] Cero-state this week and last
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: Development issues regarding the cerowrt test router project
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 09 Apr 2012 01:57:23 -0000

------=_20120408215721000000_58253
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Thanks for the incredibly thoughtful response. I get the "packager" issue. It really compounds the problem if the upstream folks don't bother to focus on quality at the level one needs, coupled with the goals of the upstream folks being different than the packager.

Regarding Bob Taylor's resources... I worked closely with Bob and his various team members in a number of dimensions, including consulting for him. You are right that CeroWRT does not have that kind of resource. (I have been trying hard in other venues relating to radio issues that are really important to me to find a way to assemble a coordinated set of resources at such a scale, and I've failed so far. Still trying.)

However, I think that the technical issue one could work on in this respect is a way to create high-level system tests of routing functionality and performance that would be independent of hardware configuration and also capable of creating a network environment that would avoid regression. Jay Lepreau created a very nice platform framework at Utah that the network "innovation" community might be able to "copy" (where virtualized "networks" could be configured and tested). I'd be happy personally, for example, to provide some resources on my various home networks (the one in my home, and the ones in my "cloud" instances) to run "system tests" on new releases of CeroWRT and other systems - *if* it was run in a way that did not disrupt my other work, using a bounded percentage of capacity and devices.

I participated in the PlanetLab project that HP and Intel supported, coordinated by Princeton and others.

This was a model for a kind of "co-op" that incorporated networked resources.

We need to create a generic networking innovation framework that is *independent* of ISOC, IETF, Verizon, Cisco, ATT, Alcatel-Lucent, etc.
Those guys may *help* but they should not be able to block experimentation or innovation (which was the point of PlanetLab).

-----Original Message-----
From: "Dave Taht" <dave.taht@gmail.com>
Sent: Sunday, April 8, 2012 1:31pm
To: dpreed@reed.com
Cc: cerowrt-devel@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] Cero-state this week and last

On Thu, Apr 5, 2012 at 9:09 PM, <dpreed@reed.com> wrote:
> I understand this. In the end of the day, however, *regression tests*
> matter, as well as tests to verify that new functionality actually works.
>

The problems with developing global test suites, particularly when
dealing with embedded hardware, are manifold. I certainly would like to
have a full test suite that I could run on any router rather than the
ad-hoc collection of tests I run now.

One of the problems we have is that we are testing for new problems,
and by definition, you don't know what those are, and after you fix
one, you need to develop a viable test.

I can think of hundreds of things fixed in the past year that I'd like
to test for, not just on this hardware, or this software, but over the
internet, E2E.

Example: in June of last year, there was a 10 year old bug in how ECN
enabled packets were prioritized by the *default* pfifo_fast qdisc.
I'll argue that this has skewed every study published about it in the
last decade as well, and all that data and academic papers need to be
reanalyzed - or preferably, thrown out, and we need to start over.

The harder problems than writing the tests are:

1) Which problems are important? What tests are valid? What is repeatable?
2) Who will write the test?
3) How can the test be deployed?
4) How often does it need to run?
5) How can the data be analyzed?
6) How does that work get paid for?

I am happy that I have ONE ECN related study to rely on from Steve
Bauer at MIT.
I'd love to have more, and one of my other projects (thumbgps) is going
to give us a baseline for investigating a bunch of similar issues. I'm
happy to have a solution emerging to problems 3 and 4 above. 1, 2, 5
and 6 remain unsolved.

> I've managed projects with 200 daily "committers". Unless those committers
> get immediate feedback on what they break (accidentally) and design the
> tests for their new functionality so that others don't break what they
> carefully craft, projects go south and never recover.

If I need to establish cred here, I've been a part of projects with
far more committers and managed up to about 70 staff myself.

When working with the open source community, which is mostly
volunteers, there is no way to be dictatorial. Consensus needs to be
sought. Needed stuff that nobody wants to do has to get paid for.

The linux kernel - which is at best, 1/10th of the overall code base
in openwrt - goes through about 10,000 changesets every quarter. The
3.2 development cycle had nearly 1300 committers.

http://lwn.net/Articles/472852/

They all have their own processes for quality control, and somehow
manage to produce a usable system on a reliable basis.

And that's just the kernel portion of the problem!

A better word for the kind of work that goes on in developing an OS
like a redhat or openwrt is 'packaging'. You don't actually have
conventional 'developers'; packagers are a rather different category of
developer, skilled in make, cross-compilation techniques, and many
have a good familiarity with architecture-level issues (like
endianness and bit size and innards of a given architecture like arm
or mips). They are capable of basic coding in dozens of languages, all
on the same day, but are only highly skilled in one or two, at best.

In this way they tend to be something of a hybrid between sysadmin and
developer.
And they do care, a lot, about the quality of the engineering, and try
to push patches back to the developer, and make sysadmins' (and users')
lives easier.

Packagers do have standards for quality control, and do test on their
own platforms, but as the potential test matrix has several thousand
permutations, they need a tool like a buildbot that tries to give at
least some wider coverage to the most common combinations, and
leverages technology to give them adequate feedback to iteratively get
it right.

Like anything else, it can always be done better. In an ideal world, a
packager could do a test commit, get it built against all those
permutations, have it run for 24 hours, exhaustively checking all the
in-built functionality, have its memory and cpu use analyzed, and get
the result back a few seconds later. Aside from needing Dr Who to help
out on parts of that, it's hard, expensive, and something of a tragedy
of the commons - somebody has to pay for all that infrastructure,
electricity, and testing for everyone to benefit.

The gradually improving recent buildbot-disaster is a case-in-point.
It was working. A bunch of machines died over time. Nobody was paying
attention. It became a disaster. We fixed it with baling wire, scotch
tape and stolen resources.

I'd certainly love it if we had budget to where doing the logical
thing - build everything, all the time - was practical. Same goes for
developing regression tests, having racks of hardware that can be
reburnt and re-tested every day.

This sort of pattern repeats in nearly every low budget project
(volunteer or corporate sponsored), but unfortunately Elon doesn't
hang out with us, and isn't going to fly in with the liquid oxygen.

As for regression testing, regressions against what?
(the answer is too large to fit in the margins of this email)

Certainly multiple companies make wireless test suites (one has been
actively helping out, actually), there are dozens of benchmark suites,
there are zillions of subcomponent tests...

and in this market, razor thin margins on the vendors' side, as well as
the ISPs'. Now, I like to think that our governments and society are
waking up to the chaos that can ensue if the internet goes down, corps
are realizing that ipv4 can't last forever and ipv6 has to be made
deployable e2e, and maybe there's a shift in thinking that making the
Internet just work is a civil engineering job that *has* to be done
right ( http://esr.ibiblio.org/?p=4213 http://esr.ibiblio.org/?p=4196 )...

but at the end of the day we just have to do the best engineering we
can with the resources available.

> You don't have that rate of committers here, but it's not really an excuse
> to say -

Well, in some ways we do. Adding in a new kernel requires depending on
a multitude of other people having got it right.
Same goes for the other thousand packages.

It has taken a year and a ton of effort (from multiple volunteers) to
get from where the cerowrt kernel lagged the mainline kernel by 3
versions, down to where it is only going to lag by 1. That effort was
necessary if we wanted to be able to do work on both x86 and a
router simultaneously while investigating bufferbloat, security, and
ipv6, and be able to move forward (and back and forth) with a minimum
of backporting.
That portion of the effort has eaten more of my time this year than I
care to think about.

At the time we started hacking on cerowrt, most commercial embedded
products were based on 5 year old kernels, or older, due to how
difficult it is to track the mainline, and a perceived lack of demand
from consumers for new stuff, despite the ISPs' increasing frustration
with what's being shipped today not meeting their needs or
expectations.

We are trying to change that - in part by listening to the screams of
ISPs like comcast - but also by trying out new technologies such as
fixes for bufferbloat, ipv6, radical concepts like ccnx and openhip -
to be geek and early adopter attractors - to get more of the needed
work done.

Still, an effort well beyond the original scope of the "wide" project
seems needed to get ipv6 rolled out. The theoretical breakthroughs
required to fix bufferbloat seem almost trivial in comparison.

> "we have to jam in code without testing it because we don't have a
> discipline of testing and it's a waste of time".

It's a matter of having enough distributed testing.

> 50% of what a developer should be doing (if not more) is making sure that
> they don't break more than they improve.

So try 'packaging' rather than developing, and wrap your head around
the test matrix problem.

>
>
>
> I realize this is tough, not fun, and sometimes very frustrating. But cool
> "new stuff" is far less important than keeping stuff stable.

This is a classic tension.
I note that we're trying to fix the internet here, before *its*
stability goes away.

So a great deal of change and r&d is needed, and yes, it needs to be
managed well, but stability only qualifies as a goal in limited ways.

> I'm not trying to be negative - this is stuff I learned at huge personal
> cost in very high stress environments where people were literally screaming
> at me every hour of every day.

I have been in those too. I would say that the amount of stress I've
put myself under, trying to ship something by the end of this month,
compares closely. Personally I would like to offload about 95%
of what I currently do, so I could focus on what's truly important.
I'm glad we have more and more volunteers, self identifying problems,
leaping forward and going out on their own, to go fix them.

Still the seat I wish I was sitting in now, with resources I wish I
could command, is Bob Taylor's, circa 1968 or so.

http://en.wikipedia.org/wiki/Robert_Taylor_%28computer_scientist%29

He's always been a real inspiration to me.

> The cerowrt/bufferbloat stuff is worth doing, and it's worth doing right -
> I'm a fan.

THX!

>
>
> -----Original Message-----
> From: "Dave Taht"
> Sent: Thursday, April 5, 2012 10:50pm
> To: dpreed@reed.com
> Cc: cerowrt-devel@lists.bufferbloat.net
> Subject: Re: [Cerowrt-devel] Cero-state this week and last
>
> On Thu, Apr 5, 2012 at 7:33 PM, wrote:
>> A small suggestion.
>>
>>
>>
>> Create a regression test suite, and require contributors to *pass* the
>> test with each submitted patch set.
>
> A linear complete build of openwrt takes 17 hours on good hardware.
> It's hard to build in parallel.
>
> A parallel full build is about 3 hours but requires a bit of monitoring
>
> Incremental package builds are measured in minutes, however...
>
>> Be damned politically
>> incorrect about checkins that don't meet this criterion - eliminate
>> the right to check in code for anyone who contributes something that
>> breaks functionality.
>
> The number of core committers is quite low, too low, at present.
> However the key problem here is that the matrix of potential breakage
> is far larger than any one contributor can deal with.
>
> There are:
>
> 20+ fairly different cpu architectures *
> 150+ platforms *
> 3 different libcs *
> 3 different (generation) toolchains *
> 5-6 different kernels
>
> That matrix alone is hardly conceivable to deal with. In there are
> arches that are genuinely weird (avr anyone), arches that have
> arbitrary endian, arches that are 32 bit and 64 bit...
>
> Add in well over a thousand software packages (everything from Apache
> to zile), and you have an idea of how much code has dependencies on
> other code...
>
> For example, the breakage yesterday (or was it the day before) was in
> a minor update to libtool, as best as I recall. It broke 3 packages
> that cerowrt has available as options.
>
> I'm looking forward, very much, to seeing the buildbot produce a
> known, good build, that I can layer my mere 67 patches and two dozen
> packages on top of without having to think too much.
>
>> Every project leader discovers this.
>
> Cerowrt is an incredibly tiny superset of the openwrt project. I help
> out where I can.
>
>> Programmers are *lazy* and refuse to
>> check their inputs unless you shame them into compliance.
>
> Volunteer programmers are not lazy.
>
> They do, however, have limited resources, and prefer to make progress
> rather than make things perfect. Difficult-to-pass check-in tests
> impede progress.
>
> The fact that you or I can build an entire OS, in a matter of hours,
> today, and have it work, most often befuddles me.
> This is 10s of millions of lines of code, all perfect, most of the time.
>
> It used to take 500+ people to engineer an os in 1992, and 4 days to
> build. I consider this progress.
>
> There are all sorts of processes in place, some can certainly be
> improved. For example, discussed last week were methods for dealing
> with and approving the backlog of submitted patches by other
> volunteers.
>
> It mostly just needs more eyeballs. And testing. There's a lot of good
> stuff piled up.
>
> http://patchwork.openwrt.org/project/openwrt/list/
>>
>>
>>
>> -----Original Message-----
>> From: "Dave Taht"
>> Sent: Thursday, April 5, 2012 10:27pm
>> To: cerowrt-devel@lists.bufferbloat.net
>> Subject: [Cerowrt-devel] Cero-state this week and last
>>
>> I attended the ietf conference in Paris (virtually), particularly ccrg
>> and homenet.
>>
>> I do encourage folk to pay attention to homenet if possible, as laying
>> out what home networks will look like in the next 10 years is proving
>> to be a hairball.
>> ccrg was productive.
>>
>> Some news:
>>
>> I have been spending time fixing some infrastructural problems.
>>
>> 1) After being blindsided by more continuous integration problems in
>> the last month than in the last 5, I found out that one of the root
>> causes was that the openwrt build cluster had declined in size from 8
>> boxes to 1(!!), and time between successful automated builds was in
>> some cases over a month.
>>
>> The risk of going from 1 to 0 build slaves seemed untenable. So I sprang
>> into action, scammed two boxes and Travis has tossed them into the
>> cluster.
>> Someone else volunteered a box.
>>
>> I am a huge proponent of continuous integration on complex projects.
>> http://en.wikipedia.org/wiki/Continuous_integration
>>
>> Building all the components of an OS like openwrt correctly, all the
>> time, with the dozens of developers involved, with a minimum delta
>> between commit, breakage, and fix, is really key to simplifying the
>> relatively simple task we face in bufferbloat.net of merely layering
>> on components and fixes improving the state of the art in networking.
>>
>> The tgrid is still looking quite bad at the moment.
>>
>> http://buildbot.openwrt.org:8010/tgrid
>>
>> There's still a huge backlog of breakage.
>>
>> But I hope it gets better. Certainly building a full cluster of build
>> boxes or vms (openwrt@HOME!!) would help a lot more.
>>
>> If anyone would like to help hardware wise, or learn more about how to
>> manage a build cluster using buildbot, please contact Travis
>>
>>
>> 2) Bloatlab #1 has been completely rewired and rebuilt and most of
>> the routers in there reflashed to Cerowrt-3.3.1-2 or later. They
>> survived some serious network abuse over the last couple days
>> (ironically the only router that crashed was the last rc6 box I had in
>> the mix - and not due to a network fault! I ran it out of flash with a
>> logging tool).
>>
>> To deal with the complexity in there (there's also a sub-lab for some
>> sdnat and PCP testing), I ended up with a new ipv6 /48 and some better
>> ways to route that I'll write up soon.
>>
>> 3) I did finally get back to fully working builds for the ar71xx
>> (cerowrt) architecture a few days ago. I also have a working 3.3.1
>> kernel for the x86_64 build I use to test the server side.
>> (bufferbloat is NOT just a router problem. Fixing all sides of a
>> connection helps a lot).
>> That + a new iproute2 + the debloat script
>> and YOU TOO can experience orders of magnitude less latency....
>>
>> http://europa.lab.bufferbloat.net/debloat/ has that 3.3.1 kernel for
>> x86_64
>>
>> Most of the past week has been backwards rather than forwards, but it
>> was negative in a good way, mostly.
>>
>> I'm sorry it's been three weeks without a viable build for others to test.
>>
>> 4) today's build: http://huchra.bufferbloat.net/~cero1/3.3/3.3.1-4/
>>
>> + Linux 3.3.1 (this is missing the sfq patch I liked, but it's good
>> enough)
>> + Working wifi is back
>> + No more fiddling with ethtool tx rings (up to 64 from 2. BQL does
>> this job better)
>> + TCP CUBIC is now the default (no longer westwood)
>> after 15+ years of misplaced faith in delay based tcp for wireless,
>> I've collected enough data to convince me that cubic wins. All the
>> time.
>> + alttcp enabled (making it easy to switch)
>> + latest netperf from svn (yea! remotely changeable diffserv settings
>> for a test tool!)
>>
>> - still horrible dependencies on time. You pretty much have to get on
>> it and do a rndc validation disable multiple times, restart ntp
>> multiple times, killall named multiple times to get anywhere if you
>> want to get dns inside of 10 minutes.
>>
>> At this point sometimes I just turn off named in /etc/xinetd.d/named
>> and turn on port 53 for dnsmasq... but
>> usually after flashing it the first time, wait 10 minutes (let it
>> clean flash), reboot, wait another 10, then it works. Drives me
>> crazy... Once it's up and has valid time and is working, dnssec works
>> great but....
>>
>> + way cool new stuff in dnsmasq for ra and AAAA records
>> - huge dependency on keeping bind in there
>> - aqm-scripts. I have not succeeded in making hfsc work right. Period.
>> + HTB (vs hfsc) is proving far more tractable.
>> SFQRED is scaling better than I'd dreamed. Maybe Eric dreamed this
>> big, I didn't.
>> - http://www.bufferbloat.net/issues/352
>> + Added some essential randomness back into the entropy pool
>> - hostapd really acts up at high rates with the hack in there for more
>> entropy (from the openwrt mainline)
>> + named caching the roots idea discarded in favor of classic '.'
>>
>>
>> --
>> Dave Täht
>> SKYPE: davetaht
>> US Tel: 1-239-829-5608
>> http://www.bufferbloat.net
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
>
>
> --
> Dave Täht
> SKYPE: davetaht
> US Tel: 1-239-829-5608
> http://www.bufferbloat.net



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://www.bufferbloat.net

------=_20120408215721000000_58253
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Thanks for= the incredibly thoughtful response.   I get the "packager" issue= .   It really compounds the problem if the upstream folks don't b= other to focus on quality at the level one needs, coupled with the goals of= the upstream folks being different than the packager.

=0A

 

=0A

Regarding = Bob Taylor's resources... I worked closely with Bob and his various team me= mbers in a number of dimensions, including consulting for him.  You ar= e right that CeroWRT does not have that kind of resource.  (I have bee= n trying hard in other venues relating to radio issues that are really impo= rtant to me to find a way to assemble a coordinated set of resources at suc= h a scale, and I've failed so far. Still trying.)

=0A

 

=0A

However, I thin= k that the technical issue one could work on in this respect is a way to cr= eate high-level system tests of routing functionality and performance that = would be independent of hardware configuration and also capable of creating= a network environment that would avoid regression.   Jay Lepreau= created a very nice platform framework at Utah that the network "innovatio= n" community might be able to "copy" (where virtualized "networks" could be= configured and tested).  I'd be happy personally, for example, to pro= vide some resources on my various home networks (the one in my home, and th= e ones in my "cloud" instances) to run "system tests" on new releases of Ce= roWRT and other systems - *if* it was run in a way that did not disrupt my = other work, using a bounded percentage of capacity and devices.

=0A

 

=0A

I= participated in the PlanetLab project that HP and Intel supported, coordin= ated by Princeton and others.

=0A

 =

=0A

This was a model for a kind of "co-= op" that incorporated networked resources.

=0A

 

=0A

We need to create a ge= neric networking innovation framework that is *independent* of ISOC, IETF, = Verizon, Cisco, ATT, Alcatel-Lucent, etc.   Those guys may *help*= but they should not be able to block experimentation or innovation (which = was the point of PlanetLab).

=0A

 <= /p>=0A

-----Original Message-----
From= : "Dave Taht" <dave.taht@gmail.com>
Sent: Sunday, April 8, 2012 = 1:31pm
To: dpreed@reed.com
Cc: cerowrt-devel@lists.bufferbloat.ne= t
Subject: Re: [Cerowrt-devel] Cero-state this week and last

=0A
=0A

On Thu, Apr 5, 2012 at 9:09 PM, <dpreed@reed.com> wrote:
>= ; I understand this.  In the end of the day, however, *regression test= s*
> matter, as well as tests to verify that new functionality actu= ally works.
>

The problems with developing global test s= uites, particularly when dealing with
embedded hardware, are manyfold.= I certainly would like to have a full
test suite
that I could ru= n on any router rather than the ad-hoc collection of
tests I run now.<= br />
One of the problems we have is that we are testing for new probl= ems,
and by definition,
you don't know what those are, and after = you fix it, you need to
develop a viable test.

I can think = hundreds of things fixed in the past year that I'd like to
test for, n= ot just on this
hardware, or this software, but over the internet, E2E= .

Example: in june of last year, there was a 10 year old bug in = how ECN
enabled packets
were prioritized by the *default* pfifo_f= ast qdisc. I'll argue that
this has skewed every study
published = about it in the last decade as well, and all that data and
academic pa= pers need
to be reanalyzed - or preferably, thrown out, and we need to= start over.

The harder problem than writing the tests are:

1) Which problems are important? What tests are valid? What is repea= table?
2) Who will write the test?
3) How can the test be deploye= d?
4) How often does it need to run?
5) How can the data be analy= zed?
6) How does that work get paid for?

I am happy that I = have ONE ECN related study to rely on from steve
bauer at MIT. I'd lov= e to have more, and one of my other projects
(thumbgps) is going to gi= ve us a baseline for investigating a bunch of
similar issues. I'm happ= y to have a solution emerging to problems 3
and 4 above. 1,2,5 and 6 r= emain unsolved.

> I've managed projects with 200 daily "commi= tters".  Unless those committers
> get immediate feedback on w= hat they break (accidentally) and design the
> tests for their new = functionality so that others don't break what they
> carefully craf= t, projects go south and never recover.

If I need to establish c= red here, I've been a part of projects with
far more committers and ma= naged up to about 70 staff myself.

When working with the open so= urce community, which is mostly
volunteers, there is no way to be dict= atorial. Consensus needs to
sought. Needed stuff that nobody wants to = do has to get paid for.

The linux kernel - which is at best, 1/1= 0th of the overall code base
in openwrt - goes through about 10,000 ch= angesets every quarter. The
3.2 development cycle had nearly 1300 comm= itters.

http://lwn.net/Articles/472852/

They all have= their own processes for quality control, and somehow
manage to produc= e a usable system on a reliable basis.

And that's just the kerne= l portion of the problem!

A better word for the kind of work tha= t goes on in developing an OS
like a redhat or openwrt is 'packaging'.= You don't actually have
conventional 'developers', packagers are rath= er different catagory of
developer, skilled in make, cross-compilation= techniques, and many
have a good familiarity with architecture-level = issues (like
endianness and bit size and innards of a given architectu= re like arm
or mips). They are capable of basic coding in dozens of la= nguages, all
on the same day, but are only highly skilled in one or tw= o, at best.

in this way they tend to be something of a hybrid be= tween sysadmin and
developer. And they do care, a lot, about the quali= ty of the
engineering, and try to push patches back to the developer, = and make
sysadmins (and users) lives easier.

Packagers do h= ave standards for quality control, and do test on their
own platforms,= but: as the potential test matrix has
several thousand permutations, = having a tool like a buildbot that
tries to give at least some wider = coverage to the most common
combinations, and leverages technology to = give them adaquate feedback
to iteratively get it right.

Li= ke anything else, it can always be done better. In an ideal world, a
p= ackager could do a test commit, get it built against all those
permuta= tions, have run for 24 hours, exaustively checking all the
in-built fu= nctionality, have it's memory and cpu use analyzed, and get
the result= back a few seconds later. Aside from needing Dr Who to help
out on pa= rts of that, it's hard, expensive, and something of a tragedy
of the c= ommons - somebody has to pay for all that infrastructure,
electricity,= and testing for everyone to benefit.

The gradually improving re= cent buildbot-disaster is a case-in-point.
It was working. A bunch of = machines died over time. Nobody was paying
attention. It became a disa= ster. We fixed it with bailing wire, scotch
tape and stolen resources.=

I'd certainly love it if we had budget to where doing the logic= al
thing - build everything, all the time - was practical. Same goes f= or
developing regression tests, having racks of hardware that can bereburnt and re-tested every day,

This sort of pattern repeat= s in nearly every low budget project
(volunteer or corporate sponsored= ), but unfortunately Elon doesn't
hang out with us, and isn't going to= fly in with the liquid oxygen.

As for regression testing, regre= ssions against what? (the answer is
too large to fit in the margins of= this email)

Certainly multiple companies make wireless test sui= tes (one has been
actively helping out, actually), there are dozens of= benchmark suites,
there are zillions of subcomponent tests...
and in this market, razor thin margins on the vendors side, as well as<= br />the ISPs. Now, I like to think that our governments and society arewaking up to the chaos that can ensue if the internet goes down, corpsare realizing that ipv4 can't last forever and ipv6 has to be made
deployable e2e, and maybe there's a shift in thinking that making the
Internet just work is a civil engineering job that *has* to be done
r= ight ( http://esr.ibiblio.org/?p=3D4213 http://esr.ibiblio.org/?p=3D4196)...

but at the end of the day we just have to do the best engineering we
can with the resources available.

> You don't have that rate of committers here, but it's not really an excuse
> to say -

Well, in some ways we do. Adding in a new kernel requires depending on
a multitude of other people having got it right.
Same goes for the other thousand packages.

It has taken a year and a ton of effort (from multiple volunteers) to
get from where the cerowrt kernel lagged the mainline kernel by 3
versions, down to where it is only going to lag by 1. That effort was
necessary if we wanted to be able to do work on both x86 and a
router simultaneously while investigating bufferbloat, security, and
ipv6, and be able to move forward (and back and forth) with a minimum
of backporting. That portion of the effort has eaten more of my time
this year than I care to think about.

At the time we started hacking on cerowrt, most commercial embedded
products were based on 5 year old kernels, or older, due to how
difficult it is to track the mainline, and a perceived lack of demand
from consumers for new stuff, despite the ISPs' increasing frustration
with what's being shipped today not meeting their needs or
expectations.

We are trying to change that - in part by listening to the screams of
ISPs like comcast - but also by trying out new technologies such as
fixes for bufferbloat, ipv6, radical concepts like ccnx and openhip -
to be geek and early adopter attractors - to get more of the needed
work done.

Still, an effort well beyond the original scope of the "wide" project
seems needed to get ipv6 rolled out. The theoretical breakthroughs
required to fix bufferbloat seem almost trivial in comparison.

> "we have to jam in code without testing it because we don't have a
> discipline of testing and it's a waste of time".

It's a matter of having enough distributed testing.

> 50% of what a developer should be doing (if not more) is making sure that
> they don't break more than they improve.

so try 'packaging' rather than developing, and wrap your head around
the test matrix problem.
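
For a rough sense of scale, here is a minimal sketch that multiplies out the matrix figures quoted further down in this thread (20+ architectures, 150+ platforms, 3 libcs, 3 toolchain generations, 5-6 kernels) and prices each combination at the quoted "about 3 hours" for a parallel full build. The names below are hypothetical, the counts are the quoted lower bounds, and treating the axes as fully independent is an assumption: in practice each platform belongs to one architecture and many combinations are never built, so read the result as an illustration of scale, not a measurement.

```python
from math import prod

# Lower-bound axis sizes as quoted in the thread ("20+", "150+", "5-6").
MATRIX_AXES = {
    "cpu architectures": 20,
    "platforms": 150,
    "libcs": 3,
    "toolchain generations": 3,
    "kernels": 5,
}

# Quoted figure: "a parallel full build is about 3 hours".
PARALLEL_BUILD_HOURS = 3


def matrix_size(axes):
    """Number of combinations if every axis varied independently (an assumption)."""
    return prod(axes.values())


def serial_build_years(combinations, hours_per_build):
    """Years needed to build every combination once, back to back, on one box."""
    return combinations * hours_per_build / (24 * 365)


combos = matrix_size(MATRIX_AXES)
years = serial_build_years(combos, PARALLEL_BUILD_HOURS)
print(f"{combos:,} combinations, about {years:.0f} years of back-to-back builds")
```

Even with generous pruning of invalid combinations, that is well beyond what one packager can check per commit, which is the point about needing enough distributed testing.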

>
>
> I realize this is tough, not fun, and sometimes very frustrating. But cool
> "new stuff" is far less important than keeping stuff stable.

This is a classic tension. I note that we're trying to fix the
internet here, before *its* stability goes unstable.

So a great deal of change and r&d is needed, and yes, it needs to be
managed well, but stability only qualifies as a goal in limited ways.

> I'm not trying to be negative - this is stuff I learned at huge personal
> cost in very high stress environments where people were literally screaming
> at me every hour of every day.

I have been in those too. I would say that the amount of stress I've
put myself under, trying to ship something by the end of this month -
compares closely. Personally I would like to offload about 95%
of what I currently do, so I could focus on what's truly important.
I'm glad we have more and more volunteers, self identifying problems,
leaping forward and going out on their own, to go fix them.

Still, the seat I wish I was sitting in now, with resources I wish I
could command, is Bob Taylor's, circa 1968 or so.

http://en.wikipedia.org/wiki/Robert_Taylor_%28computer_scientist%29

He's always been a real inspiration to me.

> The cerowrt/bufferbloat stuff is worth doing, and it's worth doing right -
> I'm a fan.

THX!

>
>
> -----Original Message-----
> From: "Dave Taht" <dave.taht@gmail.com>
> Sent: Thursday, April 5, 2012 10:50pm
> To: dpreed@reed.com
> Cc: cerowrt-devel@lists.bufferbloat.net
> Subject: Re: [Cerowrt-devel] Cero-state this week and last
>
> On Thu, Apr 5, 2012 at 7:33 PM, <dpreed@reed.com> wrote:
>> A small suggestion.
>>
>>
>>
>> Create a regression test suite, and require contributors to *pass* the
>> test
>> with each submitted patch set.
>
> A linear complete build of openwrt takes 17 hours on good hardware.
> It's hard to build in parallel.
>
> A parallel full build is about 3 hours but requires a bit of monitoring
>
> Incremental package builds are measured in minutes, however...
>
>> Be damned politically incorrect about checkins that don't meet this
>> criterion - eliminate
>> the right to check in code for anyone who contributes something that
>> breaks
>> functionality.
>
> The number of core committers is quite low, too low, at present.
> However the key problem here is that
> the matrix of potential breakage is far larger than any one contributor
> can deal with.
>
> There are:
>
> 20+ fairly different cpu architectures *
> 150+ platforms *
> 3 different libcs *
> 3 different (generation) toolchains *
> 5-6 different kernels
>
> That matrix alone is hardly conceivable to deal with. In there are
> arches that are genuinely weird (avr anyone), arches that have
> arbitrary endian, arches that are 32 bit and 64 bit...
>
> Add in well over a thousand software packages (everything from Apache
> to zile), and you have an idea of how much code has dependencies on
> other code...
>
> For example, the breakage yesterday (or was it the day before) was in
> a minor update to libtool, as best as I recall. It broke 3 packages
> that cerowrt has available as options.
>
> I'm looking forward, very much, to seeing the buildbot produce a
> known, good build, that I can layer my mere 67 patches and two dozen
> packages on top of without having to think too much.
>
>> Every project leader discovers this.
>
> Cerowrt is an incredibly tiny superset of the openwrt project. I help
> out where I can.
>
>> Programmers are *lazy* and refuse to
>> check their inputs unless you shame them into compliance.
>
> Volunteer programmers are not lazy.
>
> They do, however, have limited resources, and prefer to make progress
> rather than make things perfect. Difficult to pass check-in tests
> impede progress.
>
> The fact that you or I can build an entire OS, in a matter of hours,
> today, and have it work, most often befuddles me. This is 10s of
> millions of lines of code, all perfect, most of the time.
>
> It used to take 500+ people to engineer an os in 1992, and 4 days to
> build. I consider this progress.
>
> There are all sorts of processes in place, some can certainly be
> improved. For example, discussed last week were methods for dealing
> with and approving the backlog of submitted patches by other
> volunteers.
>
> It mostly just needs more eyeballs. And testing. There's a lot of good
> stuff piled up.
>
> http://patchwork.openwrt.org/project/openwrt/list/
>>
>>
>>
>> -----Original Message-----
>> From: "Dave Taht" <dave.taht@gmail.com>
>> Sent: Thursday, April 5, 2012 10:27pm
>> To: cerowrt-devel@lists.bufferbloat.net
>> Subject: [Cerowrt-devel] Cero-state this week and last
>>
>> I attended the ietf conference in Paris (virtually), particularly ccrg
>> and homenet.
>>
>> I do encourage folk to pay attention to homenet if possible, as laying
>> out what home networks will look like in the next 10 years is proving
>> to be a hairball.
>> ccrg was productive.
>>
>> Some news:
>>
>> I have been spending time fixing some infrastructural problems.
>>
>> 1) After being blindsided by more continuous integration problems in
>> the last month than in the last 5, I found out that one of the root
>> causes was that the openwrt build cluster had declined in size from 8
>> boxes to 1(!!), and time between successful automated builds was in
>> some cases over a month.
>>
>> The risk of going from 1 to 0 build slaves seemed untenable. So I sprang
>> into action, scammed two boxes and travis has tossed them into the
>> cluster. Someone else volunteered a box.
>>
>> I am a huge proponent of continuous integration on complex projects.
>> http://en.wikipedia.org/wiki/Continuous_integration
>>
>> Building all the components of an OS like openwrt correctly, all the
>> time, with the dozens of developers involved, with a minimum delta
>> between commit, breakage, and fix, is really key to simplifying the
>> relatively simple task we face in bufferbloat.net of merely layering
>> on components and fixes improving the state of the art in networking.
>>
>> The tgrid is still looking quite bad at the moment.
>>
>> http://buildbot.openwrt.org:8010/tgrid
>>
>> There's still a huge backlog of breakage.
>>
>> But I hope it gets better. Certainly building a full cluster of build
>> boxes or vms (openwrt@HOME!!) would help a lot more.
>>
>> If anyone would like to help hardware wise, or learn more about how to
>> manage a build cluster using buildbot, please contact travis
>> <thepeople AT openwrt.org>
>>
>> 2) Bloatlab #1 has been completely rewired and rebuilt and most of
>> the routers in there reflashed to Cerowrt-3.3.1-2 or later. They
>> survived some serious network abuse over the last couple days
>> (ironically the only router that crashed was the last rc6 box I had in
>> the mix - and not due to a network fault! I ran it out of flash with a
>> logging tool).
>>
>> To deal with the complexity in there (there's also a sub-lab for some
>> sdnat and PCP testing), I ended up with a new ipv6 /48 and some better
>> ways to route that I'll write up soon.
>>
>> 3) I did finally get back to fully working builds for the ar71xx
>> (cerowrt) architecture a few days ago. I also have a working 3.3.1
>> kernel for the x86_64 build I use to test the server side.
>> (bufferbloat is NOT just a router problem. Fixing all sides of a
>> connection helps a lot). That + a new iproute2 + the debloat script
>> and YOU TOO can experience orders of magnitude less latency....
>>
>> http://europa.lab.bufferbloat.net/debloat/ has that 3.3.1 kernel for
>> x86_64
>>
>> Most of the past week has been backwards rather than forwards, but it
>> was negative in a good way, mostly.
>>
>> I'm sorry it's been three weeks without a viable build for others to test.
>>
>> 4) today's build: http://huchra.bufferbloat.net/~cero1/3.3/3.3.1-4/
>>
>> + Linux 3.3.1 (this is missing the sfq patch I liked, but it's good
>> enough)
>> + Working wifi is back
>> + No more fiddling with ethtool tx rings (up to 64 from 2. BQL does
>> this job better)
>> + TCP CUBIC is now the default (no longer westwood)
>> after 15+ years of misplaced faith in delay based tcp for wireless,
>> I've collected enough data to convince me that cubic wins. all the
>> time.
>> + alttcp enabled (making it easy to switch)
>> + latest netperf from svn (yea! remotely changeable diffserv settings
>> for a test tool!)
>>
>> - still horrible dependencies on time. You pretty much have to get on
>> it and do a rndc validation disable multiple times, restart ntp
>> multiple times, killall named multiple times to get anywhere if you
>> want to get dns inside of 10 minutes.
>>
>> At this point sometimes I just turn off named in /etc/xinetd.d/named
>> and turn on port 53 for dnsmasq... but
>> usually after flashing it the first time, wait 10 minutes (let it
>> clean flash), reboot, wait another 10, then it works. Drives me
>> crazy... Once it's up and has valid time and is working, dnssec works
>> great but....
>>
>> + way cool new stuff in dnsmasq for ra and AAAA records
>> - huge dependency on keeping bind in there
>> - aqm-scripts. I have not succeeded in making hfsc work right. Period.
>> + HTB (vs hfsc) is proving far more tractable. SFQRED is scaling
>> better than I'd dreamed. Maybe eric dreamed this big, I didn't.
>> - http://www.bufferbloat.net/issues/352
>> + Added some essential randomness back into the entropy pool
>> - hostapd really acts up at high rates with the hack in there for more
>> entropy (From the openwrt mainline)
>> + named caching the roots idea discarded in favor of classic '.'
>>
>>
>> --
>> Dave Täht
>> SKYPE: davetaht
>> US Tel: 1-239-829-5608
>> http://www.bufferbloat.net
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
>
>
> --
> Dave Täht
> SKYPE: davetaht
> US Tel: 1-239-829-5608
> http://www.bufferbloat.net



--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://www.bufferbloat.net

------=_20120408215721000000_58253--