From: Dave Taht
To: david@lang.hm
Cc: cerowrt-devel@lists.bufferbloat.net
Date: Sun, 8 Apr 2012 08:53:10 -0700
Subject: Re: [Cerowrt-devel] Cero-state this week and last
References: <1333679627.997611294@apps.rackspace.com>
List-Id: Development issues regarding the cerowrt test router project

On Thu, Apr 5, 2012 at 8:07 PM, wrote:
> On Thu, 5 Apr 2012, Dave Taht wrote:
>
>> A linear complete build of openwrt takes 17 hours on good hardware.
>> It's hard to build in parallel.
>
>
> distcc doesn't work for this?

There are some things that distcc works well on - kernel builds, a
major c++ application, stuff like that. Other things are subject to
Amdahl's law and hopelessly serial, notably creating toolchains, link
steps, etc.

Much of the early stages of a build have been parallelized as much as
humanly possible. The bulk of the build currently is on packages.

http://buildbot.openwrt.org:8010/builders/ar71xx/builds/127

To use distcc effectively in embedded you hit a limiting factor: you
need to continuously redistribute and install the whole toolchain and
all dependent packages on all the building hosts, OR use a shared
filesystem, which rapidly becomes a limiting factor of its own, not
just on I/O but as a single point of failure.

Merely parallelizing package building across a multi-cpu box gets
harder as your number of cpus goes up. The number of potential
interactions between dependencies goes up, as does your I/O. Getting
all the dependencies right is a big job.

Currently the scaling factor for make -j (with no distcc) is roughly
1/2 the number of cpus for the two data points I have, and appears
rather bound on I/O. I get roughly the same results with an old 4 core
box with great I/O (4 disks, hardware raid) vs a more modern 8 core
box with merely mirrored drives.

If I were to throw, say, a 48 cpu box at this problem, and do builds
entirely out of ram (doable), I don't know how much further up a
parallel build would scale. Certainly all the dependencies would have
to get worked out.

I sure wouldn't mind having a couple of these:

http://www.penguincomputing.com/hardware/linux_servers/configurator/intel/relion2800

or these:

http://www.penguincomputing.com/hardware/linux_servers/configurator/amd/altus1804

to play with. The ROI of stuff like that...
vs the cost of electricity and rack space of the hardware we've had
donated to this project - would probably pay off inside of a year or
two, and that doesn't count the very real productivity improvement for
everyone that could get a full build turned around in under 19 hours.

Regrettably up-front capital like that is hard to come by, as
bufferbloat.net is not an 'exciting new age startup' with billions of
dollars per year of potential market cap. We're merely trying to save
billions of people a lot of headache, frustration, and time, and get
the technology into everything without any direct form of recompense.
It's kind of a harder sell. For some reason.

It's cheaper short term, if not long term, to bleed out electricity
and rack space monthly, and try to have mental processes that cope
with overnight builds, and scavenge more free hardware wherever we
can.

Incidentally I did cost out using amazon EC2, etc, last year, and that
was highway robbery, given the amount of cpu cycles this task can
consume.

>
>
>> A parallel full build is about 3 hours but requires a bit of monitoring
>
>
> can this monitoring be automated?

make -j 8
watch some seemingly random package fail to build
build its dependencies
build it
make -j 8
watch another somewhat random package fail to build
repeat until done

The next problem is that heavy parallelization interleaves your
logging messages, so it's very hard to find where the error occurred.

These are solvable problems, with someone focused on the task, but if
you change '8' to '9', something else tends to break, and it's
architecture dependent as well.

> David Lang
>
>> Incremental package builds are measured in minutes, however...
>>
>>> Be damned politically incorrect about checkins that don't meet this
>>> criterion - eliminate the right to check in code for anyone who
>>> contributes something that breaks functionality.
>>
>>
>> The number of core committers is quite low, too low, at present.
>> However the key problem here is that the matrix of potential breakage
>> is far larger than any one contributor can deal with.
>>
>> There are:
>>
>> 20+ fairly different cpu architectures *
>> 150+ platforms *
>> 3 different libcs *
>> 3 different (generation) toolchains *
>> 5-6 different kernels
>>
>> That matrix alone is hardly conceivable to deal with. In there are
>> arches that are genuinely weird (avr anyone), arches that have
>> arbitrary endian, arches that are 32 bit and 64 bit...
>>
>> Add in well over a thousand software packages (everything from Apache
>> to zile), and you have an idea of how much code has dependencies on
>> other code...
>>
>> For example, the breakage yesterday (or was it the day before) was in
>> a minor update to libtool, as best as I recall. It broke 3 packages
>> that cerowrt has available as options.
>>
>> I'm looking forward, very much, to seeing the buildbot produce a
>> known, good build, that I can layer my mere 67 patches and two dozen
>> packages on top of without having to think too much.
>>
>>> Every project leader discovers this.
>>
>>
>> Cerowrt is an incredibly tiny superset of the openwrt project. I help
>> out where I can.
>>
>>> Programmers are *lazy* and refuse to
>>> check their inputs unless you shame them into compliance.
>>
>>
>> Volunteer programmers are not lazy.
>>
>> They do, however, have limited resources, and prefer to make progress
>> rather than make things perfect. Difficult-to-pass check-in tests
>> impede progress.
>>
>> The fact that you or I can build an entire OS, in a matter of hours,
>> today, and have it work, most often befuddles me. This is 10s of
>> millions of lines of code, all perfect, most of the time.
>>
>> It used to take 500+ people to engineer an os in 1992, and 4 days to
>> build. I consider this progress.
>>
>> There are all sorts of processes in place, some can certainly be
>> improved. For example, discussed last week were methods for dealing
>> with and approving the backlog of submitted patches by other
>> volunteers.
>>
>> It mostly just needs more eyeballs. And testing. There's a lot of good
>> stuff piled up.
>>
>> http://patchwork.openwrt.org/project/openwrt/list/
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: "Dave Taht"
>>> Sent: Thursday, April 5, 2012 10:27pm
>>> To: cerowrt-devel@lists.bufferbloat.net
>>> Subject: [Cerowrt-devel] Cero-state this week and last
>>>
>>> I attended the ietf conference in Paris (virtually), particularly ccrg
>>> and homenet.
>>>
>>> I do encourage folk to pay attention to homenet if possible, as laying
>>> out what home networks will look like in the next 10 years is proving
>>> to be a hairball.
>>> ccrg was productive.
>>>
>>> Some news:
>>>
>>> I have been spending time fixing some infrastructural problems.
>>>
>>> 1) After being blindsided by more continuous integration problems in
>>> the last month than in the last 5, I found out that one of the root
>>> causes was that the openwrt build cluster had declined in size from 8
>>> boxes to 1(!!), and time between successful automated builds was in
>>> some cases over a month.
>>>
>>> The risk of going from 1 to 0 build slaves seemed untenable. So I
>>> sprang into action, scammed two boxes and travis has tossed them into
>>> the cluster. Someone else volunteered a box.
>>>
>>> I am a huge proponent of continuous integration on complex projects.
>>> http://en.wikipedia.org/wiki/Continuous_integration
>>>
>>> Building all the components of an OS like openwrt correctly, all the
>>> time, with the dozens of developers involved, with a minimum delta
>>> between commit, breakage, and fix, is really key to simplifying the
>>> relatively simple task we face in bufferbloat.net of merely layering
>>> on components and fixes improving the state of the art in networking.
>>>
>>> The tgrid is still looking quite bad at the moment.
>>>
>>> http://buildbot.openwrt.org:8010/tgrid
>>>
>>> There's still a huge backlog of breakage.
>>>
>>> But I hope it gets better. Certainly building a full cluster of build
>>> boxes or vms (openwrt@HOME!!) would help a lot more.
>>>
>>> If anyone would like to help hardware wise, or learn more about how to
>>> manage a build cluster using buildbot, please contact travis
>>>
>>>
>>> 2) Bloatlab #1 has been completely rewired and rebuilt and most of
>>> the routers in there reflashed to Cerowrt-3.3.1-2 or later. They
>>> survived some serious network abuse over the last couple days
>>> (ironically the only router that crashed was the last rc6 box I had in
>>> the mix - and not due to a network fault! I ran it out of flash with a
>>> logging tool).
>>>
>>> To deal with the complexity in there (there's also a sub-lab for some
>>> sdnat and PCP testing), I ended up with a new ipv6 /48 and some better
>>> ways to route that I'll write up soon.
>>>
>>> 3) I did finally get back to fully working builds for the ar71xx
>>> (cerowrt) architecture a few days ago. I also have a working 3.3.1
>>> kernel for the x86_64 build I use to test the server side.
>>> (bufferbloat is NOT just a router problem. Fixing all sides of a
>>> connection helps a lot). That + a new iproute2 + the debloat script
>>> and YOU TOO can experience orders of magnitude less latency....
>>>
>>> http://europa.lab.bufferbloat.net/debloat/ has that 3.3.1 kernel for
>>> x86_64
>>>
>>> Most of the past week has been backwards rather than forwards, but it
>>> was negative in a good way, mostly.
>>>
>>> I'm sorry it's been three weeks without a viable build for others to
>>> test.
>>>
>>> 4) today's build: http://huchra.bufferbloat.net/~cero1/3.3/3.3.1-4/
>>>
>>> + Linux 3.3.1 (this is missing the sfq patch I liked, but it's good
>>> enough)
>>> + Working wifi is back
>>> + No more fiddling with ethtool tx rings (up to 64 from 2. BQL does
>>> this job better)
>>> + TCP CUBIC is now the default (no longer westwood)
>>> after 15+ years of misplaced faith in delay based tcp for wireless,
>>> I've collected enough data to convince me that cubic wins. all the
>>> time.
>>> + alttcp enabled (making it easy to switch)
>>> + latest netperf from svn (yea! remotely changeable diffserv settings
>>> for a test tool!)
>>>
>>> - still horrible dependencies on time. You pretty much have to get on
>>> it and do a rndc validation disable multiple times, restart ntp
>>> multiple times, killall named multiple times to get anywhere if you
>>> want to get dns inside of 10 minutes.
>>>
>>> At this point sometimes I just turn off named in /etc/xinetd.d/named
>>> and turn on port 53 for dnsmasq... but
>>> usually after flashing it the first time, wait 10 minutes (let it
>>> clean flash), reboot, wait another 10, then it works. Drives me
>>> crazy... Once it's up and has valid time and is working, dnssec works
>>> great but....
>>>
>>> + way cool new stuff in dnsmasq for ra and AAAA records
>>> - huge dependency on keeping bind in there
>>> - aqm-scripts. I have not succeeded in making hfsc work right. Period.
>>> + HTB (vs hfsc) is proving far more tractable. SFQRED is scaling
>>> better than I'd dreamed. Maybe eric dreamed this big, I didn't.
>>> - http://www.bufferbloat.net/issues/352
>>> + Added some essential randomness back into the entropy pool
>>> - hostapd really acts up at high rates with the hack in there for more
>>> entropy (From the openwrt mainline)
>>> + named caching the roots idea discarded in favor of classic '.'
>>>
>>>
>>> --
>>> Dave Täht
>>>
>>> SKYPE: davetaht
>>> US Tel: 1-239-829-5608
>>> http://www.bufferbloat.net
>>> _______________________________________________
>>> Cerowrt-devel mailing list
>>> Cerowrt-devel@lists.bufferbloat.net
>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>>
>>
>>
>>
>>
>

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://www.bufferbloat.net
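
A back-of-the-envelope sketch of the Amdahl's law point made earlier in this message. It uses only the two figures quoted in the thread (a 17 hour linear build, a roughly 3 hour parallel build) and assumes, purely for illustration, that the 3 hour number came from make -j 8 on 8 cpus with no I/O bottleneck. The function names are hypothetical. Since the message reports the real builds as I/O bound, the output is an optimistic ceiling, not a prediction.

```python
def parallel_fraction(speedup: float, workers: int) -> float:
    """Invert Amdahl's law, S = 1 / ((1 - p) + p / n), to solve for p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


def amdahl_speedup(p: float, workers: int) -> float:
    """Ideal speedup for a job whose fraction p is perfectly parallel."""
    return 1.0 / ((1.0 - p) + p / workers)


SERIAL_HOURS = 17.0    # linear full build, from the thread
PARALLEL_HOURS = 3.0   # parallel full build, from the thread
JOBS = 8               # ASSUMPTION: the 3 hour build used 8 cpus

# Fraction of the build that would have to be parallel to explain 17h -> 3h.
p = parallel_fraction(SERIAL_HOURS / PARALLEL_HOURS, JOBS)
print(f"implied parallel fraction: {p:.3f}")

# Projected wall-clock time if nothing but cpu count changed.
for n in (8, 16, 48):
    hours = SERIAL_HOURS / amdahl_speedup(p, n)
    print(f"{n:>2} cpus: ~{hours:.1f} h")
```

Under these assumptions the serial part of the build alone costs about an hour, so going from 8 to 48 cpus buys far less than a 6x improvement even before I/O and dependency problems are counted.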