From: Sebastian Moeller
Date: Wed, 15 Aug 2012 22:15:35 -0700
To: Dave Taht
Cc: cerowrt-devel@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] cerowrt 3.3.8-17 is released
Message-Id: <6329D77B-9803-453A-A34F-7B6EA02FE9AA@gmx.de>
References: <36D61FDC-9AA9-46CC-ACBB-2D28B250C660@gmx.de>

Hi Dave,

thanks for the detailed response...

On Aug 15, 2012, at 9:08 PM, Dave Taht wrote:

> re: ath: skbuff alloc of size 1926 failed
>
> as for the ath skbuff problem, I've seen that a lot. I had put hard
> packet limits (~600) on fq_codel in -11 and prior that were too low
> and it mostly went away, but I hit tail drop behavior everywhere,
> instead of codel behavior. What I have now (typically 1200) may well
> be too high, but not as overly high as the default (10k packets).

Question: is this limit per interface, per flow, or per fq bin?

> There may be another means of increasing the size of that slab pool or
> making it less onerous.
Interesting idea, I will have a look at that...

>
> I would like it if codel "kicked in" earlier than it currently does.
> The code in ns2 is currently using half the period that the linux code
> is. This would control things better, or so I hope (planning on trying
> this as I get time)
>
> I am also considering means of artificially upscaling the drop
> scheduler when we get close to queue limits.
>
> See some discussions on the codel list for these issues. (sims are
> easier to deal with than cerowrt, too!)

Ah great, more goodness on the way to cerowrt I hope :)

>
> as for bind, it should be automagically restarted from xinetd, no need
> to fiddle with anything. However, since you are already under massive
> memory pressure, it may well fail to start up that way, too.

Well, once bind is gone and the measurement is over, the memory
pressure is gone and there should be enough memory for bind to start
(will check that hypothesis later). But trying to start it manually with
something like 23MB free did not allow me to start bind up again, so
certainly I was doing something wrong (or the OOM killer killed more
than just bind, but that is hard to say as nothing showed up in dmesg or
in logread -f about the OOM killer, so maybe bind died from other
causes).

> At the
> moment, I've largely given up on bind on anything but a more core home
> gw, and am running dnsmasq on everything (3700v2, picostations,
> nanostations) but the 3800s. (and the ones I run it on, aren't being
> used for wifi right now).

Ah, that should free some MBs for queues to grow in :)

>
> Lastly: Swap space won't help you on exhausting kernel limits.

I had the naive hope that the swap would allow pushing bind's memory
out to the page file and give the kernel some more room to breathe, but
that only worked to some degree.
(In 3.3.8-6 one of the UDP storm tests I did made the router reboot
about every other day; adding swap turned this into survival with a
killed bind and non-functional DNS. In retrospect I am not sure whether
adding swap was such a good idea, as after the sudden reboots the router
was at least functional again :))

>
> I'm glad you can reproduce the ath: slab problem - I can get it too at
> high rates using netperf over wifi.

I always wanted to stress this with netperf, but somehow was never able
to find a netperf server outside of my cable modem with which to
recreate my failure mode...

> I will try a 3700v2 with and
> without bind to see if it's still there in 3.3.8-17. In the meantime
> if anyone knows how to get more allocations in that (2048? 4096?) slab
> by default, perhaps that will help?

Thanks so much for all the hard work and such a fun toy to play with…

Sebastian

>
>
>
> On Wed, Aug 15, 2012 at 10:23 AM, Sebastian Moeller wrote:
>> Hi Dave,
>>
>> great work, as always I upgraded my production router to the latest
>> and greatest (since I only have one router…). And it works quite well
>> for normal usage…
>> Netalyzr reports around 2800 ms of uplink buffering, yet saturating
>> the uplink does not affect ping times to a remote target noticeably,
>> basically the same as for all codellized cero versions I tested so
>> far...
>>
>> Some notes and a question:
>> I noticed that even given plenty of swap space (1GB on a usb stick),
>> using http://broadband.mpi-sws.org/residential/ to exercise UDP stress
>> (on the uplink I assume) I can easily produce (I run the test from a
>> macosx via 5GHz wireless over 1.5 yards):
>> Aug 15 01:16:29 nacktmulle kern.err kernel: [175395.132812] ath: skbuff alloc of size 1926 failed
>> (and plenty of those…).
>> What then happens is that the OOM killer will aim for bind
>> (reasonable since it is the largest single process) and kill it.
>> When I try to restart bind by:
>> root@nacktmulle:~# /etc/rc.d/S47namedprep start
>> root@nacktmulle:~# /etc/rc.d/S48named restart
>> Stopping isc-bind
>> /etc/chroot/named//var/run/named/named.pid not found, trying brute force
>> killall: named: no process killed
>> Kicking isc-bind in xinetd
>> rndc: connect failed: 127.0.0.1#953: connection refused
>> And bind does not start again and the router becomes less than
>> useful. Now I assume I am doing something wrong, but what? If you have
>> any idea how to solve this short of a reboot of the router (my current
>> method) I would be happy to learn.
>>
>>
>>
>> best regards
>> sebastian
>>
>> On Aug 12, 2012, at 11:08 PM, Dave Taht wrote:
>>
>>> I'm too tired to write up a full set of release notes, but I've been
>>> testing it all day,
>>> and it looks better than -10 and certainly better than -11, but I won't know
>>> until some more folk sit down and test it, so here it is.
>>>
>>> http://huchra.bufferbloat.net/~cero1/3.3/3.3.8-17/
>>>
>>> fresh merge with openwrt, fix to a bind CVE, fixes for 6in4 and quagga
>>> routing problems,
>>> and a few tweaks to fq_codel setup that might make voip better.
>>>
>>> Go forth and break things!
>>>
>>> In other news:
>>>
>>> Van Jacobson gave a great talk about bufferbloat, BQL, codel, and fq_codel
>>> at last week's ietf meeting. Well worth watching. At the end he outlines
>>> the deployment problems in particular.
>>>
>>> http://recordings.conf.meetecho.com/Recordings/watch.jsp?recording=IETF84_TSVAREA&chapter=part_3
>>>
>>> Far more interesting than this email!
>>>
>>>
>>> --
>>> Dave Täht
>>> http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
>>> with fq_codel!"
>>> _______________________________________________
>>> Cerowrt-devel mailing list
>>> Cerowrt-devel@lists.bufferbloat.net
>>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>>
>
>
>
> -- 
> Dave Täht
> http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out
> with fq_codel!"
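
A rough back-of-the-envelope sketch of why the fq_codel packet limits
discussed in this thread (~600, 1200, and the "10k" default) matter on a
router that is already short on memory. It only multiplies a packet limit
by a per-packet buffer size. Everything here is an assumption for
illustration, not a measurement: the 2048-byte buffer size is a guess (the
thread itself is unsure whether the slab is 2048 or 4096), "10k" is taken
to mean 10240 packets, the limit is assumed to apply to the whole qdisc
instance rather than per flow, and the helper name queue_kib is
hypothetical.

```shell
#!/bin/sh
# Worst-case memory pinned by a completely full queue, in KiB.
# ASSUMPTION: every queued packet holds exactly one buffer of BUF_BYTES;
# per-packet bookkeeping overhead is ignored.
BUF_BYTES=2048

# queue_kib LIMIT
# Prints LIMIT * BUF_BYTES expressed in KiB (hypothetical helper).
queue_kib() {
    echo $(( $1 * BUF_BYTES / 1024 ))
}

# The three limits mentioned in the thread (10240 assumed for "10k").
for limit in 600 1200 10240; do
    echo "limit $limit packets: up to $(queue_kib "$limit") KiB"
done
```

Under these assumptions a single full queue at the default limit could pin
roughly 20 MiB, the same order of magnitude as the 23MB free mentioned
above, while a limit of 1200 stays near 2.4 MiB. Changing BUF_BYTES to 4096
doubles every figure.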