From: Dave Täht <d@taht.net>
To: bloat@lists.bufferbloat.net, bloat-devel@lists.bufferbloat.net
Cc: mjschultz@gmail.com, David Miller, corbet
Subject: Better understanding decision-making across all layers of the stack
Organization: Teklibre - http://www.teklibre.com
Date: Thu, 17 Mar 2011 09:56:12 -0600

Michael J. Schultz just put up a nice blog entry on how the receive
side of the current Linux network stack works:

http://blog.beyond-syntax.com/2011/03/diving-into-linux-networking-i/

There are people on the bloat lists that understand wireless RF,
people that understand a specific driver, people that grok the MAC
layer, there's a whole bunch of TCP and AQM folk, we have
supercomputer guys and embedded guys, cats and dogs, all talking in
one space, and yet...

Yet in my discussions[0] with specialists working at these various
layers of the kernel, I've often spotted holes in knowledge of how
the layers actually work together to produce a result.

I'm no exception[1] - in the last few weeks of fiddling with the
debloat-testing tree I've learned that everything I knew about the
Linux networking stack is basically obsolete. With the advent of
tickless operation, GSO offload, threaded interrupts, softirqs, and
other new scheduling mechanisms, most of the rationale I'd had for
running at a 1000HZ tick rate has vanished.

That said, I cannot honestly believe Linux is soft-clocked enough
right now to ensure low-latency decision-making across all the layers
of the networking stack, as my struggles with the new eBDP algorithm
and the iwl driver seem to be showing.

Certainly low-hanging fruit remains.

For example, Dan Siemon just found (and, with Eric Dumazet, fixed) a
long-standing bug in Linux's default pfifo_fast qdisc that has been
messing up ECN for a decade [2]. That fix went *straight* into
Linus's git head and net-stable.
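To make that class of bug concrete, here's a tiny standalone sketch -
my own toy table and names, *not* Dan and Eric's actual patch. The
low two bits of the old IPv4 TOS byte carry the ECN codepoint (RFC
3168), while the legacy TOS mask (0x1e) overlaps ECN bit 1, so a
priority map indexed off the unmasked byte shunts an ECT(0)-marked
packet into a different pfifo_fast band than the identical packet
without ECN:

/*
 * ecn_band.c - illustrative only; the table and names are made up,
 * not the kernel's. Build and run: cc ecn_band.c && ./a.out
 */
#include <stdio.h>

#define IPTOS_ECN_MASK 0x03      /* low two bits: the ECN codepoint  */
#define ECT0           0x02      /* "ECN-capable transport (0)"      */

/* a toy 16-entry TOS -> band table in the style of ip_tos2prio */
static const int tos2band[16] = { 1,2,2,2, 1,2,0,0, 1,1,1,1, 1,1,1,1 };

static int band(unsigned char tos, int mask_ecn)
{
    if (mask_ecn)
        tos &= ~IPTOS_ECN_MASK;         /* the essence of the fix   */
    return tos2band[(tos & 0x1e) >> 1]; /* 0x1e overlaps ECN bit 1  */
}

int main(void)
{
    unsigned char tos = 0x00;           /* plain best-effort traffic */
    printf("no ECN        -> band %d\n", band(tos, 0));
    printf("ECT(0), buggy -> band %d\n", band(tos | ECT0, 0));
    printf("ECT(0), fixed -> band %d\n", band(tos | ECT0, 1));
    return 0;
}

Mask the ECN field out before indexing and the bands line up again;
that, as I read it, is the essence of what the fix does.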
It would be nice to have a clear, up-to-date picture - a flowchart, a
set of diagrams - of how, when, and where all the different network
servo mechanisms in the kernel interact, for several protocols, from
layer 0 to layer 7 and back again. Call it "a day in the life of a
set of network streams". [3]

Michael's piece above is a start, but it only handles the receive
side, at a very low level.

- When does a TCP packet get put on the txqueue?
- When does a qdisc get invoked, and when does a packet make it onto
  the device ring?
- How does stuff get pulled from the receive buffer and fed back into
  the TCP server loop?
- When, and at what points, do we decide to drop a packet?
- How is ND handled differently from ARP or other low-level packets?
- When does NAPI kick in?
- What's the interaction between wireless retries and packet
  aggregation?

(A compilable thumbnail of the transmit half of this chain is
appended at the end of this mail.)

Pointers to existing, current, and accurate documentation would be
nice, too.

I think that a lot of debloating could be done in between the layers
of the stack, on both low- and high-end devices. Judging from this
recent thread [4] here, on the high end there are disputes over the
adequate amount of driver buffering on 10GE [5] versus queue
management [6], and abstractions such as RED have actually been
pushed into silicon [7]. How do we best take advantage of those
features going forward? [8]

In order to improve responsiveness and reduce delay and excessive
buffering up and down the stack, we could really use more
cross-disciplinary knowledge and a more common understanding of how
all this stuff fits together, but writing such a document would
require that multiple people put their heads together to produce
something coherent. [9]

Volunteers?

-- 
Dave Taht
http://nex-6.taht.net

[0] Dave Täht & Felix Fietkau (of OpenWrt):
    http://mirrors.bufferbloat.net/podcasts/BPR-The_Wireless_Stack.mp3

    I had intended to turn this discussion into a more formal podcast
    format. I simply haven't had time. It's listenable as-is,
    however. If you want to learn more about how 802.11 wireless
    works - in particular, how 802.11n packet aggregation works -
    toss that recording onto your mp3 player and call up etags....

[1] I also had to listen to this recording about 6 times to
    understand where Felix and I had miscommunicated. It was a very
    educational conversation for me, at least. (And it convinced
    Felix to spend time on bufferbloat, too.) I also note that
    recording as much as possible of everything is the only trait I
    share with Richard Nixon.

[2] The ECN + pfifo_fast problem, clearly explained:
    http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/
    WAY TO GO DAN!

[3] I imagine the work would make for a good (series of) article(s)
    on LWN, or perhaps the new Byte magazine.

[4] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000240.html
[5] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000260.html
[6] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000265.html
[7] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000281.html

[8] There have been interesting attempts at simplifying the Linux
    networking stack, notably VJ's netchannels, which was sidetracked
    by the problems of interacting with netfilter
    ( http://lwn.net/Articles/192767/ ). OpenFlow is also interesting
    as an example of what can be moved into hardware.
[9] I don't want all these footnotes and theoretical stuff to get in
    the way of actually gaining a good set of pictures, and an
    understanding of how the Linux network stack actually works
    today, so that new algorithms such as eBDP, A*, and TCP-FIT can
    be correctly implemented, and drivers improved in the right
    direction.... (That said, knowing better how other OSes did it
    would be nice, too.)
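P.S. To make the "day in the life" idea slightly more concrete,
here's a compilable thumbnail of the transmit half of the chain, as I
currently understand it from reading the 2.6.38 sources. The function
names follow the real call chain; the struct, the bodies, and the
printouts are invented stand-ins - treat it as a sketch to be
corrected by people who know better, not as documentation:

/* txpath.c - a toy walk down the TX path. cc txpath.c && ./a.out */
#include <stdio.h>

struct sk_buff { int len; };            /* stand-in for the real one */

static int ndo_start_xmit(struct sk_buff *skb)  /* driver entry point */
{
    printf("driver: skb(%d bytes) placed on the device TX ring\n",
           skb->len);
    return 0;
}

static int sch_direct_xmit(struct sk_buff *skb) /* qdisc dequeue      */
{
    printf("qdisc : dequeued, handed to dev_hard_start_xmit()\n");
    return ndo_start_xmit(skb);
}

static int dev_queue_xmit(struct sk_buff *skb)  /* core: enqueue      */
{
    printf("core  : dev_queue_xmit() -> qdisc enqueue (pfifo_fast)\n");
    /* in the real kernel __qdisc_run() drives the dequeue side; this
     * toy just calls straight through */
    return sch_direct_xmit(skb);
}

static int ip_queue_xmit(struct sk_buff *skb)   /* L3: route + header */
{
    printf("ip    : route chosen, IP header built\n");
    return dev_queue_xmit(skb);
}

static int tcp_transmit_skb(struct sk_buff *skb) /* L4: cwnd cleared  */
{
    printf("tcp   : segment cleared the congestion window\n");
    return ip_queue_xmit(skb);
}

int main(void)
{
    struct sk_buff skb = { .len = 1448 };
    /* tcp_sendmsg() copies user data into the socket write queue;
     * only when tcp_write_xmit() decides the window allows it does a
     * segment start down this chain */
    return tcp_transmit_skb(&skb);
}

Where GSO, the qdisc/driver handoff, and the driver ring sizes
complicate that picture on real hardware is exactly the kind of thing
the document should nail down.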