[Bloat] Better understanding decision-making across all layers of the stack

Thu Mar 17 08:56:12 PDT 2011

Michael J. Schultz just put up a nice blog entry on how the receive side
of the current Linux network stack works. 

http://blog.beyond-syntax.com/2011/03/diving-into-linux-networking-i/

There are people on the bloat lists who understand wireless RF, people
who understand a specific driver, people who grok the MAC layer,
there's a whole bunch of TCP and AQM folk, we have supercomputer guys
and embedded guys, cats and dogs, all talking in one space and yet...

Yet in my discussions[0] with specialists working at these various
layers of the kernel I've often spotted holes in knowledge on how
the layers actually work together to produce a result.

I'm no exception[1] - in the last few weeks of fiddling with the
debloat-testing tree I've learned that everything I knew about the Linux
networking stack is basically obsolete. With the advent of tickless
operation, GSO, threaded interrupts, softirqs, and other new
scheduling mechanisms, most of the rationale I'd had for running at a
1000 Hz tick rate has vanished.

That said, I cannot honestly believe Linux is soft-clocked enough right
now to ensure low latency decision making across all the layers of the
networking stack, as my struggles with the new eBDP algorithm and the
iwl driver seem to be showing. 
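
To put rough numbers on the timing concern, here is a back-of-envelope
sketch of how many full-size packets a link can send during one timer
tick. The 1500-byte packet size, the lack of framing overhead, and the
function name are my assumptions for illustration, not measurements from
any kernel.

```python
# Back-of-envelope sketch: how many full-size packets fit in one timer tick?
# Assumptions (illustrative only): 1500-byte packets, no framing overhead,
# and a link running at full line rate.

def packets_per_tick(link_bps, hz, packet_bytes=1500):
    """Return how many packets a link can serialize in one 1/hz-second tick."""
    bits_per_packet = packet_bytes * 8
    return link_bps / hz / bits_per_packet


if __name__ == "__main__":
    links = [
        ("100 Mbit", 100_000_000),
        ("1 GigE", 1_000_000_000),
        ("10 GigE", 10_000_000_000),
    ]
    for name, bps in links:
        for hz in (100, 250, 1000):
            count = packets_per_tick(bps, hz)
            print(f"{name:>8} @ HZ={hz:<4}: {count:8.1f} packets per tick")
```

Even at HZ=1000, a gigabit link can move on the order of 80 full-size
packets between ticks, so any decision that only happens on a tick
boundary arrives many packets late.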

Certainly low-hanging fruit remains.

For example, Dan Siemon just found (and, with Eric Dumazet, fixed) a
long-standing bug in Linux's default pfifo_fast qdisc that has been
messing up ECN for a decade [2]. That fix went *straight* into Linus's
git head and net-stable.

It would be nice to have a clear up-to-date picture - a flowchart - a
set of diagrams - of how, when, and where all the different
network servo mechanisms in the kernel interact, for several protocols,
from layer 0 to layer 7 and back again.

Call it a day in the life of a set of network streams. [3]

Michael's piece above is a start, but only handles the receive side at a
very low level. When does a TCP packet get put on the txqueue? When does
a qdisc get invoked and a packet make it onto the device ring? How does
stuff get pulled from the receive buffer and fed back into the TCP
server loop? When and at what points do we decide to drop a packet? How
is ND handled differently from ARP or other low-level packets? When does
NAPI kick in? What's the interaction between wireless retries and packet
aggregation?

Pointers to existing documentation that is current and accurate would be
nice too.

I think that a lot of debloating could be done in between the layers of
the stack on both low and high end devices. Judging from this recent
thread [4] here, on the high end, there are disputes about adequate
amounts of driver buffering on 10GE [5] and about queue management [6],
and abstractions such as RED have actually been pushed into silicon [7].
How do we best take advantage of those features going forward? [8]

In order to improve responsiveness and reduce delay and excessive
buffering up and down the stack, we could really use more
cross-disciplinary knowledge and a more common understanding of how all
this stuff fits together, but writing such a document would require
multiple people to put their heads together to get something coherent.
[9] Volunteers?

-- 
Dave Taht
http://nex-6.taht.net

[0] Dave Täht & Felix Fietkau (of OpenWrt)

http://mirrors.bufferbloat.net/podcasts/BPR-The_Wireless_Stack.mp3

I had intended to turn this discussion into a more formal podcast
format. I simply haven't had time. It's listenable as is, however. If
you want to learn more about how 802.11 wireless works, in particular
how 802.11n packet aggregation works, toss that recording onto your
mp3 player and call up etags....

[1] I also had to listen to this recording about 6 times to understand where
Felix and I had miscommunicated. It was a very educational conversation
for me, at least. (And it convinced Felix to spend time on bufferbloat, too.)

I also note that recording as much as possible of everything is the only
trait I share with Richard Nixon.

[2] ECN + pfifo problem clearly explained:
    http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/
    WAY TO GO DAN!

[3] I imagine the work would make for a good (series of) article(s) on
LWN, or perhaps the new Byte magazine.

[4] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000240.html
[5] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000260.html
[6] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000265.html
[7] https://lists.bufferbloat.net/pipermail/bloat/2011-March/000281.html
[8] There have been interesting attempts at simplifying the Linux networking
stack, notably VJ's netchannels, which was sidetracked by the problems
of interacting with netfilter ( http://lwn.net/Articles/192767/ )

OpenFlow is also interesting as an example of what can be moved into hardware.

[9] I don't want all these footnotes and theoretical stuff to get in the
way of actually gaining a good set of pictures and an understanding of
how the Linux network stack actually works today, so that new algorithms
such as eBDP, A* and tcp-fit can be correctly implemented and drivers
improved in the right direction....

(And that said, knowing better how other OSes do it would be nice, too.)