On Mon, Mar 7, 2011 at 1:28 PM, Jim Gettys wrote:

> Cisco is far from unique. I found it impossible to get this information
> from Linux. Dunno about other operating systems.
> It's one of the things we need to fix in general.

So I'm not the only one. :) I'm looking to get this for Linux, and am willing to implement it if necessary, and was looking for the One True Way. I assume reporting back through netlink is the way to go.

> Exactly what the right metric(s) is (are), is interesting, of course. The
> problem with only providing instantaneous queue depth is that while it tells
> you you are currently suffering, it won't really help you detect transient
> bufferbloat due to web traffic, etc, unless you sample at a very high rate.
> I really care about those frequent 100-200ms impulses I see in my traffic.
> So a bit of additional information would be goodness.

My PhD research is focused on automatically diagnosing these sorts of hiccups on a local host. I collect a common set of statistics across the entire local stack every 100 ms, then run a diagnosis algorithm to detect which parts of the stack (connections, applications, interfaces) aren't doing their job sending or receiving packets. Among the research questions: what stats are necessary/sufficient for this kind of diagnosis, what their semantics should be, and what the largest useful sample interval is.

It turns out that when send/recv stops altogether, the queue lengths indicate where things are being held up, which is what led to this discussion. I have them for TCP (via web100), but since my diagnosis rules are generic, I'd like to get them for the interfaces as well. I don't expect that an Ethernet driver would stop transmitting for a few hundred ms at a time, but a wireless driver might have to.

Justin
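
P.S. In case it helps make the 100 ms sampling concrete, here is a minimal sketch of what I have in mind for the interface side. It just scrapes the backlog counters out of "tc -s qdisc show dev <iface>" once per interval; a real implementation would query rtnetlink directly, and the interface name and interval below are only placeholders.

#!/usr/bin/env python
# Rough sketch: poll the qdisc backlog for one interface every 100 ms by
# scraping `tc -s qdisc show`. A real implementation would query rtnetlink
# (RTM_GETQDISC) directly instead of parsing tool output.
import re
import subprocess
import time

IFACE = "eth0"        # placeholder interface name
INTERVAL = 0.1        # 100 ms sample interval

# tc prints a line like: "backlog 128520b 85p requeues 0"
backlog_re = re.compile(r"backlog\s+(\S+)b\s+(\d+)p")

while True:
    out = subprocess.check_output(["tc", "-s", "qdisc", "show", "dev", IFACE])
    m = backlog_re.search(out.decode())
    if m:
        # bytes and packets currently sitting in the qdisc
        print("%.3f backlog=%sB %sp" % (time.time(), m.group(1), m.group(2)))
    time.sleep(INTERVAL)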