[Bloat] Getting current interface queue sizes

Jim Gettys jg at freedesktop.org
Mon Mar 7 14:01:49 PST 2011


On 03/07/2011 04:18 PM, Justin McCann wrote:
> On Mon, Mar 7, 2011 at 1:28 PM, Jim Gettys <jg at freedesktop.org> wrote:
>
>     Cisco is far from unique.  I found it impossible to get this
>     information from Linux.  Dunno about other operating systems.
>     It's one of the things we need to fix in general.
>
>
> So I'm not the only one. :) I'm looking to get this for Linux, and am
> willing to implement it if necessary, and was looking for the One True
> Way. I assume reporting back through netlink is the way to go.

Please do.
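
For what it's worth, the qdisc layer does already export its own queue
counters over rtnetlink -- the qlen/backlog numbers that "tc -s qdisc"
prints.  What's missing is anything equivalent for the driver/hardware
rings underneath the qdisc.  Just to show the shape of the existing
interface, here's a rough sketch that dumps those counters (error
handling mostly omitted; it sees only the qdisc, not the driver ring):

/* Rough sketch: dump the qdisc qlen/backlog counters over rtnetlink
 * (the same numbers "tc -s qdisc" prints).  This is the qdisc layer
 * only; it says nothing about packets sitting in the driver's own
 * transmit ring(s) below it. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/pkt_sched.h>

int main(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
    struct {
        struct nlmsghdr nlh;
        struct tcmsg    tcm;
    } req = {
        .nlh = { .nlmsg_len   = NLMSG_LENGTH(sizeof(struct tcmsg)),
                 .nlmsg_type  = RTM_GETQDISC,
                 .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP },
        .tcm = { .tcm_family = AF_UNSPEC },
    };
    char buf[16384];
    int len;

    sendto(fd, &req, req.nlh.nlmsg_len, 0,
           (struct sockaddr *)&kernel, sizeof(kernel));

    while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

        for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
            if (nlh->nlmsg_type == NLMSG_DONE) {
                close(fd);
                return 0;
            }
            if (nlh->nlmsg_type == NLMSG_ERROR) {
                close(fd);
                return 1;
            }
            if (nlh->nlmsg_type != RTM_NEWQDISC)
                continue;

            struct tcmsg *tcm = NLMSG_DATA(nlh);
            struct rtattr *rta;
            int alen = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*tcm));
            char name[IF_NAMESIZE] = "?";

            if_indextoname(tcm->tcm_ifindex, name);
            for (rta = TCA_RTA(tcm); RTA_OK(rta, alen);
                 rta = RTA_NEXT(rta, alen)) {
                if (rta->rta_type == TCA_STATS) {
                    struct tc_stats *st = RTA_DATA(rta);
                    printf("%s: qlen %u packets, backlog %u bytes\n",
                           name, st->qlen, st->backlog);
                }
            }
        }
    }
    close(fd);
    return 0;
}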

The lack of these stats meant I had to base my conclusions on the
experiments I described in my blog using ethtool, and extrapolate to
wireless (since Linux wireless drivers have not implemented the ring
control commands in the past).  I first went looking for the simple
instantaneous number of packets queued, and came up empty.  The
hardware drivers, of course, often run multiple queues these days.
There went several more days of head-scratching and emailing people,
when a simple number would have answered the question so much faster.
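
To be concrete about what ethtool does give you: it reports the
configured ring sizes, not how many packets are actually sitting in
the ring.  A minimal sketch of that query, using the same
ETHTOOL_GRINGPARAM ioctl that "ethtool -g eth0" uses (the interface
name is just an example; many wireless drivers will simply fail here):

/* Minimal sketch: read the rx/tx ring sizes via the ETHTOOL_GRINGPARAM
 * ioctl (what "ethtool -g <dev>" prints).  Note this is the configured
 * ring size, not the number of packets currently queued in it. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "eth0";   /* example name */
    struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&ring;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("SIOCETHTOOL");    /* wireless drivers often lack this */
        close(fd);
        return 1;
    }
    printf("%s: rx ring %u/%u, tx ring %u/%u (current/max descriptors)\n",
           dev, ring.rx_pending, ring.rx_max_pending,
           ring.tx_pending, ring.tx_max_pending);
    close(fd);
    return 0;
}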


>
>     Exactly what the right metric(s) is (are), is interesting, of
>     course. The problem with only providing instantaneous queue depth is
>     that while it tells you you are currently suffering, it won't really
>     help you detect transient bufferbloat due to web traffic, etc,
>     unless you sample at a very high rate.  I really care about those
>     frequent 100-200ms impulses I see in my traffic. So a bit of
>     additional information would be goodness.
>
>
> My PhD research is focused on automatically diagnosing these sorts of
> hiccups on a local host. I collect a common set of statistics across the
> entire local stack every 100ms, then run a diagnosis algorithm to detect
> which parts of the stack (connections, applications, interfaces) aren't
> doing their job sending/receiving packets.

I don't mind (for diagnosis) a tool querying once every 100ms; the
problem is that transient bufferbloat has much finer time scales than
this, and you can easily miss these short impulses that are themselves
only tens or a hundred milliseconds.  So instantaneous measurements by
themselves will miss a lot.
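
To make that concrete: if all you can read is the instantaneous depth,
catching impulses that last only tens of milliseconds means polling at
millisecond granularity, as in the toy sketch below, which reports the
maximum depth seen in each 100ms window.  If the kernel kept that
high-watermark itself, a once-per-100ms poll would get the same
information.  read_queue_depth() here is purely hypothetical, standing
in for whatever counter eventually gets exported:

/* Toy sketch: poll an instantaneous queue-depth counter every 1 ms and
 * report the maximum seen in each 100 ms window.  A plain once-per-100ms
 * sample would usually miss impulses lasting only tens of milliseconds. */
#include <stdio.h>
#include <unistd.h>

/* Purely hypothetical stand-in: returns a dummy value so the sketch
 * compiles.  Replace with however the per-queue depth ends up being
 * exported (netlink, sysfs, ...). */
static unsigned int read_queue_depth(const char *ifname)
{
    (void)ifname;
    return 0;
}

int main(void)
{
    for (;;) {
        unsigned int high = 0;
        int i;

        for (i = 0; i < 100; i++) {      /* 100 x 1 ms = one window */
            unsigned int depth = read_queue_depth("eth0");
            if (depth > high)
                high = depth;
            usleep(1000);                /* ~1 ms; fine for a sketch */
        }
        printf("max depth over last ~100 ms: %u packets\n", high);
    }
    return 0;
}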

>
> Among the research questions: What stats are necessary/sufficient for
> this kind of diagnosis, What should their semantics be, and What's the
> largest useful sample interval?

Dunno.  I suspect we should take this class of questions to the bloat 
list until it's time to write code, as we'll drown out people doing 
development here on the bloat-devel list.

>
> It turns out that when send/recv stops altogether, the queue lengths
> indicate where things are being held up, leading to this discussion. I
> have them for TCP (via web100), but since my diagnosis rules are
> generic, I'd like to get them for the interfaces as well. I don't expect
> that the Ethernet driver would stop transmitting for a few 100 ms at a
> time, but a wireless driver might have to.
>

Certainly true under some circumstances, particularly on 3G.  On 3G,
unless you are really obnoxious to the carrier (which I don't
recommend), when you go idle, it will take quite a while before you
can get airtime again.
			- Jim



