[Bloat] Please enter issues into the issue tracker - Issue system organisation needed.

Thu Feb 24 07:00:15 PST 2011

Thanks, Jim.

One thing that would help me: I have been a fan of RFC 2309 and RFC 3168 for some time. My suspicion is that, between them, any given queue should be manageable to a set depth. Tests I have run suggest that with RED settings, the average queue depth under load approximates min-threshold pretty closely, and ECN has the advantage that it does so without dropping traffic. I expect that this community's efforts will support that. Some thoughts:
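For readers less familiar with the mechanism, RED with ECN marking can be sketched roughly as follows. This is a minimal illustration in the spirit of RFC 2309 and RFC 3168, not any vendor's implementation. Thresholds are in packets, the parameter values are arbitrary, the idle-period correction to the average is omitted, and dropping (rather than marking) once the average exceeds max-threshold is an assumption of this sketch. `RedQueue` and `Packet` are hypothetical names.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    size: int          # bytes
    ect: bool          # sender set an ECN-Capable Transport codepoint
    ce: bool = False   # Congestion Experienced, set by the router


class RedQueue:
    """Sketch of RED (RFC 2309 style) with RFC 3168 ECN marking.

    Thresholds are in packets; all default values are illustrative only.
    """

    def __init__(self, min_th=5.0, max_th=15.0, max_p=0.1, w_q=0.002,
                 limit=100, rng=None):
        self.min_th, self.max_th = min_th, max_th
        self.max_p, self.w_q = max_p, w_q
        self.limit = limit              # physical buffer size in packets
        self.avg = 0.0                  # EWMA of the instantaneous queue length
        self.count = -1                 # packets since the last mark/drop
        self.queue = deque()
        self.rng = rng or random.Random(0)

    def enqueue(self, pkt):
        """Return "enqueued", "marked" or "dropped" for an arriving packet."""
        # Low-pass filter the queue length so short bursts are tolerated.
        self.avg = (1 - self.w_q) * self.avg + self.w_q * len(self.queue)

        if len(self.queue) >= self.limit or self.avg >= self.max_th:
            self.count = 0              # assumption: drop, never mark, here
            return "dropped"
        if self.avg < self.min_th:
            self.count = -1
            self.queue.append(pkt)
            return "enqueued"

        # Between the thresholds the base probability ramps from 0 to max_p.
        self.count += 1
        p_b = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        # Raising the probability with count spaces signals out evenly.
        denom = 1 - self.count * p_b
        p_a = 1.0 if denom <= 0 else p_b / denom
        if self.rng.random() < p_a:
            self.count = 0
            if pkt.ect:
                pkt.ce = True           # ECN: signal congestion, keep the packet
                self.queue.append(pkt)
                return "marked"
            return "dropped"
        self.queue.append(pkt)
        return "enqueued"

    def dequeue(self):
        """Called at the line rate; returns None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

A caller would invoke `enqueue()` for each arriving packet and `dequeue()` as the line drains. Because the signal probability is zero below min-threshold and grows above it, responsive senders tend to hold the average near the bottom of the ramp, which is the behaviour described above.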

First, if the premise is wrong or there is a materially better solution, I'm all ears.

Second, if the premise is correct, I'd like data that I can put in front of people to get them to configure it.

Third, there is a long-standing debate between Van and Sally on what units to use with min-threshold. Sally argues, or argued, in favor of byte count, as a byte count correlates with queuing time and biases mark/drop toward large datagrams, which is to say datagrams carrying data, which happen to be the datagrams that act as signals to Reno et al. Van argues, or argued, in favor of buffers, as what is being managed is the router's buffer resource. In our implementations we provide for both options, and personally Sally's model makes more mathematical sense to me. Is there a "best practice" we can document, both on the choice of units and on min-threshold and max-threshold settings?
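The two positions can be made concrete with a small sketch. In packet mode the average queue and thresholds are counted in buffers and every packet sees the same probability. In byte mode, as described for RED's byte mode, the queue is counted in bytes and the probability is scaled by packet size relative to a maximum packet size, so small ACKs are rarely hit. The 1500-byte maximum, the example thresholds and all function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass

MAX_PACKET = 1500  # assumed maximum packet size in bytes


@dataclass
class RedParams:
    min_th: float  # packets in packet mode, bytes in byte mode
    max_th: float
    max_p: float


def base_prob(avg, p):
    """Linear RED ramp; identical in both modes apart from the units of avg."""
    if avg < p.min_th:
        return 0.0
    if avg >= p.max_th:
        return 1.0
    return p.max_p * (avg - p.min_th) / (p.max_th - p.min_th)


def packet_mode_prob(avg_pkts, pkt_size, p):
    # Queue counted in buffers: a 40-byte ACK and a 1500-byte data
    # segment see the same probability. pkt_size is deliberately unused.
    return base_prob(avg_pkts, p)


def byte_mode_prob(avg_bytes, pkt_size, p):
    # Queue counted in bytes, which tracks queuing delay at a fixed line
    # rate, and the probability is scaled by packet size, biasing signals
    # toward large, data-carrying packets.
    pb = base_prob(avg_bytes, p)
    if pb >= 1.0:
        return 1.0  # above max_th every packet is signalled regardless of size
    return min(1.0, pb * pkt_size / MAX_PACKET)


pkt_params = RedParams(min_th=5, max_th=15, max_p=0.1)
byte_params = RedParams(min_th=5 * MAX_PACKET, max_th=15 * MAX_PACKET, max_p=0.1)
```

With an average of 10 full-size packets (15,000 bytes) queued, packet mode gives both a 40-byte ACK and a 1500-byte segment a probability of 0.05, while byte mode gives 0.05 to the segment and roughly 0.0013 to the ACK.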

In private email, I shared an approach that might make tuning a little more reliable and not require a max-threshold. Any material being developed that would allow us to make the AQM algorithm self-tuning, whether an updated version of RED-Lite or experience with other approaches, would be of great interest. Any such self-tuning algorithm should be usable with dropping or marking; keep the line operating at full utilization as long as there is traffic to send (e.g., not depend on the line occasionally going idle); maintain the queue at a "reasonably low delay" level under normal circumstances; not force a given session to shut down entirely; and, in the normal case, not cause multiple drops on the same session within the same RTT.
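The approach from the private email is not reproduced here. As one published point of comparison, Adaptive RED (Floyd, Gummadi and Shenker, 2001) self-tunes only max_p: once per interval (0.5 s in the paper) it raises max_p additively when the average queue sits above a target band in the middle of the thresholds, and lowers it multiplicatively when the average sits below. It still relies on a max-threshold, so it addresses only part of the goal above. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveMaxP:
    """Sketch of Adaptive RED's max_p adjustment; thresholds stay fixed."""
    min_th: float
    max_th: float
    max_p: float = 0.1

    def target_band(self):
        # Target the middle 20% of the range between the thresholds.
        span = self.max_th - self.min_th
        return self.min_th + 0.4 * span, self.min_th + 0.6 * span

    def adapt(self, avg):
        """Call once per interval with the current average queue length."""
        lo, hi = self.target_band()
        if avg > hi and self.max_p <= 0.5:
            # Additive increase, capped so max_p never grows by more than 25%.
            self.max_p += min(0.01, self.max_p / 4)
        elif avg < lo and self.max_p >= 0.01:
            self.max_p *= 0.9  # multiplicative decrease
        return self.max_p
```

In a router this would run from a periodic timer, with the enqueue path reading the current `max_p`; the additive-increase, multiplicative-decrease shape is meant to let max_p move slowly enough that the adaptation does not itself oscillate.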

There is one special case that I have wondered about from time to time: the impact of loss of SYNs or SYN-ACKs. The network in which I started thinking about that was an African network that was seriously underprovisioned (they needed to, and eventually did, spend more money) on a satcom link. In essence, I wondered if there was a way that one could permit the first or second retransmission of a SYN, as opposed to the initial one, to get through in times of heavy load. The effect might be to let an existing session quiesce. That falls under "research" :-)

We have issues with at least some of our hardware in this; on the GSR, for example, queues are on the output card but IP processing is on the input card, meaning that we have lost all IP-related information by the time one would like to set ECN CE or inspect the DSCP value, and on the input card we have no real-time (microsecond-scale) way to inspect queue depth or integrated rate of a queue on the output card. The GSR is a mite elderly, but still widely used, and no, folks aren't going to replace cards at this stage in its life. So, ideas people have on working around such issues would be of interest.

On Feb 24, 2011, at 7:19 AM, Jim Gettys wrote:

> We have lots of different issues to track. We are uncovering more and more with time, and the responsibility for the issues is all over the Internet ecology.
> 
> These issues include drivers in multiple operating systems, queue disciplines, OS distribution problems, broken networks, broadband gear, ISPs with broken configurations, routers with broken configurations, etc., etc., etc. Many of the responsible organisations are completely unaware they have issues at the moment, and when they do wake up, they need to have a work list. Serious as bufferbloat is, and generating tremendous support costs as it does, it is hidden in most organisations' issue tracking as obscure, hard-to-explain problems that have heretofore defied analysis.
> 
> For the sanity of the upstream open source projects and the companies that depend on them, of commercial software and hardware vendors, and of ourselves, I think it's time to start keeping track of these problems.
> 
> A simple example is in the following mail, where Juliusz identified a bunch of Linux drivers with problems communicating back-pressure.
> https://lists.bufferbloat.net/pipermail/bloat/2011-February/000036.html
> 
> These driver bugs, of course, can and will be worked upstream in the project and/or responsible organisation; but from a practical point of view, these issues aren't really going to be fixed until people can actually take action on their own (by upgrading affected OS's, routers, broadband gear, etc. as appropriate).
> 
> So I think we need to track bufferbloat issues in possibly a different way (and maybe with a bit different work flow) than a usual tracking system.
> 
> First
> =====
> I think we need to capture what we know.  I encourage people to start entering issues in the bloat tracker found at:
> 
> http://www.bufferbloat.net/projects/bloat/issues/new
> 
> Note that Redmine lets us move issues from one (sub)project to another, so we're best off capturing what we know immediately; we can sort and redeal later.
> 
> Note: "We're all bozos on this glass bus, no stones allowed".  We know there are problems all over; issue descriptions should always be polite and constructive, please!
> 
> Noting these issues will help people already involved (the mailing list had more than 120 people when I last looked, from a large number of organisations) take concrete action. Issues buried in mail threads are too easy to lose.
> 
> Second
> ======
> As this effort grows, we'll need to organise the result, and delegate it appropriately as the effort scales.
> 
> Today, we're probably best off with a single project, but we certainly hope that won't remain reasonable for long; the effort may outgrow it almost immediately.
> 
> We installed Redmine in particular as it has a competent issue tracking system, as well as good (sub)project management, which can easily be delegated to others (one of the huge problems with Bugzilla or Trac is the lack of project management).
> 
> If anyone is looking for a way to help bufferbloat and has experience with tracking systems on large, complex projects, I'd love to see someone organise this effort, and put some thought and structure into the categories, (sub)projects and work flow of issue states. I know from my OLPC experience just how important this can be, though this is a somewhat different situation.
> 
> 
> 			Best regards,
> 				- Jim
> 
> 
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat