[Bloat] Rebecca Drucker's talk sounds like it exposes an addressable bloat issue in Ciscos

Erik Auerswald auerswal at unix-ag.uni-kl.de
Sun Jan 10 00:39:19 EST 2021


Hi,

On Sat Jan 9 18:01:32 EST 2021, David Collier-Brown wrote:
> At work, I recently had a database outage due to network saturation and
> timeouts, which we proposed to address by setting up a QOS policy for
> the machines in question. However, from the discussion in Ms Drucker's
> BBR talk, that could lead us to doing /A Bad Thing/ (;-))

QoS policies are dangerous; they seldom work exactly as intended.

I'll assume you have already convinced yourself that you want to apply
the QoS policy at a congested link, e.g., a WAN router, as opposed to
shallow buffered LAN switches running Cisco IOS.

If you are not using Cisco gear, then please do not assume that Cisco
documentation can help you solve your problem. ;-)

> Let's start at the beginning, though.  The talk, mentioned before
> in the list[1], was about the interaction of BBR and large values of
> buffering, specifically for video traffic.  I attended it, and listened
> with interest to the questions from the committee. She subsequently
> gave me a copy of the paper and presentation, which I appreciate:
> it's very good work.

The link to the talk announcement leads to an error page now.  I did
not find slides or a paper either. :-(

> [...]
> Increasing the buffering in the test environment turns perfectly 
> reasonable performance into a real disappointment
> [...]

Since I neither attended the talk nor could read a paper or look at
presentation slides, I'll just continue with the assumption that BBR
does not successfully mitigate bufferbloat effects even for video
delivery (which I would assume to be an important use case for Google,
or rather YouTube).

> [...]
> One of the interesting questions was about the token-bucket algorithm
> used in the router to limit performance. In her paper, she discusses
> the token bucket filter used by OpenWRT 19.07.1 on a Linksys WRT1900ACS
> router. Allowing more than the actual bandwidth of the interface as
> the /burst rate/ can exacerbate the buffering problem, so the listener
> was concerned that routers "in the wild" might also be contributing
> to the poor performance by using token-bucket algorithms with "excess
> burst size" parameters.

The burst *time* is essential in any QoS configuration, because only
the combination of time, size, and interface speed allows one to reason
about the behaviour.  Most QoS documentation for enterprise networking
gear glosses over this, since the time is usually not configurable,
and varies widely between devices and device generations.
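To make that concrete, here is a minimal sketch (plain Python, nothing
vendor-specific; the link speeds are illustrative examples) of how
burst size and interface speed combine into a burst time:

```python
def burst_time_ms(burst_bytes: float, line_rate_bps: float) -> float:
    """Time the interface needs to drain one full token-bucket burst, in ms."""
    return burst_bytes * 8 / line_rate_bps * 1000

# The same 16 kB burst means very different things on different links:
# on 10 Mbit/s it occupies the line for 12.8 ms, on 1 Gbit/s for 0.128 ms.
print(burst_time_ms(16_000, 10e6))   # 10 Mbit/s link
print(burst_time_ms(16_000, 1e9))    # 1 Gbit/s link
```

So a burst size that is harmless on a fast LAN interface can translate
into a long queuing spike on a slow WAN link.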

In my experience, asking about token-bucket algorithm details is often
a sign that the asker cannot see the forest for the trees.

> The very first Cisco manual I found in a Google search explained how
> to */set/* excess burst size (!)
> 
> https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/qos_plcshp/configuration/12-4/qos-plcshp-12-4-book.pdf 

IOS 12.4 is quite old.  I do not expect current documentation to have
improved significantly, but IOS 12.4 was a thing well before CoDel existed.

> [...]
> A little later, it explains that the excess or "extended" burst size 
> /exists so as to avoid tail-drop behavior, and, instead,
> engage behavior like that of Random Early Detection (RED)./

The desire to avoid tail-drop is rooted in the desire to maximize
throughput.  As long as the queue is short, tail-drop is not a problem
in practice, assuming you just want a working network and have sufficient
network capacity to support your applications.

> [...]
> So, folks, am I right in thinking that Cisco's recommendation just
> might be a /terrible/ piece of advice?

No comment. ;-)

> As a capacity planner, it sounds a lot like they're praying for a
> conveniently timed lull after every time they let too many bytes
> through.

Yes.

This is a necessary assumption if you want your packet-switched
network to actually function.  The network must not be consistently
overloaded; buffers should only absorb bursts and be mostly empty.
TCP's congestion control is an attempt to reach this despite end points
having the capacity to overwhelm the network, combined with the desire
to make good use of the available network capacity.

Bloated buffers break this scheme by delaying, for too long, the signals
that TCP's congestion control requires in order to work.  Thus excessive
buffers lead to persistent congestion, limited only by end points
timing out.
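How badly a full buffer delays those signals is simple arithmetic; a
rough sketch (the buffer and link figures are illustrative, not from
any specific device):

```python
def queuing_delay_ms(queued_bytes: float, link_rate_bps: float) -> float:
    """Extra delay a packet sees behind queued_bytes of standing queue, in ms."""
    return queued_bytes * 8 / link_rate_bps * 1000

# A 1 MB standing queue on a 10 Mbit/s link adds 800 ms of delay --
# far beyond what a typical RTT-based congestion signal can tolerate.
print(queuing_delay_ms(1_000_000, 10e6))
```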

> As a follower of the discussion here, the reference to tail drop and
> RED sound faintly ... antique.

One reason might be that you looked at antique documentation. ;-)
Looking at recent documentation does not really change this impression,
though, at least in my experience.

Anyway, in an attempt to actually help you:  Cisco IOS routers allow the
configuration of the queue size (in packets).  Thus you could consider
just limiting the queue size to guarantee a maximum queuing delay
with MTU-sized packets.  That may well hurt throughput, but transfers
you back to a pre-bufferbloat time.  As long as the queues are short,
you can consider fair queuing.  I would suggest not even attempting
any prioritization, because chances are that makes the situation worse.
With Cisco IOS, beware (strict) priority queuing, since a priority queue
there *always* has a policer, and your traffic usually does not conform
to your mental model.
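As an illustration of the sizing reasoning above, here is a
back-of-the-envelope calculation (plain Python; the 100 Mbit/s link
and 10 ms target are hypothetical examples, check your own numbers):

```python
def queue_limit_packets(max_delay_ms: float, link_rate_bps: float,
                        mtu_bytes: int = 1500) -> int:
    """Largest queue, in MTU-sized packets, that still bounds queuing delay."""
    max_queued_bytes = max_delay_ms / 1000 * link_rate_bps / 8
    return int(max_queued_bytes // mtu_bytes)

# Bounding queuing delay to 10 ms on a 100 Mbit/s WAN link:
# 0.010 s * 100e6 bit/s / 8 = 125000 bytes -> 83 full-size packets.
print(queue_limit_packets(10, 100e6))
```

The resulting packet count is what you would feed into the queue-size
knob of your particular device, whatever that knob is called there.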

Please be aware that Cisco sells routers with different operating
systems, and even within one operating system family QoS details vary
widely, so I would suggest you carefully search for the documentation
for your specific devices.

Cisco (and most of the other enterprise network device vendors)
provide many tuning knobs.  Many even try to give helpful advice in
their documentation.  But QoS is a sufficiently hard problem that it
is not yet solved by a widely available "do the right thing" tuning
knob in specialized networking gear (I am explicitly excluding Linux
based home routers and similar devices here).  Generic advice on how to
tune networking gear for QoS purposes is nigh impossible.  As a result,
QoS configurations often create more problems than they solve.

As a result, I do not even think this is an addressable documentation
issue.

Here be dragons.  Just Say No.

To preempt vendor fans:  I am not bashing Cisco, but the original
email explicitly mentioned Cisco.  IMHO all the vendors are similar
in a generic sense, with specific differences for specific use cases.
Some vendors are worse because they hide their documentation from the
public, and hide more of their implementation details than, for argument's
sake, Cisco.

Thanks and HTH,
Erik

P.S. I actually solved quite a few QoS related problems by disabling QoS.

P.P.S. Sometimes I solved QoS related problems by introducing a QoS
       configuration.  YMMV.
-- 
In the beginning, there was static routing.
                        -- RFC 1118
