From: Erik Auerswald
To: bloat@lists.bufferbloat.net
Date: Sun, 10 Jan 2021 06:39:19 +0100
Subject: Re: [Bloat] Rebecca Drucker's talk sounds like it exposes an addressable bloat issue in Ciscos
Message-ID: <20210110053919.GA14073@unix-ag.uni-kl.de>

Hi,

On Sat Jan 9 18:01:32 EST 2021, David Collier-Brown wrote:

> At work, I recently had a database outage due to network saturation and
> timeouts, which we proposed to address by setting up a QOS policy for
> the machines in question.  However, from the discussion in Ms Drucker's
> BBR talk, that could lead us to doing /A Bad Thing/ (;-))

QoS policies are dangerous; they seldom work exactly as intended.

I'll assume you have already convinced yourself that you want to apply the QoS policy at a congested link, e.g., a WAN router, as opposed to shallow-buffered LAN switches running Cisco IOS.  If you are not using Cisco gear, then please do not assume that Cisco documentation can help you solve your problem. ;-)

> Let's start at the beginning, though.  The talk, mentioned before
> in the list[1], was about the interaction of BBR and large values of
> buffering, specifically for video traffic.  I attended it, and listened
> with interest to the questions from the committee.  She subsequently
> gave me a copy of the paper and presentation, which I appreciate:
> it's very good work.

The link to the talk announcement now leads to an error page.  I did not find slides or a paper either. :-(

> [...]
> Increasing the buffering in the test environment turns perfectly
> reasonable performance into a real disappointment
> [...]

Since I neither attended the talk nor could read a paper or look at presentation slides, I'll just continue with the assumption that BBR does not successfully mitigate bufferbloat effects even for video delivery (which I would assume to be an important use case for Google, specifically YouTube).

> [...]
> One of the interesting questions was about the token-bucket algorithm
> used in the router to limit performance.  In her paper, she discusses
> the token bucket filter used by OpenWRT 19.07.1 on a Linksys WRT1900ACS
> router.  Allowing more than the actual bandwidth of the interface as
> the /burst rate/ can exacerbate the buffering problem, so the listener
> was concerned that routers "in the wild" might also be contributing
> to the poor performance by using token-bucket algorithms with "excess
> burst size" parameters.

The burst *time* is essential in any QoS configuration, because only the combination of time, size, and interface speed allows one to reason about the behaviour.  Most QoS documentation for enterprise networking gear glosses over this, since the time is usually not configurable and varies widely between devices and device generations.

In my experience, asking about token-bucket algorithm details is often a sign that the asker cannot see the forest for the trees.

> The very first Cisco manual I found in a Google search explained how
> to */set/* excess burst size (!)
>
> https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/qos_plcshp/configuration/12-4/qos-plcshp-12-4-book.pdf

IOS 12.4 is quite old.  I do not expect current documentation to have improved significantly, but IOS 12.4 was a thing well before CoDel existed.

> [...]
> A little later, it explains that the excess or "extended" burst size
> /exists so as to avoid tail-drop behavior, and, instead,
> engage behavior like that of Random Early Detection (RED)./

The desire to avoid tail-drop is rooted in the desire to maximize throughput.  As long as the queue is short, tail-drop is not a problem in practice, assuming you just want a working network and have sufficient network capacity to support your applications.

> [...]
> So, folks, am I right in thinking that Cisco's recommendation just
> might be a /terrible/ piece of advice?

No comment. ;-)

> As a capacity planner, it sounds a lot like they're praying for a
> conveniently timed lull after every time they let too many bytes
> through.

Yes.  This is a necessary assumption if you want your packet-switched network to actually function.  The network must not be consistently overloaded, so that buffers only have to absorb bursts and are otherwise mostly empty.  TCP's congestion control is an attempt to achieve this despite end points having the capacity to overwhelm the network, combined with the desire to make good use of the available network capacity.

Bloated buffers break this scheme by excessively delaying the signals that TCP's congestion control needs in order to work.  Thus excessive buffers lead to persistent congestion, limited only by end points timing out.

> As a follower of the discussion here, the reference to tail drop and
> RED sound faintly ... antique.

One reason might be that you looked at antique documentation. ;-)  Looking at recent documentation does not really change this impression, though, at least in my experience.

Anyway, in an attempt to actually help you: Cisco IOS routers allow the configuration of the queue size (in packets).  Thus you could consider just limiting the queue size, which guarantees a maximum queuing delay even with MTU-sized packets.  That may well hurt throughput, but it transfers you back to a pre-bufferbloat time.

As long as the queues are short, you can consider fair queuing.  I would suggest not even attempting any prioritization, because chances are that makes the situation worse.
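To make the queue-size and token-bucket arithmetic concrete, here is a small back-of-the-envelope sketch in Python.  The link rate, MTU, burst time, and delay target below are made-up example numbers, not values from the talk or from any Cisco documentation; the point is only that burst size is rate times burst time, and that a packet-count queue limit bounds the worst-case queuing delay.

    # Back-of-the-envelope QoS arithmetic (example values, not recommendations).
    LINK_RATE_BPS  = 100_000_000   # assumed 100 Mbit/s congested link
    MTU_BYTES      = 1500          # typical Ethernet MTU
    BURST_TIME_S   = 0.005         # assumed 5 ms burst tolerance
    TARGET_DELAY_S = 0.010         # assumed 10 ms worst-case queuing delay

    # A token-bucket burst size is only meaningful together with a time:
    #   burst [bytes] = rate [bytes/s] * burst time [s]
    burst_bytes = LINK_RATE_BPS / 8 * BURST_TIME_S
    print(f"burst size for {BURST_TIME_S * 1000:.0f} ms at line rate: {burst_bytes:.0f} bytes")

    # Worst-case delay of a FIFO holding N MTU-sized packets:
    #   delay = N * MTU * 8 / link rate
    # Solve for N to bound the delay at the target:
    queue_packets = int(TARGET_DELAY_S * LINK_RATE_BPS / (MTU_BYTES * 8))
    worst_case_ms = queue_packets * MTU_BYTES * 8 / LINK_RATE_BPS * 1000
    print(f"queue limit: {queue_packets} packets -> worst case ~{worst_case_ms:.1f} ms")

With these example numbers that works out to a roughly 62 kB burst and a limit of about 83 packets for a 10 ms bound.  How such a packet count is actually configured, and at what granularity, depends on the platform and software version, so check the documentation for your specific devices.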
With Cisco IOS, beware (strict) priority queuing, since a priority queue there *always* has a policer, and your traffic usually does not conform to your mental model.

Please be aware that Cisco sells routers with different operating systems, and even within one operating system family, QoS details vary widely; thus I would suggest you carefully search for the documentation for your specific devices.

Cisco (and most of the other enterprise network device vendors) provide many tuning knobs.  Many even try to give helpful advice in their documentation.  But QoS is a sufficiently hard problem that it is not yet solved by a widely available "do the right thing" tuning knob in specialized networking gear (I am explicitly excluding Linux-based home routers and similar devices here).

Generic advice on how to tune networking gear for QoS purposes is nigh impossible.  As a result, QoS configurations often create more problems than they solve, and I do not even think this is an addressable documentation issue.

Here be dragons.  Just Say No.

To preempt vendor fans: I am not bashing Cisco, but the original email explicitly mentioned Cisco.  IMHO all the vendors are similar in a generic sense, with specific differences for specific use cases.  Some vendors are worse because they hide their documentation from the public, and hide more of their implementation details than, for argument's sake, Cisco.

Thanks and HTH,
Erik

P.S. I actually solved quite a few QoS-related problems by disabling QoS.

P.P.S. Sometimes I solved QoS-related problems by introducing a QoS configuration.  YMMV.

--
In the beginning, there was static routing.
  -- RFC 1118