Bufferbloat is a huge drag on Internet performance created, ironically, by previous attempts to make it work better. The bad news is that bufferbloat is everywhere, in more devices and programs than you can shake a stick at. The good news is that bufferbloat is relatively easy to fix. The even better news is that fixing it may solve a lot of the service problems now addressed by bandwidth caps and metering, making the Internet faster and less expensive for both consumers and providers.

== Packets on the Highway ==

To fix bufferbloat, you first have to understand it. Start by imagining cars traveling down an imaginary road. They're trying to get from one end to the other as fast as possible, so they travel nearly bumper to bumper at the road's highest safe speed.

Our "cars" are standing in for Internet packets, of course, and our road is a network link. The 'bandwidth' of the link is like the total amount of stuff the cars can carry from one end to the other per second; the 'latency' is like the amount of time it takes any given car to get from one end to the other.

One of the problems road networks have to cope with is traffic congestion. If too many cars try to use the road at once, bad things happen. One of those bad things is cars running off the road and crashing. The Internet analog of this is called 'packet loss'. We want to hold it to a minimum.

There's an easy way to attack a road congestion problem that's not actually used much, because human drivers hate it. That's to interrupt the road with a parking lot. A car drives in, waits to be told when it can leave, and then drives out. By controlling the timing and rate at which you tell cars they can leave, you can hold the number of cars on the road downstream of the lot to a safe level.

For this technique to work, cars must enter the parking lot without slowing down; otherwise you'd cause a backup on the upstream side of the lot. Real cars can't do that, but Internet packets can, so please think of this as a minor bug in the analogy and then ignore it.

The other thing that has to be true is that the lot doesn't exceed its maximum capacity. That is, cars leave often enough relative to the speed at which they come in that there's always space in the lot for incoming cars. In the real world, this is a serious problem. On the Internet, extremely large parking lots are so cheap to build that it's difficult to fill them to capacity. So we can (mostly) ignore this problem with the analogy as well. We'll explain later what happens when we can't.

The Internet analog of our parking lot is a packet buffer. People who build network hardware and software have been trained to hate losing packets the same way highway engineers hate auto crashes. So they put lots of huge buffers everywhere on the network. In network jargon, this optimizes for bandwidth. That is, it maximizes the amount of stuff you can bulk-ship through the network without loss. The problem is that it does horrible things to latency. To see why, let's go back to our cars on the road.

Suppose your rule for when a car gets to leave the parking lot is the simplest possible: the lot fills up until it overflows, then cars are let out the downstream side as fast as they can go. This is not a very smart rule, and human beings wouldn't use it, but many Internet devices actually do, and it's a good place to start in understanding bufferbloat. (We'll call this rule Simple Overflow Queuing, or SOQU for short. Pronounce it "sock-you" or "soak-you"; you'll see why in a moment.)
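To make the SOQU rule concrete, here is a minimal sketch of it in Python. The names (SOQUBuffer, capacity, and so on) are invented for this illustration; no real router exposes its queue this way, but the behavior is the same: accept everything until the buffer is full, silently drop the overflow, and send in arrival order as fast as the downstream side will take it.

    from collections import deque

    class SOQUBuffer:
        """Simple Overflow Queuing: a fixed-size FIFO that accepts every
        packet until it is full, then silently drops the overflow."""

        def __init__(self, capacity):
            self.capacity = capacity      # how many "cars" the parking lot holds
            self.queue = deque()
            self.dropped = 0

        def enqueue(self, packet):
            if len(self.queue) >= self.capacity:
                self.dropped += 1         # the lot is full: this car "crashes"
                return False
            self.queue.append(packet)     # otherwise it just waits its turn
            return True

        def dequeue(self):
            # Packets leave in arrival order, as fast as the downstream link
            # will take them; nothing here ever asks how long a packet has
            # already been waiting.
            return self.queue.popleft() if self.queue else None

Notice that nothing in this rule looks at waiting time. The only limit is the lot's capacity, and the bigger the capacity, the longer a packet can end up sitting there.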
Now, here's how the flow of cars will look if the lot starts empty and the road is in light use. Cars will arrive at the parking lot, fill it up, and then proceed out the other side, and nobody will go off the road. But each car will be delayed by the time required to fill up the parking lot in the first place.

There's another effect, too. The parking lot turns smooth traffic into clumpy traffic. A constantly spaced string of cars coming in tends to turn into a series of clumps coming out, with the size of each clump controlled by the width of the exit from the parking lot. This is a problem, because car clumps tend to cause car crashes.

When this happens on the Internet, the buffer adds latency to the connection. Packets still arrive where they're supposed to go, but only after long delays. Smooth network traffic turns into a herky-jerky stuttering thing; as a result, packet loss rises. Performance is worse than if the buffer weren't there at all. And - this is an important point - the larger the buffer is, the worse the problems are.

== From Highway to Network ==

Now imagine a whole network of highways, each with parking lots scattered randomly along them and at their intersections. Cars trying to get through it will experience multiple delays, and initially smooth traffic will become clumpy and chaotic. Clumps from upstream buffers will clog downstream buffers that might have handled the same volume of traffic as a smooth flow, leading to serious and sometimes unrecoverable packet loss.

As the total traffic becomes heavier, network traffic patterns will grow burstier and more chaotic. Usage of individual links will swing rapidly and crazily between emptiness and overload. Latency, and total packet transit times, will zig from instantaneous to check-again-next-week-please and zag back again in no predictable pattern. Packet losses - the problem all those buffers were put in to prevent - will begin to increase once all the buffers are full, because the occasional crash is the only thing that can currently tell Internet routers to slow down their sending. It doesn't take too long before you start getting the Internet equivalent of 60-car pileups.

Bad consequences of this are legion. One of the most obvious is what latency spikes do to the service that converts things like website names to actual network addresses - DNS lookups get painfully slow. Voice-over-IP services like Skype and video streamers like YouTube become stuttery, prone to dropouts, and painful to use. Gamers get fragged more.

For the more technically inclined reader, there are several other important Internet service protocols that degrade badly in an environment with serious latency spikes: NTP, ARP, DHCP, and various routing protocols. Yes, things as basic as your system clock time can get messed up!

And - this is the key point - the larger and more numerous the buffers on the network are, the worse these problems get. This is the bufferbloat problem in a nutshell.
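How bad can the delay from a single buffer get? A rough upper bound is simply the buffer's size divided by the rate at which the link behind it drains. Here is the arithmetic in Python; the 256 KB buffer and the 1 Mbit/s uplink are made-up but plausible figures, chosen only to illustrate the scale.

    # Worst-case delay added by one full buffer = buffer size / link drain rate.
    buffer_bytes = 256 * 1024        # a 256 KB buffer (illustrative figure)
    link_bits_per_sec = 1_000_000    # draining into a 1 Mbit/s uplink

    delay = (buffer_bytes * 8) / link_bits_per_sec
    print(f"a full buffer adds up to {delay:.1f} seconds of delay")
    # prints roughly 2.1 seconds, on top of whatever delay the path already had

Two extra seconds is already enough to wreck a phone call or a game, and that is one buffer on a path that may pass through many.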
One of the most insidious things about bufferbloat is that it easily masquerades as something else: underprovisioning of the network. But buying fatter pipes doesn't fix the bufferbloat cascades, and buying larger buffers actually makes them worse!

Those of us who have been studying bufferbloat believe that many of the problems now attributed to under-capacity and bandwidth hogging are actually symptoms of bufferbloat. We think fixing the bufferbloat problem may well make many contentious arguments about usage metering, bandwidth caps, and tiered pricing unnecessary. At the very least, we think networks should be systematically audited for bufferbloat before more resources are plowed into fixing problems that may be completely misdiagnosed.

== Three Cures and a Blind Alley ==

Now that we understand bufferbloat, what can we do about it? We can start by understanding how we got into this mess: mainly, by equating "The data must get through!" with zero packet loss. Hating packet loss enough to want to stamp it out completely is actually a bad mental habit. Unlike real cars on real highways, the Internet is designed to respond to crashes by resending an identical copy when a packet send is not acknowledged. In fact, the Internet's normal mechanisms for avoiding congestion rely on the occasional packet loss to trigger them. Thus, the perfect is the enemy of the good; some packet loss is essential.

But, historically, the designers of network hardware and software have tended to break in the other direction, bloating buffers in order to drive packet losses to zero. Undoing this mistake will pay off hugely in improved network performance. There are three main tactics:

First, we can *pay attention*! Bufferbloat is easy to test for once you know how to spot it. Watching networks for bufferbloat cascades and fixing them needs to be part of the normal job duties of every network administrator.

Second, we can decrease buffer sizes. This cuts the queuing delay and decreases the clumping effect on the traffic. It can increase packet loss, but that problem is coped with pretty well by the Internet's normal congestion-avoidance methods. As long as packet losses remain unusual events (below the levels produced by bufferbloat cascades), resends will happen as needed and the data will get through.

Third, we can use smarter rules than SOQU for when and by how much a buffer should try to empty itself. That is, we need buffer-management rules that we can expect to statistically smooth network traffic rather than clumpify it. The reasons smarter rules have not been universally deployed already are mainly historical; now, this can and should be fixed.
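To give a flavor of what "smarter" can mean, here is a deliberately simplified Python sketch of a buffer that manages itself by how long packets have been waiting instead of by how full it is. The class name and the 5 ms target are invented for this illustration, and it is not any particular published algorithm; a production rule would have to drop far more sparingly. The point is only the change in what gets measured.

    import time
    from collections import deque

    TARGET_DELAY = 0.005   # 5 ms: an illustrative target, not a magic constant

    class DelayAwareBuffer:
        """A queue that watches waiting time rather than occupancy."""

        def __init__(self):
            self.queue = deque()

        def enqueue(self, packet):
            # Remember when each packet arrived.
            self.queue.append((time.monotonic(), packet))

        def dequeue(self):
            while self.queue:
                arrived, packet = self.queue.popleft()
                waited = time.monotonic() - arrived
                if waited > TARGET_DELAY and self.queue:
                    # This packet has waited too long and more are backed up
                    # behind it: drop it, so the sender gets its "slow down"
                    # signal early instead of after the delay has piled up.
                    continue
                return packet
            return None

Dropping the occasional packet on purpose sounds perverse, but as noted above it is exactly the signal the Internet's congestion-avoidance machinery is built to listen for, and here it arrives while the queue is still short.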
Next we need to point out one tactic that won't work. Some people think the answer to Internet congestion is to turn each link into a multi-lane highway, with fast lanes and slow lanes. The theory of QoS ("Quality of Service") is that you can put priority traffic in the fast lanes and bulk traffic in the slow ones. This approach has historical roots in things telephone companies used to do. It works well for analog traffic that doesn't use buffering, only switching. It doesn't work for Internet traffic, because all the lanes have to use the same buffers.

If you try to implement QoS on a digital packet network, what you end up with is extremely complicated buffer-management rules with so many brittle assumptions baked into them that they harm performance whenever the shape of network demand is even slightly different from what the rule-designer expected. Really smart buffer-management rules are simple enough not to have strange corner cases where they break down and jam up the traffic. Complicated ones break down and jam up the traffic. QoS rules are complicated.

== Less Hard ==

We started by asserting that bufferbloat is easy to fix. Here are the reasons for optimism:

First, it's easy to detect once you understand it - and verifying that you've fixed it is easy, too (a minimal way to check is sketched at the end of this section).

Second, the fixes are cheap and give direct benefits as soon as they're applied. You don't have to wait for other people to fix bufferbloat in their devices to improve the performance of your own.

Third, you usually only have to fix it once per device; continual tuning isn't necessary.

Fourth, it's basically all software fixes. No expensive hardware upgrades are required.

Finally (and importantly!), trying to fix it won't significantly increase your risk of a network failure. If you fumble the first time, it's reversible.
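As a rough illustration of what such a check might look like, here is a small Python script that compares round-trip time on an idle link with round-trip time while the link is busy. It assumes a Unix-style ping command; the target host 8.8.8.8 and the packet counts are arbitrary stand-ins, and any well-connected host and any way of saturating the link (a big upload, for instance) will do.

    import re
    import subprocess

    def average_ping_ms(host="8.8.8.8", count=10):
        """Average round-trip time to host, as reported by the system ping."""
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        times = [float(t) for t in re.findall(r"time=([\d.]+) ms", out)]
        return sum(times) / len(times) if times else float("nan")

    idle = average_ping_ms()
    input("Now start a large upload or download, then press Enter... ")
    loaded = average_ping_ms()
    print(f"idle: {idle:.0f} ms    under load: {loaded:.0f} ms")
    # If the loaded figure is several times the idle figure, that extra
    # delay is time spent sitting in a buffer somewhere on the path.

If latency balloons under load, the excess is time spent waiting in a buffer; run the same measurement again after a fix and you can verify that the fix worked.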