From: Jonathan Morton
Date: Fri, 11 Mar 2011 01:29:28 +0200
To: bloat@lists.bufferbloat.net
Subject: [Bloat] Mitigating bufferbloat at the receiver

So far I've seen plenty of talk about removing bufferbloat at the sending side (TCP congestion window, device/driver buffers) and in the network (AQM, ECN). These are all good, but I'd like to talk about what we can do today at the receiver.

Once upon a time, I spent a few months living in the back of my parents' house. By British standards, this was in the middle of absolutely nowhere, and the phone line quality was *dreadful*. To stop my analogue modem dropping out more than once an hour, or suffering painfully long retraining periods against a type of noise that retraining really wasn't designed to solve, I had to force it all the way down to 4800 baud - the lowest speed available without dropping back to ancient modulations with no robustness designed in.

Needless to say, at 5 kbps the bufferbloat problem in the ISP's modem bank was pretty bad - and this was in about 2003, so anyone who says this is a new problem is either lying or ignorant. I soon got fed up enough to tune Linux's receive window limit down to about 4 packets, which was still several seconds' worth, but allowed me to use the connection for more than one thing at once - handy when I wanted to experiment with Gentoo.

Incidentally, I was already using a form of split-TCP, in that I had installed a web cache on the modem-equipped machine.
This meant that I only had to tune one TCP in order to get the vast majority of the benefit. BitTorrent hadn't taken off at that point, so Web and FTP traffic were the main bandwidth users, and everything else was interactive and didn't need tuning. Meanwhile, for uploads I turned on SFQ and left it at that, all the while wondering why ISPs didn't use SFQ in their modem banks.

Without receive window scaling, I didn't see any particular problem with congestion control per se. I just saw that TCP was opening the congestion window to match the receive window - which was tuned for LANs and clean phone lines at the time - and that interactive packets had to wait behind the bulk traffic. SFQ would have reduced the latency for interactive traffic and for setting up new connections, while bulk traffic would still have worked as well as before.

Much more recently, I have more than once had to spend extended periods using a 3G modem (well, a tethered Nokia phone) as my primary connection (thank goodness that here in Finland, unlimited connections are actually available). I soon discovered that the problem I had seen at 5 kbps was just as bad at 500 kbps.

By now receive window scaling was enabled by default in Linux, so the receive window and congestion window both grew until they hit the end of the buffer in the 3G base station - which proved to hold something like 30 seconds' worth. Furthermore, interactive traffic was *still* waiting behind the bulk traffic, indicating a plain drop-tail queue.

Read that again. THIRTY SECONDS of latency under a single bulk TCP.

The practical effect was that I could watch Gentoo downloading source packages: it would proceed smoothly at line speed for a while, then abruptly stop - due to a dropped packet. Many seconds later it would suddenly jump ahead, then usually continue smoothly for a little longer before abruptly stopping again.
This was while using a geographically local mirror.

I quickly concluded that since ISPs were *still* not using anything like SFQ despite the enormous cost of 3G base equipment, they were simply as dumb as rocks - and the lack of ECN confirmed it. So I started poking around to see what I could do about it, half-remembering the effect of cutting the receive window years before.

I've attached the kernel patch that I came up with, and which I've been running on my gateway box ever since (even though I have my ADSL back). It measures the actual bandwidth of the flow (based on RTT and window size), calculates an appropriate window size from that, and then increments the current window towards it. This got me down to about 1 second of buffering on the 3G link, which I considered basically acceptable (in comparison). At higher bandwidths the latency is lower - or to put it another way, at lower latencies the available bandwidth is increased. The acceptable latency is also capped at 2 seconds as a safety valve.

As it happens, 2 seconds of latency is pretty much the maximum for acceptable TCP setup performance, because the initial RTO for TCP is supposed to be 3 seconds. With 30 seconds of latency, TCP will *always* retransmit several times during the setup phase, even if no packets are actually lost. So that's something to tell your packet-loss-obsessed router chums: forget packet loss, minimise retransmits - and explain why packet loss and retransmits are not synonymous!
 - Jonathan

[Attachment: blackpool.patch]

diff -urp linux-2.6.28.7/net/ipv4/tcp_input.c linux-2.6.28.7-blackpool/net/ipv4/tcp_input.c
--- linux-2.6.28.7/net/ipv4/tcp_input.c	2009-02-21 00:41:27.000000000 +0200
+++ linux-2.6.28.7-blackpool/net/ipv4/tcp_input.c	2009-10-01 00:13:32.000000000 +0300
@@ -70,6 +70,7 @@
 #include
 #include
 #include
+#include

 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
@@ -435,29 +436,42 @@ static void tcp_rcv_rtt_update(struct tc
 	if (m == 0)
 		m = 1;

-	if (new_sample != 0) {
-		/* If we sample in larger samples in the non-timestamp
-		 * case, we could grossly overestimate the RTT especially
-		 * with chatty applications or bulk transfer apps which
-		 * are stalled on filesystem I/O.
-		 *
-		 * Also, since we are only going for a minimum in the
-		 * non-timestamp case, we do not smooth things out
-		 * else with timestamps disabled convergence takes too
-		 * long.
-		 */
-		if (!win_dep) {
-			m -= (new_sample >> 3);
-			new_sample += m;
-		} else if (m < new_sample)
-			new_sample = m << 3;
+	/* If we sample in larger samples in the non-timestamp
+	 * case, we could grossly overestimate the RTT especially
+	 * with chatty applications or bulk transfer apps which
+	 * are stalled on filesystem I/O.
+	 *
+	 * Therefore, since we are going for a min-RTT estimate,
+	 * we do not smooth things out without timestamps.
+	 */
+	if (new_sample != 0 && !win_dep) {
+		m -= (new_sample >> 3);
+		new_sample += m;
 	} else {
-		/* No previous measure. */
+		/* No previous measure, or we're using timestamps.
+		 */
 		new_sample = m << 3;
 	}

-	if (tp->rcv_rtt_est.rtt != new_sample)
-		tp->rcv_rtt_est.rtt = new_sample;
+#if 0
+	if (tp->rcv_rtt_est.rtt < new_sample &&
+	    tp->rcvq_space.space <= 4*tp->advmss) {
+		printk(KERN_DEBUG "Increased min-RTT detected: %u\n", new_sample >> 3);
+	}
+#endif
+
+	/* Allow for increasing the min-RTT if the recv window
+	 * has shrunk to minimum, which might happen if path
+	 * characteristics worsen.
+	 */
+	if (!tp->rcv_rtt_est.rtt ||
+	    tp->rcv_rtt_est.rtt > new_sample ||
+	    tp->rcvq_space.space <= 4*tp->advmss)
+	{
+		if(new_sample > tp->rcv_rtt_est.rtt)
+			tp->rcv_rtt_est.rtt++;
+		else
+			tp->rcv_rtt_est.rtt = new_sample;
+	}
 }

 static inline void tcp_rcv_rtt_measure(struct tcp_sock *tp)
@@ -483,6 +497,46 @@ static inline void tcp_rcv_rtt_measure_t
 		tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rx_opt.rcv_tsecr, 0);
 }

+/* Piecewise linear approximation of logarithm. */
+static inline int tcp_log2(int in)
+{
+	int t = fls(in);
+	int m = in & ((1 << t) - 1);
+
+	if(t > 8)
+		return (t << 8) | (m >> (t-8));
+	else
+		return (t << 8) | (m << (8-t));
+}
+
+/* Inverse of the above. */
+static inline int tcp_exp2(int in)
+{
+	int t = in >> 8;
+	int m = in & 0xFF;
+
+	if(t <= 0)
+		return 0;
+
+	if(t > 8)
+		return (1 << t) | (m << (t-8));
+	else
+		return (1 << t) | (m >> (8-t));
+}
+
+/* Given a window that has been filled in some time, calculate from the
+ * bandwidth this implies an appropriate window size.
+ * Presently this is 64 * sqrt(bytes_per_sec).
+ */
+static inline int tcp_rcv_window_calc(int time, int space)
+{
+	int ltime = tcp_log2(time);
+	int lspace = tcp_log2(space);
+	int lbandwidth = lspace + tcp_log2(HZ) - ltime;
+	int lwindow = min((lbandwidth >> 1) + (6 << 8), lbandwidth - 1);
+	return tcp_exp2(lwindow);
+}
+
 /*
  * This function should be called every time data is copied to user space.
  * It calculates the appropriate TCP receive buffer space.
@@ -497,12 +551,22 @@ void tcp_rcv_space_adjust(struct sock *s
 		goto new_measure;

 	time = tcp_time_stamp - tp->rcvq_space.time;
-	if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+	space = tp->copied_seq - tp->rcvq_space.seq;
+
+	if (tp->rcv_rtt_est.rtt == 0 || space < tp->rcvq_space.space)
 		return;

-	space = 2 * (tp->copied_seq - tp->rcvq_space.seq);
+	/* Codename: BLACKPOOL (Vegas' distant cousin)
+	 *
+	 * We shrink the window if the throughput goes down
+	 * or the latency goes up.  We still grow the window
+	 * when appropriate, but don't change it too quickly.
+	 */

-	space = max(tp->rcvq_space.space, space);
+	space = tcp_rcv_window_calc(time, space);
+	space = min(space, tp->rcvq_space.space + (tp->rcvq_space.space >> 2));
+	space = max(space, tp->rcvq_space.space - (tp->rcvq_space.space >> 4));
+	space = max(space, 4 * tp->advmss);

 	if (tp->rcvq_space.space != space) {
 		int rcvmem;
@@ -525,13 +589,19 @@ void tcp_rcv_space_adjust(struct sock *s
 		while (tcp_win_from_space(rcvmem) < tp->advmss)
 			rcvmem += 128;
 		space *= rcvmem;
+
+		space = max(space, sysctl_tcp_rmem[0]);
 		space = min(space, sysctl_tcp_rmem[2]);
-		if (space > sk->sk_rcvbuf) {
-			sk->sk_rcvbuf = space;
-
-			/* Make the window clamp follow along. */
-			tp->window_clamp = new_clamp;
-		}
+#if 1
+		printk(KERN_DEBUG "New TCP recv window: %d (time=%d space=%d)\n",
+			new_clamp, time, tp->copied_seq - tp->rcvq_space.seq);
+#endif
+
+		sk->sk_rcvbuf = space;
+		tp->window_clamp = new_clamp;
+
+		/* tcp_clamp_window(sk); */
 	}
 }