Subject: [Bloat] Mitigating bufferbloat at the receiver
From: Jonathan Morton @ 2011-03-10 23:29 UTC
  To: bloat


So far I've seen plenty of talk about removing bufferbloat on the sending side (TCP congestion window, device/driver buffers) and in the network (AQM, ECN).  These are all good, but I'd like to talk about what we can do today at the receiver.

Once upon a time, I spent a few months living in the back of my parents' house.  By British standards, this was in the middle of absolutely nowhere, and the phone line quality was *dreadful*.  To avoid my analogue modem dropping out more than once an hour, or spending painfully long periods retraining against a type of noise that retraining really wasn't designed to solve, I had to force it all the way down to 4800 bps - the lowest speed available without dropping down to ancient modulations that had no robustness designed in.

Needless to say, at ~5 kbps the bufferbloat problem in the ISP's modem bank was pretty bad - and this was in about 2003, so anyone who says this is a new problem is either lying or ignorant.  I soon got fed up enough to tune Linux's receive window limit down to about 4 packets, which was still several seconds' worth but allowed me to use the connection for more than one thing at once - handy when I wanted to experiment with Gentoo.
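
For the record, a rough per-socket equivalent of that tweak (a sketch, not the system-wide knob I actually used - SO_RCVBUF bounds the window a socket will advertise, and has to be set before connect() to influence the window scale negotiated on the SYN):

/* Sketch: clamp one socket's receive buffer to about four full-size
 * packets, which in turn caps the window TCP will advertise.  Linux
 * doubles the value internally for bookkeeping overhead, but the
 * advertised window remains bounded by it. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int make_clamped_socket(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int bytes = 4 * 1500;	/* about four MTU-sized packets */

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		perror("setsockopt(SO_RCVBUF)");
	return fd;
}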

Incidentally, I was already using a form of split-TCP, in that I had installed a web cache on the modem-equipped machine.  This meant that I only had to tune one TCP stack in order to get the vast majority of the benefits.  BitTorrent hadn't taken off at that point, so Web and FTP traffic were the main bandwidth users, and everything else was interactive and didn't need tuning.  Meanwhile, for uploads I turned on SFQ and left it at that, all the while wondering why ISPs didn't use SFQ in their modem banks.

Without receive window scaling, I didn't see any particular problem with congestion control per se.  I just saw that TCP was opening the congestion window to match the receive window, which was tuned for LANs and clean phone lines at that time, and that interactive packets had to wait behind the bulk traffic.  SFQ would have reduced the latency for interactive traffic and setting up new connections, while bulk traffic would still work as well as before.

Much more recently, I have more than once had to spend extended periods using a 3G modem (well, a tethered Nokia phone) as my primary connection (thank goodness that here in Finland, unlimited connections are actually available).  I soon discovered that the problem I had seen at 5kbps was just as bad at 500kbps.

By now receive window scaling was enabled by default in Linux, so the receive window and congestion window were both growing until they hit the end of the buffer in the 3G base station - which proved to be something like 30 seconds deep.  Furthermore, interactive traffic was *still* waiting behind the bulk traffic, indicating a plain drop-tail queue.

Read that again.  THIRTY SECONDS of latency under a single bulk TCP.
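
In concrete terms (a back-of-the-envelope sketch; the 500 kbps figure is a round number for the link rate, not a measurement):

/* How much data does a 30-second queue hold at 3G speeds? */
#include <stdio.h>

int main(void)
{
	double bytes_per_sec = 500e3 / 8;	/* assumed 500 kbps link */
	double queue_secs    = 30.0;
	double queue_bytes   = bytes_per_sec * queue_secs;

	printf("queue holds %.0f bytes (~%.1f MB, ~%.0f full-size packets)\n",
	       queue_bytes, queue_bytes / 1e6, queue_bytes / 1500);
	return 0;
}

That works out to nearly 2 MB of data queued ahead of every interactive packet.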

The practical effect of this was that I could watch the progress of Gentoo downloading source packages, and it would proceed smoothly at line speed for a while, and then abruptly stop - due to a dropped packet.  Many seconds later, it would suddenly jump ahead, whereupon it would usually continue smoothly for a little longer before abruptly stopping again.  This was while using a geographically local mirror.

I quickly concluded that since ISPs were *still* not using anything like SFQ despite the enormous cost of 3G base equipment, they were simply as dumb as rocks - and the lack of ECN also confirmed it.  So I started poking around to see what I could do about it, half-remembering the effect of cutting the receive window years before.

I've attached the kernel patch that I came up with, and which I've been running on my gateway box ever since (even though I have my ADSL back).  It measures the actual bandwidth of the flow (based on RTT and window size) and calculates an appropriate window size, which it then steps towards incrementally.  This got me down to about 1 second of buffering on the 3G link, which I considered basically acceptable (by comparison).  At higher bandwidths the induced latency is lower - or, to put it another way, at lower latencies more of the available bandwidth becomes usable.  The acceptable latency is also capped at 2 seconds as a safety valve.
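
In rough terms, the sizing policy is this (a floating-point restatement of the fixed-point code in the patch; the function and variable names here are mine, and the one-second cap is my reading of the min() in tcp_rcv_window_calc):

#include <math.h>

/* space:  bytes received over the last measurement interval
 * time_s: length of that interval, in seconds
 * old:    current receive window target, in bytes
 * mss:    the connection's MSS, in bytes */
static int next_rcv_window(double space, double time_s, double old, int mss)
{
	double bw = space / time_s;	/* measured bytes per second */
	double target = 64.0 * sqrt(bw);

	if (target > bw)		/* at most ~1 second of data */
		target = bw;
	if (target > old + old / 4)	/* grow by at most 1/4 per step */
		target = old + old / 4;
	if (target < old - old / 16)	/* shrink by at most 1/16 per step */
		target = old - old / 16;
	if (target < 4.0 * mss)		/* but never below four packets */
		target = 4.0 * mss;
	return (int)target;
}

The 2-second safety valve is applied separately and isn't shown in this sketch.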

As it happens, 2 seconds of latency is pretty much the maximum for acceptable TCP setup performance, because the initial RTO for TCP is supposed to be 3 seconds.  With 30 seconds of latency, TCP will *always* retransmit several times during the setup phase, even if no packets are actually lost.  So that's something to tell your packet-loss-obsessed router chums: forget packet loss, minimise retransmits - and explain why packet loss and retransmits are not synonymous!
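
The arithmetic, as a sketch (the classic 3-second initial RTO, doubling on each timeout):

/* How many spurious retransmits does a 30-second queue cause during
 * connection setup?  The reply cannot arrive before the queueing
 * delay has elapsed, so every timeout that fires earlier is wasted. */
#include <stdio.h>

int main(void)
{
	double rto = 3.0, elapsed = 0.0, queue_delay = 30.0;
	int retransmits = 0;

	while (elapsed + rto < queue_delay) {
		elapsed += rto;
		rto *= 2;
		printf("spurious retransmit %d at t=%.0fs\n",
		       ++retransmits, elapsed);
	}
	printf("%d spurious retransmits before a reply at ~t=%.0fs\n",
	       retransmits, queue_delay);
	return 0;
}

Three wasted retransmits (at 3, 9 and 21 seconds) for a connection that loses no packets at all.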

 - Jonathan


[-- Attachment: blackpool.patch --]

diff -urp linux-2.6.28.7/net/ipv4/tcp_input.c linux-2.6.28.7-blackpool/net/ipv4/tcp_input.c
--- linux-2.6.28.7/net/ipv4/tcp_input.c	2009-02-21 00:41:27.000000000 +0200
+++ linux-2.6.28.7-blackpool/net/ipv4/tcp_input.c	2009-10-01 00:13:32.000000000 +0300
@@ -70,6 +70,7 @@
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <net/netdma.h>
+#include <linux/bitops.h>
 
 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
@@ -435,29 +436,42 @@ static void tcp_rcv_rtt_update(struct tc
 	if (m == 0)
 		m = 1;
 
-	if (new_sample != 0) {
-		/* If we sample in larger samples in the non-timestamp
-		 * case, we could grossly overestimate the RTT especially
-		 * with chatty applications or bulk transfer apps which
-		 * are stalled on filesystem I/O.
-		 *
-		 * Also, since we are only going for a minimum in the
-		 * non-timestamp case, we do not smooth things out
-		 * else with timestamps disabled convergence takes too
-		 * long.
-		 */
-		if (!win_dep) {
-			m -= (new_sample >> 3);
-			new_sample += m;
-		} else if (m < new_sample)
-			new_sample = m << 3;
+	/* If we sample in larger samples in the non-timestamp
+	 * case, we could grossly overestimate the RTT especially
+	 * with chatty applications or bulk transfer apps which
+	 * are stalled on filesystem I/O.
+	 *
+	 * Therefore, since we are going for a min-RTT estimate,
+	 * we do not smooth things out without timestamps.
+	 */
+	if (new_sample != 0 && !win_dep) {
+		m -= (new_sample >> 3);
+		new_sample += m;
 	} else {
-		/* No previous measure. */
+		/* No previous measure, or we're using timestamps. */
 		new_sample = m << 3;
 	}
 
-	if (tp->rcv_rtt_est.rtt != new_sample)
-		tp->rcv_rtt_est.rtt = new_sample;
+#if 0
+	if (tp->rcv_rtt_est.rtt < new_sample &&
+	    tp->rcvq_space.space <= 4*tp->advmss) {
+		printk(KERN_DEBUG "Increased min-RTT detected: %u\n", new_sample >> 3);
+	}
+#endif
+
+	/* Allow for increasing the min-RTT if the recv window
+	 * has shrunk to minimum, which might happen if path
+	 * characteristics worsen.
+	 */
+	if (!tp->rcv_rtt_est.rtt ||
+	     tp->rcv_rtt_est.rtt > new_sample ||
+	     tp->rcvq_space.space <= 4*tp->advmss)
+	{
+		if (new_sample > tp->rcv_rtt_est.rtt)
+			tp->rcv_rtt_est.rtt++;
+		else
+			tp->rcv_rtt_est.rtt = new_sample;
+	}
 }
 
 static inline void tcp_rcv_rtt_measure(struct tcp_sock *tp)
@@ -483,6 +497,46 @@ static inline void tcp_rcv_rtt_measure_t
 		tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rx_opt.rcv_tsecr, 0);
 }
 
+/* Piecewise-linear approximation of base-2 logarithm, in 8.8 fixed point. */
+static inline int tcp_log2(int in)
+{
+	int t = fls(in);
+	int m = in & ((1 << t) - 1);
+
+	if (t > 8)
+		return (t << 8) | (m >> (t-8));
+	else
+		return (t << 8) | (m << (8-t));
+}
+
+/* Approximate inverse of the above. */
+static inline int tcp_exp2(int in)
+{
+	int t = in >> 8;
+	int m = in & 0xFF;
+
+	if (t <= 0)
+		return 0;
+
+	if (t > 8)
+		return (1 << t) | (m << (t-8));
+	else
+		return (1 << t) | (m >> (8-t));
+}
+
+/* Given that 'space' bytes have been received over 'time' jiffies,
+ * estimate the bandwidth this implies and calculate an appropriate
+ * window size.  Presently this is 64 * sqrt(bytes_per_sec).
+ */
+static inline int tcp_rcv_window_calc(int time, int space)
+{
+	int ltime  = tcp_log2(time);
+	int lspace = tcp_log2(space);
+	int lbandwidth = lspace + tcp_log2(HZ) - ltime;
+	int lwindow    = min((lbandwidth >> 1) + (6 << 8), lbandwidth - 1);
+	return tcp_exp2(lwindow);
+}
+
 /*
  * This function should be called every time data is copied to user space.
  * It calculates the appropriate TCP receive buffer space.
@@ -497,12 +551,22 @@ void tcp_rcv_space_adjust(struct sock *s
 		goto new_measure;
 
 	time = tcp_time_stamp - tp->rcvq_space.time;
-	if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+	space = tp->copied_seq - tp->rcvq_space.seq;
+
+	if (tp->rcv_rtt_est.rtt == 0 || space < tp->rcvq_space.space)
 		return;
 
-	space = 2 * (tp->copied_seq - tp->rcvq_space.seq);
+	/* Codename: BLACKPOOL (Vegas' distant cousin)
+	 *
+	 * We shrink the window if the throughput goes down
+	 * or the latency goes up.  We still grow the window
+	 * when appropriate, but don't change it too quickly.
+	 */
 
-	space = max(tp->rcvq_space.space, space);
+	space = tcp_rcv_window_calc(time, space);
+	space = min(space, tp->rcvq_space.space + (tp->rcvq_space.space >> 2));
+	space = max(space, tp->rcvq_space.space - (tp->rcvq_space.space >> 4));
+	space = max(space, 4 * tp->advmss);
 
 	if (tp->rcvq_space.space != space) {
 		int rcvmem;
@@ -525,13 +589,19 @@ void tcp_rcv_space_adjust(struct sock *s
 			while (tcp_win_from_space(rcvmem) < tp->advmss)
 				rcvmem += 128;
 			space *= rcvmem;
+
+			space = max(space, sysctl_tcp_rmem[0]);
 			space = min(space, sysctl_tcp_rmem[2]);
-			if (space > sk->sk_rcvbuf) {
-				sk->sk_rcvbuf = space;
 
-				/* Make the window clamp follow along.  */
-				tp->window_clamp = new_clamp;
-			}
+#if 1
+			printk(KERN_DEBUG "New TCP recv window: %d (time=%d space=%d)\n",
+				new_clamp, time, tp->copied_seq - tp->rcvq_space.seq);
+#endif
+
+			sk->sk_rcvbuf = space;
+			tp->window_clamp = new_clamp;
+
+			/* tcp_clamp_window(sk); */
 		}
 	}
 
