From: "Patrick J. LoPresti"
To: bloat@lists.bufferbloat.net
Date: Wed, 17 Aug 2011 18:26:00 -0700
Subject: [Bloat] Not all the world's a WAN

Hello, BufferBloat crusaders. Permit me briefly to describe my application.

I have a rack full of Linux systems, all with 10GbE NICs tied together by a 10GbE switch. There are no routers or broader Internet connectivity. (At least, none that matters here.) Round-trip "ping" times between systems are 100 microseconds or so.

Some of the systems are "servers", some are "clients". Any single client may decide to slurp data from multiple servers. For example, the servers could be serving up a distributed file system, so when a client accesses a file striped across multiple servers, it tries to pull data from multiple servers simultaneously. (This is not my literal application, but it does represent the same access pattern.)

The purpose of my cluster is to process data sets measured in hundreds of gigabytes, as fast as possible. So, for my application:

 - Speed = Throughput (latency is irrelevant)
 - TCP retransmissions are a disaster, not least because
 - 200 ms is an eternity

The problem I have is this: At 10 gigabits/second, it takes very little time to overrun even a sizable buffer in a 10GbE switch. Although each client does have a 10GbE connection, it is reading multiple sockets from multiple servers, so over short intervals the switch's aggregate incoming bandwidth (multiple 10GbE links from servers) is larger than its outgoing bandwidth (a single 10GbE link to the client). Just two servers sending at line rate to one client already produce 10 gigabits/second (roughly 1.25 gigabytes/second) of excess, enough to fill a megabyte of switch buffer in under a millisecond. If the servers do not throttle themselves -- say, because the TCP windows are large -- packets overrun the switch's buffer and get lost.

I have "fixed" this problem by using a switch with a _large_ buffer, plus using TCP_WINDOW_CLAMP on the clients to ensure the TCP window never gets very large. This ensures that the servers never send so much data that they overrun the switch. And it is actually working great; I am able to saturate all of my 10GbE links with zero retransmissions. (A minimal sketch of the clamp follows below.)

I have not read all of the messages on this list, but I have read enough to make me a little nervous.
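For anyone curious how the clamp is applied, here is a minimal sketch of a client socket set up before connect(). The 64 KB value and the helper name are illustrative assumptions, not our tuned numbers; the real goal is to keep (servers per client) x (clamp) comfortably below the switch's shared buffer.

/* Minimal sketch, illustrative values only: clamp the TCP receive
 * window on a client socket before connecting, so no server can ever
 * have more than `clamp` bytes in flight toward this client. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int open_clamped_socket(const struct sockaddr_in *server)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int clamp = 64 * 1024;  /* upper bound on the advertised window, bytes (assumed) */
    if (setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                   &clamp, sizeof(clamp)) < 0) {
        close(fd);
        return -1;
    }

    if (connect(fd, (const struct sockaddr *)server, sizeof(*server)) < 0) {
        close(fd);
        return -1;
    }
    return fd;  /* ready for recv(); the window never grows past the clamp */
}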
And thus I send this message in the hope that, in your quest to slay the "buffer bloat" dragon, you do not completely forget applications like mine. I would hate to have to switch to InfiniBand or whatever just because everyone decided that Web browsers are the only TCP/IP application in the world.

Thanks for reading.

 - Pat