From: Jonathan Morton
Date: Mon, 16 May 2011 03:31:41 +0300
To: Fred Baker
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] Jumbo frames and LAN buffers (was: RE: Burst Loss)

On 15 May, 2011, at 11:49 pm, Fred Baker wrote:

> On May 15, 2011, at 11:28 AM, Jonathan Morton wrote:
>> The fundamental thing is that the sender must be able to know when sent frames can be flushed from the buffer because they don't need to be retransmitted. So if there's a NACK, there must also be an ACK - at which point the ACK serves the purpose of the NACK, as it does in TCP. The only alternative is a wall-time TTL, which is doable on single hops but requires careful design.
>
> To a point. NORM holds a frame for possible retransmission for a stated period of time, and if retransmission isn't requested in that interval forgets it. So the ack isn't actually necessary; what is necessary is that the retention interval be long enough that a nack has a high probability of succeeding in getting the message through.

Okay, so because it can fall back to TCP's retransmit, the retention requirements can be relaxed.
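To make the retention-timer idea concrete, here is a rough sketch of a sender-side buffer that forgets frames once their interval expires (plain Python, invented names, not NORM's actual API; the 2-second interval is just an assumed multiple of the link RTT):

    import time

    RETENTION_SECONDS = 2.0   # assumed value, a few multiples of the link RTT

    class RetentionBuffer:
        """Hold sent frames only for a fixed retention interval.

        No ACKs needed: a frame is kept just long enough that a NACK has a
        high probability of arriving before the frame is forgotten.
        """

        def __init__(self, retention=RETENTION_SECONDS):
            self.retention = retention
            self.frames = {}  # seq -> (payload, expiry time)

        def record_sent(self, seq, payload):
            self.frames[seq] = (payload, time.monotonic() + self.retention)

        def on_nack(self, seq):
            """Return the payload to retransmit, or None if already forgotten."""
            entry = self.frames.get(seq)
            return entry[0] if entry else None

        def expire(self):
            """Forget frames whose retention interval has passed."""
            now = time.monotonic()
            self.frames = {s: e for s, e in self.frames.items() if e[1] > now}

If a NACK arrives after the frame has been forgotten, recovery falls back to the end-to-end (TCP) retransmit - which is why the retention interval can be relaxed.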
>> ...recent versions of Ethernet *do* support a throttling feedback mechanism, and this can and should be exploited to tell the edge host or router that ECN *might* be needed. Also, with throttling feedback throughout the LAN, the Ethernet can for practical purposes be treated as almost-reliable. This is *better* in terms of packet loss than ARQ or NACK, although if the Ethernet's buffers are large, it will still increase delay. (With small buffers, it will just decrease throughput to the capacity, which is fine.)
>
> It increases the delay anyway. It just pushes the retention buffer to another place. What do you think the packet is doing during the "don't transmit" interval?

Most packets delayed by Ethernet throttling would, with small buffers, end up waiting in the sending host (or router). They thus spend more time in a potentially active queue instead of in a dumb one. But even if the host queue is dumb, the overall delay is no worse than with the larger Ethernet buffers.

> Throughput never exceeds capacity. If I have a 10 Gbps link, I will never get more than 10 Gbps through it. Buffer fill rate is statistically predictable. With small buffers, the fill rate achieves the top sooner. They increase the probability that the buffers are full, which is to say the drop probability. Which puts us back to end-to-end retransmission, which is the worst case of what you were worried about.

Let's suppose someone has generously provisioned an office with GigE throughout, using a two-level hierarchy of switches. Some dumb schmuck then schedules every single computer to run its backups (to a single fileserver) at the same time. That's, say, 100 computers all competing for one GigE link to the fileserver. If the switches are fair, each computer should get 10 Mbps - that's the capacity.

With throttling, each computer sees the link closed 99% of the time. It can send at link rate for the remaining 1% of the time. On medium timescales, that looks like a 10 Mbps bottleneck at the first link. So the throughput on that link equals the capacity, and hopefully the goodput matches it. The only queue that is likely to overflow is the one on the sending computer, and one would hope there is enough feedback in a host's own TCP/IP stack to prevent that.

Without throttling but with ARQ, NACK or whatever you want to call it, the host has no signal to tell it to slow down - so the throughput on the edge link is more than 10 Mbps (but the goodput will be less). The buffer in the outer switch fills up - no matter how big or small it is - and starts dropping packets. The switch then won't ask for retransmission of packets it has just dropped, because it has nowhere to put them. The same process then repeats at the inner switch. Finally, the server sees the missing packets and asks for the retransmission - but these requests have to be switched all the way back to the clients, because the missing packets aren't in the switches' buffers. It's therefore no better than a TCP SACK retransmission.

So there you have a classic congested network scenario in which throttling solves the problem, but link-level retransmission can't.
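For concreteness, here is the arithmetic behind that scenario as a back-of-the-envelope sketch (the duty-cycle view of PAUSE throttling is a deliberate simplification, and the numbers are just the ones above):

    LINK_RATE_MBPS = 1000     # GigE everywhere, including the fileserver's uplink
    CLIENTS = 100             # computers all running backups at once

    # With fair switches, each client's share of the single GigE link:
    fair_share_mbps = LINK_RATE_MBPS / CLIENTS        # 10 Mbps

    # PAUSE-style throttling achieves that share by closing the link most of
    # the time; the fraction of time each client may transmit at line rate:
    open_fraction = fair_share_mbps / LINK_RATE_MBPS  # 0.01 -> open 1%, closed 99%

    print(f"{fair_share_mbps:.0f} Mbps per client; "
          f"link open {open_fraction:.0%}, closed {1 - open_fraction:.0%} of the time")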
Where ARQ and/or NACK come in handy is where the link itself is unreliable, such as on WLANs (hence the use in amateur radio) and last-mile links. In that case, the reason for the packet loss is not a full receive buffer, so asking for a retransmission is not inherently self-defeating.

> I'm not going to argue against letting retransmission go end to end; it's an endless debate. I'll simply note that several link layers, including but not limited to those you mention, find that applications using them work better if there is a high probability of retransmission in an interval on the order of the link RTT as opposed to the end-to-end RTT. You brought up data centers (aka variable delays in LAN networks); those have been heavily the province of Fibre Channel, which is a link-layer protocol with retransmission. Think about it.

What I'd like to see is a complete absence of need for retransmission on a properly built wired network. Obviously the capability still needs to be there to cope with the parts that aren't properly built or aren't wired, but TCP can do that. Throttling (in the form of Ethernet PAUSE) is simply the third possible method of signalling congestion in the network, alongside delay and loss - and it happens to be quite widely deployed already.

 - Jonathan