From: "Richard Scheffenegger"
To: "richard", "bloat"
Date: Fri, 11 Feb 2011 20:23:13 +0100
Subject: Re: [Bloat] Failure to convince

Some comments:

a) I can confirm the rumor about data center switches getting more and more buffering - SRAM is "dead cheap" (or at least the cost can be argued away), and as far as I know, switches with multi-gigabyte buffers (10/40GE) are in the pipeline at numerous vendors.

b) Even a wire-speed switch will eventually have to buffer (or drop) packets. As soon as you don't operate a network with strictly 1:1 connectivity (each host talking only to one other host - an obviously unrealistic scenario) but with multiple hosts potentially talking to the same host at the same time, even your perfect wire-speed switch would need to send out up to twice the bandwidth of the link to the receiving end - which it cannot, so it must buffer or drop. For some reason, people don't appear to think about that scenario, even though it is the most common one in networking (if all your connectivity were 1:1, why would you need a switch in the first place?).

Furthermore, emerging protocols in the data center, such as pNFS (NFSv4.1), will cause many more boundary-synchronized data streams. That typically leads to incast / burst drops (multiple senders overloading the egress port of a single receiver, and TCP reacting badly to lost retransmissions). More buffering helps somewhat, but drives up latency - as you know. Nevertheless, there are many papers about incast in data center networks where tuning TCP timing parameters and adding more buffering are presented as the mitigation strategy.
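To put rough numbers on b) - purely illustrative figures, not measurements from any particular switch: if N synchronized senders each burst at line rate toward a single receiver, the egress port has to queue (or drop) roughly N-1 bursts' worth of data, no matter how fast the switch fabric is. A quick sketch in Python, assuming 10GE links and a 256 KB burst per sender:

# Sketch of incast buffer demand: N senders burst simultaneously at line
# rate toward one receiver behind an egress port of the same line rate.
# All numbers here are illustrative assumptions, not measurements.

LINE_RATE_BPS = 10e9          # 10GE links on both the senders and the receiver
BURST_BYTES   = 256 * 1024    # per-sender burst, e.g. one boundary-synchronized block

def egress_backlog(n_senders: int) -> float:
    """Bytes the egress port must queue (or drop) during one synchronized burst."""
    total_offered = n_senders * BURST_BYTES
    # While the bursts arrive, the egress port drains at line rate only;
    # everything above that has to sit in the buffer (or be dropped).
    burst_duration = BURST_BYTES * 8 / LINE_RATE_BPS          # seconds
    drained_meanwhile = LINE_RATE_BPS / 8 * burst_duration    # bytes sent in that time
    return total_offered - drained_meanwhile

for n in (2, 8, 32, 64):
    backlog = egress_backlog(n)
    drain_ms = backlog * 8 / LINE_RATE_BPS * 1e3
    print(f"{n:3d} senders: ~{backlog / 1e6:6.2f} MB queued, "
          f"~{drain_ms:5.2f} ms extra latency until it drains")

The backlog grows linearly with the number of synchronized senders, so "wire speed" forwarding doesn't remove the choice between buffering and dropping - it only concentrates it at the egress port, and whatever gets buffered shows up as latency for everything behind it in the queue.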
c) That IETF-standards-compliant NewReno TCP cannot fully utilize a high-speed link when there is 1% packet loss (the sum of the loss across the intermediate links) is quite old news. However, even TCP stacks have moved on. For a couple of years now, SACK (mostly with a poor man's FACK) has been used almost ubiquitously (about 85-90% of all TCP sessions negotiate SACK today - more than negotiate timestamps - according to the last stats I know of). But even IETF-compliant SACK is very conservative, and doesn't use all the information obtained through the signalling to make best use of the available bandwidth. "Fortunately", non-standards-compliant stacks based on Linux are increasingly deployed on the server end, and the Linux TCP stack is much better at delivering nearly optimal goodput vs. throughput. Most notable, and easiest to detect remotely, is Linux's ability to retransmit lost retransmissions under most circumstances; all other stacks require the retransmission timeout timer to fire to recover from a lost retransmission. Only a few minor corner cases could still be tweaked without additional / modified signalling (e.g. SACK recovery at end-of-stream behaves like Reno, not like NewReno -> RTO recovery needed).

With modified signalling (timestamps), not only would uni-directional latency-variation measurements become possible (see Chirp-TCP, LEDBAT, µTP), but an even better loss recovery strategy than Linux's (not talking about IETF-compliant stacks, which are way behind) would become feasible - one that could recover lost retransmissions even sooner, within a single RTT, where Linux currently needs about 2-3 RTTs to unambiguously detect, and then recover from, a lost retransmission. Again, the key here is RTT - bufferbloat artificially inflates this servo feedback loop, making improvements in the loss recovery strategy not very meaningful.

d) If he is so concerned about packet loss, why hasn't he deployed ECN then - for explicit marking of the flows that cause buffers to grow? That's a 10+ year old standard, almost as old as SACK. With ECN, one can have one's cake and eat it too - AND widespread ECN marking (at least on congested edge networks plus L2 switches (!)) would allow more innovation in transport protocols (read about Re-ECN and DCTCP). ECN would allow a closer coupling between the network and the edge devices, while still keeping the complexity of the main control loop in the edge - the foundation that allowed the internet to prosper.
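To make the "mark instead of drop" point concrete, here is a minimal sketch of the enqueue decision an ECN-capable queue makes. This is my own illustration - the threshold, the Packet type and the field names are made up, it is not any particular vendor's AQM:

# ECN sketch: once the queue grows past a marking threshold, ECN-capable
# packets (ECT codepoint set) get a Congestion Experienced mark instead of
# being dropped; only non-ECT traffic still has to be dropped to signal
# congestion. Threshold and field names are assumptions for illustration.

from collections import deque
from dataclasses import dataclass

MARK_THRESHOLD = 30     # packets queued before we start signalling congestion
QUEUE_LIMIT    = 100    # hard limit: beyond this even ECT packets are dropped

@dataclass
class Packet:
    ect: bool           # sender negotiated ECN (ECT(0)/ECT(1) set in the IP header)
    ce: bool = False    # Congestion Experienced mark

queue = deque()         # FIFO of Packet objects

def enqueue(pkt: Packet) -> bool:
    """Return True if the packet was queued, False if it had to be dropped."""
    if len(queue) >= QUEUE_LIMIT:
        return False                    # queue genuinely full: tail drop
    if len(queue) >= MARK_THRESHOLD:
        if pkt.ect:
            pkt.ce = True               # tell the sender to back off, keep the data
        else:
            return False                # non-ECN flow: dropping is the only signal left
    queue.append(pkt)
    return True

The receiver echoes the CE mark back to the sender (the ECE flag), and the sender reduces its window just as it would after a loss - the queue stays short, nothing has to be retransmitted, and nobody files a packet-loss complaint. On the Linux end hosts, negotiating ECN is a one-line sysctl (net.ipv4.tcp_ecn); what is missing is marking support being switched on in the network gear.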
Best regards,
  Richard

----- Original Message -----
From: "richard"
To: "bloat"
Sent: Friday, February 11, 2011 5:29 PM
Subject: [Bloat] Failure to convince

> I had an email exchange yesterday with the top routing person at a local
> ISP. Unlike my exchanges with non-tech people, this one ended
> with him saying bufferbloat was not a problem because...
>
> "I for one never want to see packet loss. I spent several years
> working on a national US IP network, and it was nothing but complaints
> from customers about 1% packet loss between two points. Network
> engineers hate packet loss, because it generates so many complaints.
> And packet loss punishes TCP more than deep buffers.
>
> So I'm sure that you can find a bunch of network engineers who think
> big buffers are bad. But the trend in network equipment in 2010 and
> 2011 has been even deeper buffers. Vendors started shipping data
> centre switches with over 700MB of buffer space. Large buffers are
> needed to flatten out microbursts. But these are also intelligent
> buffers."
>
> His point about network people hating packet loss points up the problem
> we'll have with educating them and the purchasing public that at least
> some loss is necessary for TCP to function.
>
> Not having been in charge of a major backbone recently, I have to admit
> that my understanding of today's switching hardware was that it can
> deal with everything "at wire speed" with cut-through switching, unlike
> the store-and-forward typical of switches and routers at the consumer
> level.
>
> richard
>
> --
> Richard C. Pitt         Pacific Data Capture
> rcpitt@pacdat.net       604-644-9265
> http://digital-rag.com  www.pacdat.net
> PGP Fingerprint: FCEF 167D 151B 64C4 3333 57F0 4F18 AF98 9F59 DD73
>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat