[Bloat] Failure to convince
Richard Scheffenegger
rscheff at gmx.at
Fri Feb 11 14:23:13 EST 2011
some comments:
a) I can confirm the rumor about data center switches getting more and more
buffering - SRAM is "dead cheap" (or at least its cost can be argued away),
and afaik switches with multi-gigabyte buffers (10/40GE) are in the
pipelines of numerous vendors.
b) even a wire-speed switch will eventually have to buffer (or drop)
packets. As soon as you don't operate a network with strict 1:1
connectivity (each single host talks only to one other host - as you can
see, quite an unrealistic scenario), but with multiple hosts potentially
talking at the same time to the same host, even your perfect wire-speed
switch will need to send out up to twice the bandwidth of the link to the
receiving end...
For some reason, people don't appear to think about that scenario - which
is actually the most common scenario in networking (if all your
connectivity were 1:1, why would you need a switch in the first place...).
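To make that concrete, here is a back-of-the-envelope sketch in Python.
All numbers (link speed, sender count, burst length) are illustrative
assumptions, not measurements:

  # Two hosts bursting at line rate into one egress port: the excess has
  # to queue (or be dropped), even though no single link exceeds wire speed.
  LINK_GBPS = 10        # all links 10GE (assumption)
  SENDERS = 2           # two hosts bursting to the same receiver
  BURST_MS = 1.0        # both burst for 1 ms simultaneously

  ingress = SENDERS * LINK_GBPS                        # 20 Gbit/s arriving
  egress = LINK_GBPS                                   # 10 Gbit/s leaving
  excess_gbit = (ingress - egress) * BURST_MS / 1000.0 # must be buffered
  print(f"queued during the burst: {excess_gbit * 1e9 / 8 / 1e6:.2f} MB")
  # -> 1.25 MB of buffer needed for a single 1 ms two-sender burst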
Furthermore, emerging protocols in the data center, such as pNFS (NFSv4.1),
will cause many more boundary-synchronized data streams. That typically
leads to Incast / burst drops (multiple senders overloading the egress port
of a single receiver, with TCP reacting badly to lost retransmissions...).
More buffering helps somewhat, but drives up latency - as you know.
Nevertheless, there are many papers about Incast / data center networks
where tuning TCP timing parameters and adding more buffering are presented
as mitigation strategies.
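The buffering-vs-latency trade-off is easy to quantify: every byte of
occupied buffer adds serialization delay at the egress link. A small
sketch (the 700 MB figure anticipates the switch quote below; the rest are
assumptions):

  # Added latency of a fully occupied buffer draining over a 10GE port.
  LINK_BYTES_PER_S = 10e9 / 8          # 10GE egress in bytes/s
  for buf_mb in (1, 16, 128, 700):
      delay_ms = buf_mb * 1e6 / LINK_BYTES_PER_S * 1000
      print(f"{buf_mb:>4} MB buffer -> up to {delay_ms:8.2f} ms queueing delay")
  # a full 700 MB buffer on a 10GE port adds over half a second of latency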
c) IETF-standards-compliant NewReno TCP not being able to utilize a high
speed link when there is 1% packet loss (the sum of the loss at different
intermediate links) is quite old news. However, even TCP stacks have moved
on. For a couple of years now, SACK (mostly with poor man's FACK) has been
used almost ubiquitously (about 85-90% of all TCP sessions negotiate SACK
today - more than use timestamps - according to the last stats I know of).
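The 1%-loss claim is easy to sanity-check with the well-known Mathis et
al. approximation for Reno-style TCP, throughput <= (MSS/RTT) * C/sqrt(p)
with C ~ 1.22. MSS and RTT below are assumptions for illustration:

  import math

  MSS = 1460     # bytes, typical Ethernet payload (assumption)
  RTT = 0.05     # 50 ms round trip (assumption)
  p = 0.01       # 1% packet loss, as in the scenario above

  bps = (MSS * 8 / RTT) * 1.22 / math.sqrt(p)
  print(f"~{bps / 1e6:.1f} Mbit/s")   # ~2.8 Mbit/s - nowhere near 10GE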
But even IETF-compliant SACK is very conservative, and doesn't utilize all
the information obtained through the signalling to make best use of the
available bandwidth. "Fortunately", non-standards-compliant stacks based
on Linux are deployed increasingly on the server end, and the Linux TCP
stack is much better at delivering nearly optimal goodput vs. throughput...
Most notably, and easiest to detect remotely, is the capability of Linux to
retransmit lost retransmissions under most circumstances. All other stacks
require the retransmission timeout (RTO) timer to fire to recover from a
lost retransmission... Only a few minor corner cases could be tweaked
without additional / modified signalling (i.e. SACK recovery at
end-of-stream behaves like Reno, not like NewReno -> RTO recovery needed).
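For reference, the relevant Linux knobs can be inspected directly; this
little check only reads the standard /proc paths and degrades gracefully
on kernels that lack one of them:

  # Which of the recovery-related TCP options discussed above are enabled?
  for knob in ("tcp_sack", "tcp_fack", "tcp_timestamps"):
      try:
          with open(f"/proc/sys/net/ipv4/{knob}") as f:
              state = "on" if f.read().strip() != "0" else "off"
      except FileNotFoundError:
          state = "not available on this kernel"
      print(f"{knob}: {state}")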
With modified signalling (timestamps), not only would uni-directional
latency variation measurements become possible (see Chirp-TCP, LEDBAT,
µTP), but an even better loss recovery strategy (than Linux's; not talking
about IETF-compliant stacks, which are way behind) would be feasible, one
which could recover lost retransmissions even sooner than Linux - with
only 1 RTT; Linux currently requires about 2-3 RTTs to unambiguously
detect, and then recover from, lost retransmissions.
Again, the key here is RTT - bufferbloat artificially inflates this servo
feedback loop, making improvements in the loss recovery strategy much less
meaningful.
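The core LEDBAT/µTP idea mentioned above fits in a few lines: with a
(sender timestamp, receiver timestamp) pair per packet, one-way delay
*variation* is measurable even with unsynchronized clocks, because the
unknown clock offset cancels out. A minimal sketch, names and sample data
purely illustrative:

  BASE_HISTORY = 10

  def owd_variation(samples):
      """samples: list of (send_ts, recv_ts) in seconds; clocks may be offset."""
      owds = [recv - send for send, recv in samples]
      base = min(owds[-BASE_HISTORY:])  # running minimum ~ uncongested
                                        # one-way delay + clock offset
      return owds[-1] - base            # queueing delay estimate; offset cancels

  # a growing queue shows up as rising variation (here: +2 ms per sample),
  # even though the absolute one-way delay is unknowable:
  samples = [(t, t + 100.0 + 0.002 * t) for t in range(20)]
  print(f"estimated queueing delay: {owd_variation(samples) * 1000:.1f} ms")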
d) If he is so concerned about packet loss, why hasn't he deployed ECN
then - for explicit marking of flows that cause buffers to grow? That's a
10+ year old standard, almost as old as SACK. With ECN, one can have one's
cake and eat it too - AND widespread ECN marking (at least on congested
edge networks plus L2 switches!) would allow more innovation with
transport protocols (read about Re-ECN and DCTCP). ECN would allow a
closer coupling between the network and the edge devices, while still
keeping the complexity of the main control loop in the edge - the
foundation that allowed the internet to prosper.
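On Linux, the per-host ECN switch is the net.ipv4.tcp_ecn sysctl (0 = off,
1 = request ECN on connections, 2 = accept it when the peer asks). A quick
read-only check:

  with open("/proc/sys/net/ipv4/tcp_ecn") as f:
      mode = int(f.read().strip())
  print({0: "ECN disabled",
         1: "ECN requested on outgoing connections",
         2: "ECN accepted when requested by the peer"}.get(mode, f"mode {mode}"))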
Best regards,
Richard
----- Original Message -----
From: "richard" <richard at pacdat.net>
To: "bloat" <bloat at lists.bufferbloat.net>
Sent: Friday, February 11, 2011 5:29 PM
Subject: [Bloat] Failure to convince
>I had an email exchange yesterday with the top routing person at a local
> ISP. Unlike my exchanges with non-tech people, this one ended
> with him saying Bufferbloat was not a problem because...
>
> "I for for one never want to see packet loss. I spent several years
> working on a national US IP network, and it was nothing but complaints
> from customers about 1% packet loss between two points. Network
> engineers hate packet loss, because it generates so many complaints.
> And packet loss punishes TCP more than deep buffers.
>
> So I'm sure that you can find a bunch of network engineers who think
> big buffers are bad. But the trend in network equipment in 2010 and
> 2011 has been even deeper buffers. Vendors started shipping data
> centre switches with over 700MB of buffer space. Large buffers are
> needed to flatten out microbursts. But these are also intelligent
> buffers."
>
> His point about network people hating packet loss highlights the problem
> we'll have educating them and the purchasing public that at least some
> loss is necessary for TCP to function.
>
> Not having been in charge of a major backbone recently, I have to admit
> my understanding was that today's switching hardware deals with
> everything "at wire speed" via cut-through switching, unlike the
> store-and-forward switches and routers typical at the consumer
> level.
>
> richard
>
> --
> Richard C. Pitt Pacific Data Capture
> rcpitt at pacdat.net 604-644-9265
> http://digital-rag.com www.pacdat.net
> PGP Fingerprint: FCEF 167D 151B 64C4 3333 57F0 4F18 AF98 9F59 DD73
>
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat