[Bloat] Failure to convince
Richard Scheffenegger
rscheff at gmx.at
Fri Feb 11 14:23:13 EST 2011
some comments:
a) I can confirm the rumor about data center switches getting more and more
buffering - SRAM is "dead cheap" (or at least its cost can be argued away),
and afaik switches with multi-gigabyte buffers (10/40GE) are in the
pipelines of numerous vendors.
b) even a wire-speed switch will eventually have to buffer (or drop)
packets. As soon as you don't operate a network with strict 1:1
connectivity (each single host talks only to one other host - as you can
see, quite an unrealistic scenario), but with multiple hosts potentially
talking at the same time to the same host, even your perfect wire-speed
switch will need to send out up to twice the bandwidth of the link to the
receiving end...
For some reason, people don't appear to think about that scenario - which
is actually the most common scenario in networking (if all your
connectivity were 1:1, why would you need a switch in the first place...).
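To make that concrete, here is a back-of-the-envelope sketch in Python.
All numbers (link speed, sender count, burst length) are illustrative
assumptions, not measurements:

  # Two hosts bursting at line rate into one egress port: the excess has
  # to queue (or be dropped), even though no single link exceeds wire speed.
  LINK_GBPS = 10        # all links 10GE (assumption)
  SENDERS = 2           # two hosts bursting to the same receiver
  BURST_MS = 1.0        # both burst for 1 ms simultaneously

  ingress = SENDERS * LINK_GBPS                        # 20 Gbit/s arriving
  egress = LINK_GBPS                                   # 10 Gbit/s leaving
  excess_gbit = (ingress - egress) * BURST_MS / 1000.0 # must be buffered
  print(f"queued during the burst: {excess_gbit * 1e9 / 8 / 1e6:.2f} MB")
  # -> 1.25 MB of buffer needed for a single 1 ms two-sender burst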
Furthermore, emerging protocols in the data center, such as pNFS (NFSv4.1),
will cause many more boundary-synchronized data streams. That typically
leads to Incast / burst drops (multiple senders overloading the egress port
of a single receiver, with TCP reacting badly to lost retransmissions...).
More buffering helps somewhat, but drives up latency - as you know.
Nevertheless, there are many papers about Incast / data center networks
where tuning TCP timing parameters and adding more buffering are presented
as mitigation strategies.
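The buffering-vs-latency trade-off is easy to quantify: every byte of
occupied buffer adds serialization delay at the egress link. A small
sketch (the 700 MB figure anticipates the switch quote below; the rest are
assumptions):

  # Added latency of a fully occupied buffer draining over a 10GE port.
  LINK_BYTES_PER_S = 10e9 / 8          # 10GE egress in bytes/s
  for buf_mb in (1, 16, 128, 700):
      delay_ms = buf_mb * 1e6 / LINK_BYTES_PER_S * 1000
      print(f"{buf_mb:>4} MB buffer -> up to {delay_ms:8.2f} ms queueing delay")
  # a full 700 MB buffer on a 10GE port adds over half a second of latency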
c) IETF-standards-compliant NewReno TCP not being able to utilize a high
speed link when there is 1% packet loss (the sum of the loss at different
intermediate links) is quite old news. However, even TCP stacks have moved
on. For a couple of years now, SACK (mostly with poor man's FACK) has been
used almost ubiquitously (about 85-90% of all TCP sessions negotiate SACK
today - more than use timestamps - according to the last stats I know of).
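The 1%-loss claim is easy to sanity-check with the well-known Mathis et
al. approximation for Reno-style TCP, throughput <= (MSS/RTT) * C/sqrt(p)
with C ~ 1.22. MSS and RTT below are assumptions for illustration:

  import math

  MSS = 1460     # bytes, typical Ethernet payload (assumption)
  RTT = 0.05     # 50 ms round trip (assumption)
  p = 0.01       # 1% packet loss, as in the scenario above

  bps = (MSS * 8 / RTT) * 1.22 / math.sqrt(p)
  print(f"~{bps / 1e6:.1f} Mbit/s")   # ~2.8 Mbit/s - nowhere near 10GE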
But even IETF-compliant SACK is very conservative, and doesn't utilize all
the information obtained through the signalling to make best use of the
available bandwidth. "Fortunately", non-standards-compliant stacks based
on Linux are deployed increasingly on the server end, and the Linux TCP
stack is much better at delivering nearly optimal goodput vs. throughput...
Most notably, and easiest to detect remotely, is the capability of Linux to
retransmit lost retransmissions under most circumstances. All other stacks
require the retransmission timeout (RTO) timer to fire to recover from a
lost retransmission... Only a few minor corner cases could be tweaked
without additional / modified signalling (i.e. SACK recovery at
end-of-stream behaves like Reno, not like NewReno -> RTO recovery needed).
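For reference, the relevant Linux knobs can be inspected directly; this
little check only reads the standard /proc paths and degrades gracefully
on kernels that lack one of them:

  # Which of the recovery-related TCP options discussed above are enabled?
  for knob in ("tcp_sack", "tcp_fack", "tcp_timestamps"):
      try:
          with open(f"/proc/sys/net/ipv4/{knob}") as f:
              state = "on" if f.read().strip() != "0" else "off"
      except FileNotFoundError:
          state = "not available on this kernel"
      print(f"{knob}: {state}")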
With modified signalling (timestamps), not only would uni-directional
latency variation measurements become possible (see Chirp-TCP, LEDBAT,
µTP), but an even better loss recovery strategy (than Linux's; not talking
about IETF-compliant stacks, which are way behind) would be feasible, one
which could recover lost retransmissions even sooner than Linux - with
only 1 RTT; Linux currently requires about 2-3 RTTs to unambiguously
detect, and then recover from, lost retransmissions.
Again, the key here is RTT - bufferbloat artificially inflates this servo
feedback loop, making improvements in the loss recovery strategy much less
meaningful.
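The core LEDBAT/µTP idea mentioned above fits in a few lines: with a
(sender timestamp, receiver timestamp) pair per packet, one-way delay
*variation* is measurable even with unsynchronized clocks, because the
unknown clock offset cancels out. A minimal sketch, names and sample data
purely illustrative:

  BASE_HISTORY = 10

  def owd_variation(samples):
      """samples: list of (send_ts, recv_ts) in seconds; clocks may be offset."""
      owds = [recv - send for send, recv in samples]
      base = min(owds[-BASE_HISTORY:])  # running minimum ~ uncongested
                                        # one-way delay + clock offset
      return owds[-1] - base            # queueing delay estimate; offset cancels

  # a growing queue shows up as rising variation (here: +2 ms per sample),
  # even though the absolute one-way delay is unknowable:
  samples = [(t, t + 100.0 + 0.002 * t) for t in range(20)]
  print(f"estimated queueing delay: {owd_variation(samples) * 1000:.1f} ms")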
d) If he is so concerned about packet loss, why hasn't he deployed ECN
then - for explicit marking of flows that cause buffers to grow? That's a
10+ year old standard, almost as old as SACK. With ECN, one can have one's
cake and eat it too - AND widespread ECN marking (at least on congested
edge networks plus L2 switches!) would allow more innovation with
transport protocols (read about Re-ECN and DCTCP). ECN would allow a
closer coupling between the network and the edge devices, while still
keeping the complexity of the main control loop in the edge - the
foundation that allowed the internet to prosper.
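On Linux, the per-host ECN switch is the net.ipv4.tcp_ecn sysctl (0 = off,
1 = request ECN on connections, 2 = accept it when the peer asks). A quick
read-only check:

  with open("/proc/sys/net/ipv4/tcp_ecn") as f:
      mode = int(f.read().strip())
  print({0: "ECN disabled",
         1: "ECN requested on outgoing connections",
         2: "ECN accepted when requested by the peer"}.get(mode, f"mode {mode}"))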
Best regards,
Richard
----- Original Message -----
From: "richard" <richard at pacdat.net>
To: "bloat" <bloat at lists.bufferbloat.net>
Sent: Friday, February 11, 2011 5:29 PM
Subject: [Bloat] Failure to convince
>I had an email exchange yesterday with the top routing person at a local
> ISP. Unlike my exchanges with non-tech people, this one ended
> with him saying Bufferbloat was not a problem because...
>
> "I for for one never want to see packet loss. I spent several years
> working on a national US IP network, and it was nothing but complaints
> from customers about 1% packet loss between two points. Network
> engineers hate packet loss, because it generates so many complaints.
> And packet loss punishes TCP more than deep buffers.
>
> So I'm sure that you can find a bunch of network engineers who think
> big buffers are bad. But the trend in network equipment in 2010 and
> 2011 has been even deeper buffers. Vendors started shipping data
> centre switches with over 700MB of buffer space. Large buffers are
> needed to flatten out microbursts. But these are also intelligent
> buffers."
>
> His point about network people hating packet loss highlights the problem
> we'll have educating them and the purchasing public that at least some
> loss is necessary for TCP to function.
>
> Not having been in charge of a major backbone recently, I have to admit
> my understanding was that today's switching hardware deals with
> everything "at wire speed" via cut-through switching, unlike the
> store-and-forward switches and routers typical at the consumer
> level.
>
> richard
>
> --
> Richard C. Pitt Pacific Data Capture
> rcpitt at pacdat.net 604-644-9265
> http://digital-rag.com www.pacdat.net
> PGP Fingerprint: FCEF 167D 151B 64C4 3333 57F0 4F18 AF98 9F59 DD73
>
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat