From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dpreed@reed.com>
Received: from smtp89.iad3a.emailsrvr.com (smtp89.iad3a.emailsrvr.com
 [173.203.187.89])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by lists.bufferbloat.net (Postfix) with ESMTPS id 5CE293B260
 for <cerowrt-devel@lists.bufferbloat.net>;
 Mon,  6 Jun 2016 22:52:38 -0400 (EDT)
Received: from smtp20.relay.iad3a.emailsrvr.com (localhost.localdomain
 [127.0.0.1])
 by smtp20.relay.iad3a.emailsrvr.com (SMTP Server) with ESMTP id 2122718041C;
 Mon,  6 Jun 2016 22:52:38 -0400 (EDT)
Received: from app5.wa-webapps.iad3a (relay-webapps.rsapps.net
 [172.27.255.140])
 by smtp20.relay.iad3a.emailsrvr.com (SMTP Server) with ESMTP id F23811801C9;
 Mon,  6 Jun 2016 22:52:37 -0400 (EDT)
X-Sender-Id: dpreed@reed.com
Received: from app5.wa-webapps.iad3a (relay-webapps.rsapps.net
 [172.27.255.140]) by 0.0.0.0:25 (trex/5.5.4);
 Mon, 06 Jun 2016 22:52:38 -0400
Received: from reed.com (localhost [127.0.0.1])
 by app5.wa-webapps.iad3a (Postfix) with ESMTP id E17C7A1B18;
 Mon,  6 Jun 2016 22:52:37 -0400 (EDT)
Received: by apps.rackspace.com
 (Authenticated sender: dpreed@reed.com, from: dpreed@reed.com) 
 with HTTP; Mon, 6 Jun 2016 22:52:37 -0400 (EDT)
Date: Mon, 6 Jun 2016 22:52:37 -0400 (EDT)
From: dpreed@reed.com
To: "Ketan Kulkarni" <ketkulka@gmail.com>
Cc: "Mikael Abrahamsson" <swmike@swm.pp.se>,
 "Jonathan Morton" <chromatix99@gmail.com>,
 "cerowrt-devel@lists.bufferbloat.net" <cerowrt-devel@lists.bufferbloat.net>
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="----=_20160606225237000000_23875"
Importance: Normal
X-Priority: 3 (Normal)
X-Type: html
In-Reply-To: <CAD6NSj6vA=bjHt3Txyw8VuV9tqg-A7wvLd6ovJG4Jxabvvjw4g@mail.gmail.com>
References: <55fdf513-9c54-bea9-1f53-fe2c5229d7ba@eggo.org>
 <871t4as1h9.fsf@toke.dk> 
 <3D32F19B-5DEA-48AD-97E7-D043C4EAEC51@gmail.com> 
 <alpine.DEB.2.02.1606062029380.28955@uplift.swm.pp.se> 
 <CAD6NSj6vA=bjHt3Txyw8VuV9tqg-A7wvLd6ovJG4Jxabvvjw4g@mail.gmail.com>
X-Auth-ID: dpreed@reed.com
Message-ID: <1465267957.902610235@apps.rackspace.com>
X-Mailer: webmail/12.4.2-RC
Subject: Re: [Cerowrt-devel] trying to make sense of what switch vendors say
	wrt buffer bloat
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Development issues regarding the cerowrt test router project
 <cerowrt-devel.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/cerowrt-devel>,
 <mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/cerowrt-devel>
List-Post: <mailto:cerowrt-devel@lists.bufferbloat.net>
List-Help: <mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/cerowrt-devel>,
 <mailto:cerowrt-devel-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 02:52:38 -0000

------=_20160606225237000000_23875
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

=0ASo did anyone write a response debunking their paper?   Their NS-2 simul=
ation is most likely the erroneous part of their analysis - the white paper=
 would not pass a review by qualified referees because there is no way to c=
heck their results and some of what they say beggars belief.=0A =0ABechtols=
heim is one of those guys who can write any damn thing and it becomes "trut=
h" - mostly because he co-founded Sun. But that doesn't mean that he can't =
make huge errors - any of us can.=0A =0AThe so-called TCP/IP Bandwidth Capt=
ure effect that he refers to doesn't sound like any capture effect I've eve=
r heard of.  There is an "Ethernet Capture Effect" (which is cited), which =
is due to properties of CSMA/CD binary exponential backoff, not anything to=
 do with TCP's flow/congestion control.  So it has that "truthiness" that m=
akes glib people sound like they know what they are talking about, but I'd =
like to see a reference that says this is a property of TCP!=0A =0AWhat's i=
nteresting is that the reference to the Ethernet Capture Effect in that whi=
te paper proposes a solution that involves changing the backoff algorithm s=
lightly at the Ethernet level - NOT increasing buffer size!=0A =0AAnother t=
hing that would probably improve matters a great deal would be to drop/ECN-=
mark packets when a contended output port on an Arista switch develops a ba=
cklog.  This will throttle TCP sources sharing the path.=0A =0AThe comments=
 in the white paper that say that ACK contention in TCP in the reverse dire=
ction is the problem that causes the "so-called TCP/IP Bandwidth Capture e=
ffect" that is invented by the authors appear to be hogwash of the first o=
rder.=0A =0ADebunking Bechtolsheim credibly would get a lot of attention to=
 the bufferbloat cause, I suspect.=0A =0A=0A=0AOn Monday, June 6, 2016 5:16=
pm, "Ketan Kulkarni" <ketkulka@gmail.com> said:=0A=0A=0A=0Asome time back t=
hey had this whitepaper -=0A"Why Big Data Needs Big Buffer Switches"=0A=0A[=
 http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf=
 ]( http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.=
pdf )=0Athe type of apps they talk about is big data, hadoop etc=0A=0A=0AOn=
 Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <[ swmike@swm.pp.se ]( ma=
ilto:swmike@swm.pp.se )> wrote:=0AOn Mon, 6 Jun 2016, Jonathan Morton wrote=
:=0A=0AAt 100ms buffering, their 10Gbps switch is effectively turning any D=
C it=E2=80=99s installed in into a transcontinental Internet path, as far a=
s peak latency is concerned.  Just because RAM is cheap these days=E2=80=A6=
=0A=0ANono, nononononono. I can tell you they're spending serious money on=
 insert=
ing this kind of buffering memory into these kinds of devices. Buying these=
 devices without deep buffers is a lot lower cost.=0A=0A These types of swi=
tch chips either have on-die memory (usually 16MB or less), or they have ve=
ry expensive (a direct cost of lowered port density) off-chip buffering mem=
ory.=0A=0A Typically you do this:=0A=0A ports ---|-------=0A ports ---|    =
  |=0A ports ---| chip |=0A ports ---|-------=0A=0A Or you do this=0A=0A po=
rts ---|------|---buffer=0A ports ---| chip |---TCAM=0A          --------=
=0A=0A or if you do a multi-linecard-device=0A=0A ports ---|------|---buffe=
r=0A          | chip |---TCAM=0A          --------=0A             |=0A     =
    switch fabric=0A=0A (or any variant of them)=0A=0A So basically if you =
want to buffer and if you want large L2-L4 lookup tables, you have to sacri=
fice ports. Sacrifice lots of ports.=0A=0A So never say these kinds of devi=
ces add buffering because RAM is cheap. This is most definitely not why the=
y're doing it. Buffer memory for them is EXTREMELY EXPENSIVE.=0A=0A -- =0A =
Mikael Abrahamsson    email: [ swmike@swm.pp.se ]( mailto:swmike@swm.pp.se =
)=0A_______________________________________________=0A Cerowrt-devel mailin=
g list=0A[ Cerowrt-devel@lists.bufferbloat.net ]( mailto:Cerowrt-devel@list=
s.bufferbloat.net )=0A[ https://lists.bufferbloat.net/listinfo/cerowrt-deve=
l ]( https://lists.bufferbloat.net/listinfo/cerowrt-devel )=0A=0A
------=_20160606225237000000_23875
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<font face=3D"times new roman" size=3D"2"><p style=3D"margin:0;padding:0;fo=
nt-family: 'times new roman'; font-size: 10pt; word-wrap: break-word;">So d=
id anyone write a response&nbsp;debunking their paper? &nbsp; Their NS-2 si=
mulation is most likely the erroneous part of their analysis - the white pa=
per would not pass a review by qualified referees because there is no way t=
o check their results and some of what they say beggars belief.</p>=0A<p st=
yle=3D"margin:0;padding:0;font-family: 'times new roman'; font-size: 10pt; =
word-wrap: break-word;">&nbsp;</p>=0A<p style=3D"margin:0;padding:0;font-fa=
mily: 'times new roman'; font-size: 10pt; word-wrap: break-word;">Bechtolsh=
eim is one of those guys who can write any damn thing and it becomes "truth=
" - mostly because he co-founded Sun. But that doesn't mean that he can't m=
ake huge errors - any of us can.</p>=0A<p style=3D"margin:0;padding:0;font-=
family: 'times new roman'; font-size: 10pt; word-wrap: break-word;">&nbsp;<=
/p>=0A<p style=3D"margin:0;padding:0;font-family: 'times new roman'; font-s=
ize: 10pt; word-wrap: break-word;">The so-called TCP/IP Bandwidth Capture e=
ffect that he refers to doesn't sound like any capture effect I've ever hea=
rd of. &nbsp;There is an "Ethernet Capture Effect" (which is cited), which =
is due to properties of CSMA/CD binary exponential backoff, not anything to=
 do with TCP's flow/congestion control. &nbsp;So it has that "truthiness" t=
hat makes glib people sound like they know what they are talking about, but=
 I'd like to see a reference that says this is a property of TCP!</p>=0A<p =
style=3D"margin:0;padding:0;font-family: 'times new roman'; font-size: 10pt=
; word-wrap: break-word;">&nbsp;</p>=0A<p style=3D"margin:0;padding:0;font-=
family: 'times new roman'; font-size: 10pt; word-wrap: break-word;">What's =
interesting is that the reference to the Ethernet Capture Effect in that wh=
ite paper proposes a solution that involves changing the backoff algorithm =
slightly at the Ethernet level - NOT increasing buffer size!</p>=0A<p style=
=3D"margin:0;padding:0;font-family: 'times new roman'; font-size: 10pt; wor=
d-wrap: break-word;">&nbsp;</p>=0A<p style=3D"margin:0;padding:0;font-famil=
y: 'times new roman'; font-size: 10pt; word-wrap: break-word;">Another thin=
g that would probably improve matters a great deal would be to drop/ECN-mar=
k packets when a contended output port on an Arista switch develops a backl=
og. &nbsp;This will throttle TCP sources sharing the path.</p>=0A<p style=
=3D"margin:0;padding:0;font-family: 'times new roman'; font-size: 10pt; wor=
d-wrap: break-word;">&nbsp;</p>=0A<p style=3D"margin:0;padding:0;font-famil=
y: 'times new roman'; font-size: 10pt; word-wrap: break-word;">The comments=
 in the white paper that say that ACK contention in TCP in the reverse dire=
ction is the problem that causes the "so-called TCP/IP Bandwidth Capture e=
ffect" that is invented by the authors appear to be hogwash of the first o=
rder.</p>=0A<p style=3D"margin:0;padding:0;font-family: 'times new roman'; =
font-size: 10pt; word-wrap: break-word;">&nbsp;</p>=0A<p style=3D"margin:0;=
padding:0;font-family: 'times new roman'; font-size: 10pt; word-wrap: break=
-word;">Debunking Bechtolsheim credibly would get a lot of attention to the=
 bufferbloat cause, I suspect.</p>=0A<p style=3D"margin:0;padding:0;font-fa=
mily: 'times new roman'; font-size: 10pt; word-wrap: break-word;">&nbsp;</p=
>=0A<!--WM_COMPOSE_SIGNATURE_START--><!--WM_COMPOSE_SIGNATURE_END-->=0A<p s=
tyle=3D"margin:0;padding:0;font-family: 'times new roman'; font-size: 10pt;=
 word-wrap: break-word;"><br /><br />On Monday, June 6, 2016 5:16pm, "Ketan=
 Kulkarni" &lt;ketkulka@gmail.com&gt; said:<br /><br /></p>=0A<div id=3D"Sa=
feStyles1465266791">=0A<div dir=3D"ltr">some time back they had this whitep=
aper -=0A<div>"Why Big Data Needs Big Buffer Switches"<br />=0A<div><a href=
=3D"http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.=
pdf">http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP=
.pdf</a></div>=0A</div>=0A<div>the type of apps they talk about is big data=
, hadoop etc</div>=0A</div>=0A<div class=3D"gmail_extra"><br />=0A<div clas=
s=3D"gmail_quote">On Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <span=
 dir=3D"ltr">&lt;<a href=3D"mailto:swmike@swm.pp.se" target=3D"_blank">swmi=
ke@swm.pp.se</a>&gt;</span> wrote:<br />=0A<blockquote class=3D"gmail_quote=
" style=3D"margin: 0 0 0 .8ex; border-left: 1px #ccc solid; padding-left: 1=
ex;"><span class=3D""><span class=3D"">On Mon, 6 Jun 2016, Jonathan Morton =
wrote:<br /><br /></span></span>=0A<blockquote class=3D"gmail_quote" style=
=3D"margin: 0 0 0 .8ex; border-left: 1px #ccc solid; padding-left: 1ex;">At=
 100ms buffering, their 10Gbps switch is effectively turning any DC it=E2=
=80=99s installed in into a transcontinental Internet path, as far as peak =
latency is concerned.&nbsp; Just because RAM is cheap these days=E2=80=A6</=
blockquote>=0ANono, nononononono. I can tell you they're spending serious m=
oney on inserting this kind of buffering memory into these kinds of devices=
. Buying these devices without deep buffers is a lot lower cost.<br /><br /=
> These types of switch chips either have on-die memory (usually 16MB or le=
ss), or they have very expensive (a direct cost of lowered port density) of=
f-chip buffering memory.<br /><br /> Typically you do this:<br /><br /> por=
ts ---|-------<br /> ports ---|&nbsp; &nbsp; &nbsp; |<br /> ports ---| chip=
 |<br /> ports ---|-------<br /><br /> Or you do this<br /><br /> ports ---=
|------|---buffer<br /> ports ---| chip |---TCAM<br /> &nbsp; &nbsp; &nbsp;=
 &nbsp; &nbsp;--------<br /><br /> or if you do a multi-linecard-device<br =
/><br /> ports ---|------|---buffer<br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp=
;| chip |---TCAM<br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;--------<br /> &nb=
sp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; |<br /> &nbsp; &nbsp; &nbsp; &nbsp; =
switch fabric<br /><br /> (or any variant of them)<br /><br /> So basically=
 if you want to buffer and if you want large L2-L4 lookup tables, you have =
to sacrifice ports. Sacrifice lots of ports.<br /><br /> So never say these=
 kinds of devices add buffering because RAM is cheap. This is most definite=
ly not why they're doing it. Buffer memory for them is EXTREMELY EXPENSIVE.=
<span class=3D"HOEnZb"><span style=3D"color: #888888;"><br /><br /> -- <br =
/> Mikael Abrahamsson&nbsp; &nbsp; email: <a href=3D"mailto:swmike@swm.pp.s=
e" target=3D"_blank">swmike@swm.pp.se</a></span></span><br />______________=
_________________________________<br /> Cerowrt-devel mailing list<br /><a =
href=3D"mailto:Cerowrt-devel@lists.bufferbloat.net">Cerowrt-devel@lists.buf=
ferbloat.net</a><br /><a rel=3D"noreferrer" href=3D"https://lists.bufferblo=
at.net/listinfo/cerowrt-devel" target=3D"_blank">https://lists.bufferbloat.=
net/listinfo/cerowrt-devel</a><br /><br /></blockquote>=0A</div>=0A</div>=
=0A</div></font>
------=_20160606225237000000_23875--