From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp89.iad3a.emailsrvr.com (smtp89.iad3a.emailsrvr.com [173.203.187.89])
    (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by lists.bufferbloat.net (Postfix) with ESMTPS id 5CE293B260
    for ; Mon, 6 Jun 2016 22:52:38 -0400 (EDT)
Received: from smtp20.relay.iad3a.emailsrvr.com (localhost.localdomain [127.0.0.1])
    by smtp20.relay.iad3a.emailsrvr.com (SMTP Server) with ESMTP id 2122718041C;
    Mon, 6 Jun 2016 22:52:38 -0400 (EDT)
Received: from app5.wa-webapps.iad3a (relay-webapps.rsapps.net [172.27.255.140])
    by smtp20.relay.iad3a.emailsrvr.com (SMTP Server) with ESMTP id F23811801C9;
    Mon, 6 Jun 2016 22:52:37 -0400 (EDT)
X-Sender-Id: dpreed@reed.com
Received: from app5.wa-webapps.iad3a (relay-webapps.rsapps.net [172.27.255.140])
    by 0.0.0.0:25 (trex/5.5.4);
    Mon, 06 Jun 2016 22:52:38 -0400
Received: from reed.com (localhost [127.0.0.1])
    by app5.wa-webapps.iad3a (Postfix) with ESMTP id E17C7A1B18;
    Mon, 6 Jun 2016 22:52:37 -0400 (EDT)
Received: by apps.rackspace.com
    (Authenticated sender: dpreed@reed.com, from: dpreed@reed.com)
    with HTTP; Mon, 6 Jun 2016 22:52:37 -0400 (EDT)
Date: Mon, 6 Jun 2016 22:52:37 -0400 (EDT)
From: dpreed@reed.com
To: "Ketan Kulkarni"
Cc: "Mikael Abrahamsson",
    "Jonathan Morton",
    "cerowrt-devel@lists.bufferbloat.net"
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="----=_20160606225237000000_23875"
Importance: Normal
X-Priority: 3 (Normal)
X-Type: html
In-Reply-To:
References: <55fdf513-9c54-bea9-1f53-fe2c5229d7ba@eggo.org>
    <871t4as1h9.fsf@toke.dk>
    <3D32F19B-5DEA-48AD-97E7-D043C4EAEC51@gmail.com>
X-Auth-ID: dpreed@reed.com
Message-ID: <1465267957.902610235@apps.rackspace.com>
X-Mailer: webmail/12.4.2-RC
Subject: Re: [Cerowrt-devel] trying to make sense of what switch vendors say
    wrt buffer bloat
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Development issues regarding the
    cerowrt test router project
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 07 Jun 2016 02:52:38 -0000

------=_20160606225237000000_23875
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

So did anyone write a response debunking their paper? Their NS-2 simulation is most likely the erroneous part of their analysis - the white paper would not pass a review by qualified referees because there is no way to check their results and some of what they say beggars belief.

Bechtolsheim is one of those guys who can write any damn thing and it becomes "truth" - mostly because he co-founded Sun. But that doesn't mean that he can't make huge errors - any of us can.

The so-called TCP/IP Bandwidth Capture effect that he refers to doesn't sound like any capture effect I've ever heard of. There is an "Ethernet Capture Effect" (which is cited), which is due to properties of CSMA/CD binary exponential backoff, not anything to do with TCP's flow/congestion control. So it has that "truthiness" that makes glib people sound like they know what they are talking about, but I'd like to see a reference that says this is a property of TCP!

What's interesting is that the reference to the Ethernet Capture Effect in that white paper proposes a solution that involves changing the backoff algorithm slightly at the Ethernet level - NOT increasing buffer size!

Another thing that would probably improve matters a great deal would be to drop/ECN-mark packets when a contended output port on an Arista switch develops a backlog. This will throttle TCP sources sharing the path.

The comments in the white paper that say that ACK contention in TCP in the reverse direction is the problem that causes the "so-called TCP/IP Bandwidth Capture effect" that is invented by the authors appear to be hogwash of the first order.

Debunking Bechtolsheim credibly would get a lot of attention to the bufferbloat cause, I suspect.


On Monday, June 6, 2016 5:16pm, "Ketan Kulkarni" <ketkulka@gmail.com> said:

some time back they had this whitepaper -
"Why Big Data Needs Big Buffer Switches"

http://www.arista.com/assets/data/pdf/Whitepapers/BigDataBigBuffers-WP.pdf
the type of apps they talk about is big data, hadoop etc


On Mon, Jun 6, 2016 at 11:37 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
On Mon, 6 Jun 2016, Jonathan Morton wrote:

At 100ms buffering, their 10Gbps switch is effectively turning any DC it’s installed in into a transcontinental Internet path, as far as peak latency is concerned. Just because RAM is cheap these days…

Nono, nononononono. I can tell you they're spending serious money on inserting this kind of buffering memory into these kinds of devices. Buying these devices without deep buffers is a lot lower cost.

These types of switch chips either have on-die memory (usually 16MB or less), or they have very expensive (a direct cost of lowered port density) off-chip buffering memory.

Typically you do this:

ports ---|-------
ports ---|      |
ports ---| chip |
ports ---|-------

Or you do this

ports ---|------|---buffer
ports ---| chip |---TCAM
         --------

or if you do a multi-linecard-device

ports ---|------|---buffer
         | chip |---TCAM
         --------
            |
        switch fabric

(or any variant of them)

So basically if you want to buffer and if you want large L2-L4 lookup tables, you have to sacrifice ports. Sacrifice lots of ports.

So never say these kinds of devices add buffering because RAM is cheap. This is most definitely not why they're doing it. Buffer memory for them is EXTREMELY EXPENSIVE.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
_______________________________________________
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel

------=_20160606225237000000_23875--
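
A rough sketch of the arithmetic behind the figures quoted in this thread (100 ms of buffering on a 10 Gbps port, 16 MB of on-die memory), together with the backlog-based drop/ECN-mark idea suggested in the reply. The function names, the 5 ms target, and the fixed-threshold policy are illustrative assumptions, not anything the white paper, Arista, or the posters specify; MB is taken as 10^6 bytes.

```python
# Illustrative only: names, the 5 ms target and the threshold policy are
# assumptions made for this sketch, not taken from any switch or vendor.

HYPOTHETICAL_TARGET_DELAY_S = 0.005  # assumed standing-delay target (5 ms)


def buffer_bytes(link_bps: float, delay_s: float) -> float:
    """Bytes of queue needed to hold delay_s seconds of traffic at link_bps."""
    return link_bps * delay_s / 8


def queue_delay_s(backlog_bytes: float, link_bps: float) -> float:
    """Time for a FIFO backlog to drain through one output port."""
    return backlog_bytes * 8 / link_bps


def should_mark(backlog_bytes: float, link_bps: float,
                target_delay_s: float = HYPOTHETICAL_TARGET_DELAY_S) -> bool:
    """Hypothetical policy: drop or ECN-mark once queueing delay exceeds the target."""
    return queue_delay_s(backlog_bytes, link_bps) > target_delay_s


if __name__ == "__main__":
    ten_gbps = 10e9
    print(buffer_bytes(ten_gbps, 0.100) / 1e6, "MB holds 100 ms at 10 Gbps")
    print(queue_delay_s(16e6, ten_gbps) * 1e3, "ms to drain 16 MB at 10 Gbps")
    print("mark at 16 MB backlog:", should_mark(16e6, ten_gbps))
```

Under these assumptions, 100 ms at 10 Gbps corresponds to 125 MB of queued data for a single output port, while a full 16 MB on-die buffer drains in 12.8 ms. Real AQM schemes are more elaborate than a fixed threshold; the sketch only shows where a backlog-based signal would sit relative to those numbers.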