From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp201.iad.emailsrvr.com (smtp201.iad.emailsrvr.com [207.97.245.201]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by huchra.bufferbloat.net (Postfix) with ESMTPS id E0EBC21F0BA for ; Thu, 21 Jun 2012 07:25:29 -0700 (PDT)
Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp50.relay.iad1a.emailsrvr.com (SMTP Server) with ESMTP id D1B64370AD7; Thu, 21 Jun 2012 10:25:26 -0400 (EDT)
X-Virus-Scanned: OK
Received: from legacy27.wa-web.iad1a (legacy27.wa-web.iad1a.rsapps.net [192.168.4.180]) by smtp50.relay.iad1a.emailsrvr.com (SMTP Server) with ESMTP id B69F5370A90; Thu, 21 Jun 2012 10:25:26 -0400 (EDT)
Received: from reed.com (localhost.localdomain [127.0.0.1]) by legacy27.wa-web.iad1a (Postfix) with ESMTP id A462D3E002F; Thu, 21 Jun 2012 10:25:26 -0400 (EDT)
Received: by apps.rackspace.com (Authenticated sender: dpreed@reed.com, from: dpreed@reed.com) with HTTP; Thu, 21 Jun 2012 10:25:26 -0400 (EDT)
Date: Thu, 21 Jun 2012 10:25:26 -0400 (EDT)
From: dpreed@reed.com
To: "Robert Bradley"
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_20120621102526000000_76707"
Importance: Normal
X-Priority: 3 (Normal)
X-Type: html
In-Reply-To:
References:
Message-ID: <1340288726.671712236@apps.rackspace.com>
X-Mailer: webmail7.0
Cc: cerowrt-devel@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] =?utf-8?q?Baby_jumbo_frames_support=3F?=
X-BeenThere: cerowrt-devel@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: Development issues regarding the cerowrt test router project
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 21 Jun 2012 14:25:32 -0000

------=_20120621102526000000_76707
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I understand Dave Taht's long lecture - actually understood it years ago. But frame aggregation is not the same thing as jumbo frames in a multi-technology Ethernet LAN. Jumbo frames provide a way to exploit *end-to-end* frame sizes greater than 1500 bytes. That means the source and destination TCPs get frames that are "whole" (and not random subassemblies of frames that may arrive close together in time).

9000 byte frames were invented for 1 GigE transports. Today's 802.11n and futures approach 1 GigE, and 1 GigE is the standard wiring for most homes, etc. It does not matter how the underlying radio links chop up the Ethernet frames, retransmit them, ack them, etc. The value I am discussing is at the *endpoints*.

It's tempting for transport link providers to *ignore* TCP and so forth when they design their transports, and focus only on transport-level efficiencies and reliabilities. This temptation created bufferbloat and also the excessive retry problem. (and in the past it created the historical predecessor of "bufferbloat" - Frame Relay's "Reliable delivery mode" which would go to extraordinary lengths to never drop a packet, including storing the packets *on disk* in some cases - talk about bloated buffers!)

The conversation here (including, but not limited to Taht's comments) shows exactly that *temptation*.

Aggregation is NOT the same as large frames. Not at all. It achieves internal efficiencies, but not the endpoint efficiencies of receiving a coherent frame, that can be processed immediately and by a single code path. At 1 Gigabit/sec this was important enough to introduce such frame sizes.
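
To put a rough number on the endpoint cost, the sketch below counts how many frames per second a receiver has to handle to keep a link full at a given MTU. It is only an illustration: the function name is made up here, and the 38 bytes of per-frame wire overhead assume untagged Ethernet.

```c
#include <stdint.h>

/* Assumed per-frame overhead on the wire, in bytes:
 * 14 (Ethernet header) + 4 (FCS) + 8 (preamble/SFD) + 12 (inter-frame gap). */
#define ETH_WIRE_OVERHEAD 38u

/* Hypothetical helper: number of full-size frames per second needed to
 * saturate a link of link_bps bits/s when each frame carries mtu bytes.
 * Integer division, so the result is rounded down. */
uint64_t frames_per_second(uint64_t link_bps, uint32_t mtu)
{
    uint64_t bits_per_frame = ((uint64_t)mtu + ETH_WIRE_OVERHEAD) * 8u;
    return link_bps / bits_per_frame;
}
```

Under these assumptions, frames_per_second(1000000000, 1500) gives 81274 and frames_per_second(1000000000, 9000) gives 13830, so with 9000-byte frames the endpoint handles roughly one sixth as many frames.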

The alternative ways to achieve the endpoint goals would be to allow reordering of data delivery to the endpoint app, perhaps by making SCTP work instead of TCP, using a flow/congestion/rate control mechanism other than a window on sequence numbers, etc. But that would mean changing the entire stack to a new end-to-end theory of operation.

There is a real tradeoff space, but unilaterally declaring that packet aggregation is the same as jumbo Ethernet frames is choosing a poor point in the tradeoff space.

Regarding "header overhead" - that is minor in the scheme of things. Obsessing about that indicates a lack of perspective on the systems level issues.
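
For a sense of scale on the header-overhead point, here is a minimal sketch that computes what share of the bytes on the wire is TCP payload at a given MTU. The function name is hypothetical, and the constants are assumptions: 40 bytes of IPv4 and TCP headers without options, and 38 bytes of Ethernet framing, preamble and inter-frame gap per frame.

```c
#include <stdint.h>

#define ETH_WIRE_OVERHEAD 38u  /* assumed: header 14 + FCS 4 + preamble 8 + gap 12 */
#define TCPIP_HEADERS     40u  /* assumed: IPv4 20 + TCP 20, no options */

/* Hypothetical helper: TCP payload as a share of all bytes on the wire,
 * in hundredths of a percent (10000 == 100%), rounded down.
 * Returns 0 when the MTU cannot even hold the headers. */
uint32_t payload_share(uint32_t mtu)
{
    if (mtu <= TCPIP_HEADERS)
        return 0;
    return (uint32_t)(((uint64_t)(mtu - TCPIP_HEADERS) * 10000u) /
                      ((uint64_t)mtu + ETH_WIRE_OVERHEAD));
}
```

With these assumptions the share comes out at 9492 (94.92%) for a 1500-byte MTU and 9913 (99.13%) for 9000 bytes, a difference of about four percentage points.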

-----Original Message-----
From: "Robert Bradley" <robert.bradley1@gmail.com>
Sent: Thursday, June 21, 2012 9:33am
To: cerowrt-devel@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] Baby jumbo frames support?

On 21 June 2012 01:58, Dave Taht <dave.taht@gmail.com> wrote:
> As for PPPoE with a size 1508... um... one or the other device is going
> to get in your way here. I presume that 1500 works? You would do
> better to contact the author of the driver (juhosg) to get your
> question answered as I'm under the impression he is under the right
> NDAs.
>

I think the point here is that MTU=1500 works, but once you add in the
PPPoE header, you end up with an effective MTU of 1492 for outbound
packets:

http://aa.net.uk/kb-broadband-mtu.html
http://tools.ietf.org/html/rfc4638

The short answer is that without baby-jumbo support, you either end up
fragmenting packets or you need to somehow restrict the MTU manually.
You can do that either through MSS clamping or simply configuring each
internal machine to use MTU=1492. To get around this, the BT ADSL
modems started to support MTU=1508. This means that the MTU within
the PPPoE tunnel remains at Ethernet-standard 1500, and avoids the
fragmentation or reconfiguration issues.
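
The MTU arithmetic above can be sketched as follows. The 8 bytes are the 6-byte PPPoE header plus the 2-byte PPP protocol field; the function names are made up for illustration, and the MSS figure assumes IPv4 and TCP headers without options (40 bytes).

```c
#include <stdint.h>

#define PPPOE_OVERHEAD 8u   /* 6-byte PPPoE header + 2-byte PPP protocol field */
#define TCPIP_HEADERS  40u  /* assumed: IPv4 20 + TCP 20, no options */

/* Hypothetical helper: MTU left inside the PPPoE tunnel for a given
 * Ethernet MTU on the link that carries the tunnel. */
uint32_t pppoe_inner_mtu(uint32_t eth_mtu)
{
    return eth_mtu > PPPOE_OVERHEAD ? eth_mtu - PPPOE_OVERHEAD : 0;
}

/* Hypothetical helper: TCP MSS to clamp to so that full-size segments
 * fit inside the tunnel without fragmentation. */
uint32_t clamped_mss(uint32_t eth_mtu)
{
    uint32_t inner = pppoe_inner_mtu(eth_mtu);
    return inner > TCPIP_HEADERS ? inner - TCPIP_HEADERS : 0;
}
```

Here pppoe_inner_mtu(1500) is 1492 and pppoe_inner_mtu(1508) is 1500; the corresponding clamped MSS values are 1452 and 1460.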

As for supporting it in CeroWRT ... the ag71xx driver defines
AG71XX_TX_MTU_LEN=1540, so it looks safe enough to use MTU 1508,
especially if you know that no vlans or other additions to the
standard header will be used. To enable that, you need to reimplement
the eth_change_mtu function for the driver. The current code uses the
kernel's implementation, which restricts the MTU to 1500. An initial,
naive patch would look something like:

----
--- C:/Users/robert/AppData/Local/Temp/ag71x-revBASE.svn000.tmp.c	Mon May 28 03:55:59 2012
+++ C:/Users/robert/Desktop/ag71xx/ag71xx_main.c	Thu Jun 21 13:58:44 2012
@@ -1042,13 +1042,25 @@
 }
 #endif

+/*
+ * Copied from eth_change_mtu and modified so that baby jumbo packets
+ * may be used.  This has not been tested!
+ */
+static int ag71xx_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < 68 || new_mtu > (ETH_DATA_LEN + 8))
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
 static const struct net_device_ops ag71xx_netdev_ops = {
 	.ndo_open		= ag71xx_open,
 	.ndo_stop		= ag71xx_stop,
 	.ndo_start_xmit		= ag71xx_hard_start_xmit,
 	.ndo_do_ioctl		= ag71xx_do_ioctl,
 	.ndo_tx_timeout		= ag71xx_tx_timeout,
-	.ndo_change_mtu		= eth_change_mtu,
+	.ndo_change_mtu		= ag71xx_change_mtu,
 	.ndo_set_mac_address	= eth_mac_addr,
 	.ndo_validate_addr	= eth_validate_addr,
 #ifdef CONFIG_NET_POLL_CONTROLLER

----

where I've copied the original function and changed the upper limit to
ETH_DATA_LEN+8, then set up the netdev_ops structure to call the new
version. In reality, you probably want to add some better checks
(testing for MTU+all possible headers<1540?) and remove the magic
constant - in the worst case, something closer to the e1000 driver's
implementation. I wouldn't recommend using the present version on
anything other than an experimental build, but the default MTU would
be 1500 anyway so should avoid causing too much damage. Those on BT
ADSL lines can change the MTU on ge00 themselves and see what breaks.
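
One possible shape for the "better checks" is sketched below as plain C that can be compiled outside the kernel. It is untested against the real driver: the helper names are hypothetical, the header sizes are defined locally with the values the kernel uses for ETH_HLEN, ETH_FCS_LEN and VLAN_HLEN, and it assumes that AG71XX_TX_MTU_LEN counts the whole frame including header and FCS, which would need checking against the hardware.

```c
#include <stdbool.h>

#define AG71XX_TX_MTU_LEN 1540  /* value quoted from the driver */
#define ETH_HLEN          14    /* Ethernet header */
#define ETH_FCS_LEN       4     /* frame check sequence */
#define VLAN_HLEN         4     /* one 802.1Q tag */
#define MIN_MTU           68    /* lower bound used in the patch above */

/* Hypothetical helper: largest MTU whose complete frame still fits the
 * TX limit when vlan_tags 802.1Q tags may be added.
 * ASSUMPTION: the limit covers header + tags + payload + FCS. */
int ag71xx_max_mtu(int vlan_tags)
{
    return AG71XX_TX_MTU_LEN - ETH_HLEN - ETH_FCS_LEN - vlan_tags * VLAN_HLEN;
}

/* Hypothetical helper: range check that a change_mtu handler could call
 * in place of the magic constant ETH_DATA_LEN + 8. */
bool ag71xx_mtu_ok(int new_mtu, int vlan_tags)
{
    return new_mtu >= MIN_MTU && new_mtu <= ag71xx_max_mtu(vlan_tags);
}
```

Under that assumption an untagged interface could go up to MTU 1522, and 1508 still fits with two VLAN tags (limit 1514).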
--
Robert Bradley
_______________________________________________
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel
------=_20120621102526000000_76707--