From: Hans-Kristian Bakke
Date: Fri, 27 Jan 2017 20:57:02 +0100
To: bloat <bloat@lists.bufferbloat.net>
Subject: [Bloat] Fwd: Recommendations for fq_codel and tso/gso in 2017

On 27 January 2017 at 15:40, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-01-26 at 23:55 -0800, Dave Täht wrote:
> >
> > On 1/26/17 11:21 PM, Hans-Kristian Bakke wrote:
> > > Hi
> > >
> > > After having had some issues with inconsistent tso/gso configuration
> > > causing performance issues for sch_fq with pacing in one of my systems,
> > > I wonder if it is still recommended to disable gso/tso for interfaces
> > > used with fq_codel qdiscs and shaping using HTB etc.
> >
> > At lower bandwidths GRO can do terrible things. Say you have a 1 Mbit
> > uplink, and IW10. (At least one device (mvneta) will synthesise 64k of
> > GRO packets.)
> >
> > A single IW10 burst from one flow injects 130 ms of latency.
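
(For reference: the offloads in question are standard per-interface
ethtool features. A minimal sketch of checking and toggling them,
assuming an interface named eth0:)

    # show the current offload state for TSO/GSO/GRO
    ethtool -k eth0 | grep -E 'tcp-segmentation|generic-segmentation|generic-receive'

    # disable TSO/GSO/GRO on eth0; use "on" to re-enable
    ethtool -K eth0 tso off gso off gro off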

> That is simply a sign of something bad happening at the source.
>
> The router will spend too much time trying to fix the TCP sender by
> smoothing things.
>
> Let's fix the root cause, instead of making everything slow or burning
> megawatts.
>
> GRO aggregates trains of packets for the same flow within a
> sub-millisecond window.
>
> Why? Because GRO cannot predict the future: it cannot know when the next
> interrupt might come from the device saying "here are some additional
> packets". Maybe the next packet is coming in 5 seconds.
>
> Take a look at napi_poll():
>
> 1) If the device driver called napi_complete(), all packets are flushed
> (given) to the upper stack. No packet will wait in GRO for additional
> segments.
>
> 2) Under flood (we exhausted the NAPI budget and did not call
> napi_complete()), we make sure no packet can sit in GRO for more than
> 1 ms.
>
> Only when the device is under flood and the CPU cannot drain the RX
> queue fast enough does GRO aggregate packets more aggressively, and the
> size of GRO packets exactly fits the CPU budget.
>
> In a nutshell, GRO is exactly the mechanism that adapts packet sizes to
> the available CPU power.
>
> If your CPU is really fast, then it will dequeue one packet at a time
> and GRO won't kick in.
>
> So the real problem here is that some device drivers implemented poor
> interrupt mitigation logic, inherited from other OSes that had no GRO
> and _had_ to implement their own crap, hurting latencies.
>
> Make sure you disable interrupt mitigation, and leave GRO enabled.
>
> e1000e is notoriously bad at interrupt mitigation.
>
> At Google, we let the NIC send its RX interrupt ASAP.

Interesting. Do I understand you correctly that you basically recommend
loading the e1000e module with InterruptThrottleRate set to 0, or is
interrupt mitigation something else?

options e1000e InterruptThrottleRate=0(,0,0,0...)

https://www.kernel.org/doc/Documentation/networking/e1000e.txt

I haven't fiddled with InterruptThrottleRate since before I even heard
of bufferbloat.
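
(A driver-independent way to do the same thing is ethtool's interrupt
coalescing interface, on NICs that support it; a sketch, assuming eth0,
with illustrative values:)

    # show current interrupt coalescing settings
    ethtool -c eth0

    # fire RX interrupts as soon as possible (per-driver support varies)
    ethtool -C eth0 rx-usecs 0 rx-frames 1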



> Every usec matters.
>
> So the model for us is very clear: use GRO and TSO as much as we can,
> but make sure the producers (TCP senders) are smart and control their
> burst sizes.
>
> Think about 50 Gbit and 100 Gbit, and really the question of having
> TSO and GRO or not is simply moot.
>
> Even at 1 Gbit, GRO helps to reduce CPU cycles and thus reduce
> latencies.
>
> Adding a sysctl to limit GRO max size would be trivial; I already
> mentioned that, but nobody cared enough to send a patch.
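
(Such a GRO sysctl did not exist at the time. The nearest per-device
knob in iproute2 caps transmit-side aggregation instead; a sketch,
assuming eth0, and note this limits GSO, not GRO:)

    # cap GSO super-packets handed to eth0 at 16 KB (default is 65536)
    ip link set dev eth0 gso_max_size 16384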

> >
> > > If there is a trade-off, at which bandwidth does it generally make
> > > more sense to enable tso/gso than to have it disabled when doing
> > > HTB-shaped fq_codel qdiscs?
> >
> > I stopped caring about tuning params at > 40 Mbit, < 10 Gbit, or
> > rather, trying to get below 200 usec of jitter|latency. (Others care.)
> >
> > And: my expectation was generally that people would ignore our
> > recommendations on disabling offloads!
> >
> > Yes, we should revise the sample sqm code and recommendations for a
> > post-gigabit era to not bother with changing network offloads. Were
> > you modifying the old debloat script?
> >
> > TBF & sch_cake do peeling of gro/tso/gso back into packets, and then
> > interleave their scheduling, so GRO is both helpful (transiting the
> > stack faster) and harmless, at all bandwidths.
> >
> > HTB doesn't peel. We just ripped out hfsc for sqm-scripts (too buggy),
> > also. Leaving: tbf + fq_codel, htb + fq_codel, and cake models there.
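
(A minimal htb + fq_codel egress shaper in the spirit of what
sqm-scripts builds; a sketch only, where eth0 and the 100 Mbit rate are
placeholders:)

    # one HTB class shaping all traffic, with fq_codel as its leaf qdisc
    tc qdisc add dev eth0 root handle 1: htb default 1
    tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit ceil 100mbit
    tc qdisc add dev eth0 parent 1:1 fq_codel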



> > ...
> >
> > Cake is coming along nicely. I'd love a test in your 2 Gbit bonding
> > scenario, particularly in a per-host fairness test, at line or shaped
> > rates. We recently got cake working well with nat.
> >
> > http://blog.cerowrt.org/flent/steam/down_working.svg (ignore the
> > latency figure, the 6 flows were to spots all over the world)
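
(The equivalent single-qdisc setup with cake, using its integral shaper
and the nat option mentioned above; a sketch, where eth0 and the rate
are placeholders:)

    # cake shapes, schedules per flow, and does per-host fairness behind NAT
    tc qdisc add dev eth0 root cake bandwidth 100mbit nat
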
> >
> > > Regards,
> > > Hans-Kristian