On 27 January 2017 at 15:40, Eric Dumazet wrote:

> On Thu, 2017-01-26 at 23:55 -0800, Dave Täht wrote:
> >
> > On 1/26/17 11:21 PM, Hans-Kristian Bakke wrote:
> > > Hi
> > >
> > > After having had some issues with inconsistent tso/gso configuration
> > > causing performance issues for sch_fq with pacing in one of my
> > > systems, I wonder if it is still recommended to disable gso/tso for
> > > interfaces used with fq_codel qdiscs and shaping using HTB etc.
> >
> > At lower bandwidths gro can do terrible things. Say you have a 1Mbit
> > uplink, and IW10. (At least one device (mvneta) will synthesise 64k of
> > gro packets)
> >
> > A single IW10 burst from one flow injects 130ms of latency.
>
> That is simply a sign of something bad happening from the source.
>
> The router will spend too much time trying to fix the TCP sender by
> smoothing things.
>
> Let's fix the root cause, instead of making everything slow or burning
> megawatts.
>
> GRO aggregates trains of packets for the same flow, in a sub-ms window.
>
> Why? Because GRO cannot predict the future: it cannot know when the
> next interrupt might come from the device saying "here are some
> additional packet(s)". Maybe the next packet is coming in 5 seconds.
>
> Take a look at napi_poll():
>
> 1) If the device driver called napi_complete(), all packets are flushed
> (given) to the upper stack. No packet will wait in GRO for additional
> segments.
>
> 2) Under flood (we exhausted the napi budget and did not call
> napi_complete()), we make sure no packet can sit in GRO for more than
> 1 ms.
>
> Only when the device is under flood and the CPU cannot drain the RX
> queue fast enough does GRO aggregate packets more aggressively, and the
> size of the GRO packets exactly fits the CPU budget.
>
> In a nutshell, GRO is exactly the mechanism that adapts the packet
> sizes to the available CPU power.
>
> If your CPU is really fast, then it will dequeue one packet at a time
> and GRO won't kick in.
>
> So the real problem here is that some device drivers implemented a poor
> interrupt mitigation logic, inherited from other OSes that did not have
> GRO and _had_ to implement their own crap, hurting latencies.
>
> Make sure you disable interrupt mitigation, and leave GRO enabled.
>
> e1000e is notoriously bad for interrupt mitigation.
>
> At Google, we let the NIC send its RX interrupt ASAP.

Interesting. Do I understand you correctly that you basically recommend
loading the e1000e module with InterruptThrottleRate set to 0, or is
interrupt mitigation something else?

options e1000e InterruptThrottleRate=0(,0,0,0...)

https://www.kernel.org/doc/Documentation/networking/e1000e.txt

I haven't fiddled with InterruptThrottleRate since before I even heard
of bufferbloat.
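To make sure I am reading you right, this is roughly the configuration I
have in mind. It is only a sketch, untested here; the interface name,
the two-entry option list and the rx-usecs handling are assumptions on
my side:

    # /etc/modprobe.d/e1000e.conf: disable interrupt throttling on both ports
    options e1000e InterruptThrottleRate=0,0

    # reload the driver so the option takes effect (drops the link briefly)
    modprobe -r e1000e && modprobe e1000e

    # or, if the driver accepts runtime coalescing changes, roughly the
    # same effect via ethtool: fire the RX interrupt as soon as possible
    ethtool -c eth0
    ethtool -C eth0 rx-usecs 0

    # and, per your advice, leave GRO enabled rather than turning it off
    ethtool -K eth0 gro on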
> Every usec matters.
>
> So the model for us is very clear: use GRO and TSO as much as we can,
> but make sure the producers (TCP senders) are smart and control their
> burst sizes.
>
> Think about 50Gbit and 100Gbit, and really the question of having TSO
> and GRO or not is simply moot.
>
> Even at 1Gbit, GRO is helping to reduce cpu cycles and thus reduce
> latencies.
>
> Adding a sysctl to limit GRO max size would be trivial, I already
> mentioned that, but nobody cared enough to send a patch.
>
> > > If there is a trade-off, at which bandwidth does it generally make
> > > more sense to enable tso/gso than to have it disabled when doing
> > > HTB-shaped fq_codel qdiscs?
> >
> > I stopped caring about tuning params at > 40Mbit, < 10Gbit, or
> > rather, trying to get below 200 usec of jitter|latency. (Others care)
> >
> > And: My expectation was generally that people would ignore our
> > recommendations on disabling offloads!
> >
> > Yes, we should revise the sample sqm code and recommendations for a
> > post-gigabit era to not bother with changing network offloads. Were
> > you modifying the old debloat script?
> >
> > TBF & sch_cake do peeling of gro/tso/gso back into packets, and then
> > interleave their scheduling, so GRO is both helpful (transiting the
> > stack faster) and harmless, at all bandwidths.
> >
> > HTB doesn't peel. We just ripped out hfsc from sqm-scripts (too
> > buggy), also. Leaving: tbf + fq_codel, htb + fq_codel, and cake
> > models there.
> >
> > ...
> >
> > Cake is coming along nicely. I'd love a test in your 2Gbit bonding
> > scenario, particularly in a per-host fairness test, at line or shaped
> > rates. We recently got cake working well with nat.
> >
> > http://blog.cerowrt.org/flent/steam/down_working.svg (ignore the
> > latency figure, the 6 flows were to spots all over the world)
> >
> > > Regards,
> > > Hans-Kristian
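For the record, the offload toggles being discussed are cheap to flip
and A/B test at runtime. These are the commands I mean; eth0 is just a
placeholder for the shaped interface:

    # show the current offload settings
    ethtool -k eth0 | egrep 'segmentation|receive-offload'

    # the old sqm-style advice: disable offloads on the shaped interface
    ethtool -K eth0 tso off gso off gro off

    # the post-gigabit advice discussed here: leave them on
    ethtool -K eth0 tso on gso on gro on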
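And the two shaping models Dave mentions would look roughly like this
on my end. Again only a sketch: eth0 and the 900mbit figure are
placeholders for my bonded setup, and cake here still means the
out-of-tree sch_cake module:

    # classic sqm-style shaping: HTB rate limiter with an fq_codel leaf
    tc qdisc replace dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:10 htb rate 900mbit ceil 900mbit
    tc qdisc add dev eth0 parent 1:10 fq_codel

    # the cake equivalent in one line, with NAT-aware per-host fairness
    tc qdisc replace dev eth0 root cake bandwidth 900mbit nat dual-srchost

(dual-srchost gives per-host fairness between local senders on egress;
dual-dsthost would be the ingress-side counterpart. I am not sure which
one the per-host fairness test is meant to exercise.)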