[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Mon Nov 16 07:05:23 EST 2020

On 16 Nov 2020, at 12:56, Jesper Dangaard Brouer wrote:

> On Fri, 13 Nov 2020 07:31:26 +0100
> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>
>> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
>>
>>> On Thu, 12 Nov 2020 14:42:59 +0100
>>> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>>>
>>>>> Notice "Adaptive" setting is on.  My long-shot theory(2) is that this
>>>>> adaptive algorithm in the driver code can guess wrong (due to not
>>>>> taking TSO into account) and cause issues for
>>>>>
>>>>> Try to turn this adaptive algorithm off:
>>>>>
>>>>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>>>
>>> [...]
>>>>>>
>>>>>> rx-usecs: 32
>>>>>
>>>>> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
>>>>> (calc 1/(32/10^6) = 31250).
>>>>>
>>>>>> rx-frames: 64
>>> [...]
>>>>>> tx-usecs-irq: 0
>>>>>> tx-frames-irq: 0
>>>>>>
>>>>> [...]
>>>>
>>>> I have now updated the settings to:
>>>>
>>>> ethtool -c eth4
>>>> Coalesce parameters for eth4:
>>>> Adaptive RX: off  TX: off
>>>> stats-block-usecs: 0
>>>> sample-interval: 0
>>>> pkt-rate-low: 0
>>>> pkt-rate-high: 0
>>>>
>>>> rx-usecs: 0
>>>
>>> Please put a value in rx-usecs, like 20 or 10.
>>> The value 0 is often used to signal the driver to use adaptive mode.
>>
>> Ok, put it now to 10.
>
> Setting it to 10 is a little aggressive, as you ask it to generate
> 100,000 interrupts per sec.  (Watch with 'vmstat 1' to see it.)
>
>  1/(10/10^6) = 100000 interrupts/sec
>
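
For reference, the arithmetic above as a small Python sketch. The function
name is my own, and it only follows the 1/(rx-usecs/10^6) formula quoted
above. It ignores rx-frames, which can fire an interrupt earlier, so treat
the result as a rough per-queue estimate, not as what the driver really does.

```python
def max_interrupts_per_sec(rx_usecs: int) -> float:
    """Estimate interrupts/sec for a fixed rx-usecs coalescing delay.

    Illustrative helper (not part of ethtool): one interrupt per
    rx_usecs microseconds gives 10^6 / rx_usecs interrupts per second.
    """
    if rx_usecs <= 0:
        # 0 is often interpreted by drivers as "use adaptive mode",
        # so the formula does not apply.
        raise ValueError("rx-usecs must be a positive number of microseconds")
    return 1_000_000 / rx_usecs


# The three values discussed in this thread.
for usecs in (32, 20, 10):
    rate = max_interrupts_per_sec(usecs)
    print(f"rx-usecs={usecs:>2} -> about {rate:,.0f} interrupts/sec")
```
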
>> Goes a bit quicker (transfer up to 26 MB/s), but discards and PCI
>> stalls are still there.
>
> Why are you measuring in MBytes/sec?  (26 MBytes/sec equals 208 Mbit/s)

Yep, 208 Mbit/s.

>
> If you still have ethtool PHY-discards, then you still have a problem.
>
>> Ping times are noticeably improved:
>
> Okay, so this means these changes did have a positive effect.  So, this
> may be related to the OS not being activated fast enough by NIC
> interrupts.
>
>
>> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
>> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
>> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
>> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
>> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
>> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
>> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
>> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
>> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
>> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
>> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
>> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms
>
> Can you try to test if disabling TSO, GRO and GSO makes a difference?
>
>  ethtool -K eth4 gso off gro off tso off
>

I had a call yesterday with Mellanox and we added the following boot 
options: intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll

This completely solved the problem, but the machine now runs like a
heater and draws nearly 2x the power at the outlet.

I had no discards, excellent ping times during transfer (< 0.100 ms),
no outliers, and good transfer rates (> 50 MB/s).

So it seems to be related to C-state management in newer kernel versions
being too aggressive.
I would like to tune this a bit; maybe we can get some input on
which knobs to turn?
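
One knob I am aware of, as a runtime alternative to idle=poll, is the PM QoS
interface: a process that opens /dev/cpu_dma_latency and writes a signed
32-bit latency bound (in microseconds) asks the kernel to avoid idle states
with a longer exit latency for as long as the file stays open. Below is a
minimal Python sketch, assuming root privileges. The helper names and the
20 us value are only illustrative, and I have not tested it on this router.

```python
import os
import struct
from contextlib import contextmanager

PM_QOS_DEVICE = "/dev/cpu_dma_latency"


def encode_latency(max_latency_us: int) -> bytes:
    """Pack the latency bound as a native signed 32-bit integer."""
    if max_latency_us < 0:
        raise ValueError("latency bound must be >= 0 microseconds")
    return struct.pack("i", max_latency_us)


@contextmanager
def cpu_latency_limit(max_latency_us: int, device: str = PM_QOS_DEVICE):
    """Hold a CPU latency request for the duration of the with-block.

    The request is dropped automatically when the file descriptor is
    closed, so the previous power-saving behaviour returns afterwards.
    """
    fd = os.open(device, os.O_WRONLY)
    try:
        os.write(fd, encode_latency(max_latency_us))
        yield
    finally:
        os.close(fd)


# Usage (as root), e.g. wrapped around a transfer test:
#
#     with cpu_latency_limit(20):   # 20 us is an arbitrary example value
#         input("latency request active, press Enter to release ")
```

A bound of 0 should behave much like idle=poll in terms of power draw, so
the interesting experiment is finding the largest value that still avoids
the discards.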

I will read up here:
https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html#idle-states-representation
and in the related docs; I expect there will be a few helpful hints.
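
As a starting point for the tuning, the sysfs layout described in that
document can be read with a few lines of Python to see which idle states
exist and what their exit latencies are. This is a sketch under assumptions:
the function names are mine, it reads only the name, latency and disable
attributes, and I have not run it on the affected box.

```python
from pathlib import Path

CPU_SYSFS_ROOT = Path("/sys/devices/system/cpu")


def read_idle_states(cpu: int = 0, root: Path = CPU_SYSFS_ROOT) -> list:
    """Return the cpuidle states of one CPU, ordered by state index."""
    cpuidle_dir = Path(root) / f"cpu{cpu}" / "cpuidle"
    if not cpuidle_dir.is_dir():
        return []  # no cpuidle support exposed for this CPU
    state_dirs = sorted(
        cpuidle_dir.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for d in state_dirs:
        states.append({
            "index": int(d.name[len("state"):]),
            "name": (d / "name").read_text().strip(),
            # Exit latency in microseconds, as reported by the kernel.
            "latency_us": int((d / "latency").read_text()),
            "disabled": (d / "disable").read_text().strip() != "0",
        })
    return states


def states_deeper_than(states: list, max_latency_us: int) -> list:
    """Pick the states whose exit latency exceeds the given bound."""
    return [s for s in states if s["latency_us"] > max_latency_us]


# Usage on the router:
#
#     for s in read_idle_states(cpu=0):
#         print(s["index"], s["name"], s["latency_us"], s["disabled"])
```

Writing 1 to a state's disable attribute (as root) turns that state off for
that CPU, which would allow testing whether only the deepest states are
responsible for the discards.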

>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer