[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Toke Høiland-Jørgensen toke at toke.dk
Fri Nov 6 06:45:31 EST 2020


"Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> writes:

> On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
>
>> On Fri, 06 Nov 2020 10:18:10 +0100
>> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>>
>>>>> I just tested 5.9.4; it seems to also fix it partly. I have long
>>>>> stretches where it looks good, and then some increases again. (Stock
>>>>> 3.10 has them too, but not as high, rather 1-3 ms.)
>>>>>
>>
>> That you have long stretches where latency looks good is interesting
>> information. My theory is that your system has a periodic userspace
>> process that makes a kernel syscall that takes too long, blocking the
>> network card from processing packets. (Note it can also be a kernel
>> thread.)
>
> The weird part is, I first only updated router-02 and pinged to 
> router-04 (out of traffic flow), there I noticed these long stretches of 
> ok ping.
>
> When I also updated router-03 and router-04, the old behaviour more or
> less came back, which confused me.
>
> Could this be related to netlink? I have gobgpd running on these 
> routers, which injects routes via netlink.
> But the churn rate during the tests is very minimal, maybe 30 - 40 
> routes every second.
>
> Otherwise we got: salt-minion, collectd, node_exporter, sshd

collectd may be polling the interface stats; try turning that off?

>>
>> Another theory is that the NIC HW does strange things, but it is not
>> very likely. E.g. delaying the packets before generating the IRQ
>> interrupt, which hides it from my IRQ-to-softirq latency tool.
>>
>> A question: What traffic control qdisc are you using on your system?
>
> kernel 4+ uses pfifo, but there are no dropped packets.
> I have also tested with fq_codel: same behaviour, and also no weirdness
> in the packet queue itself.
>
> kernel 3.10 uses mq, and for the vlan interfaces noqueue

Do you mean that you only have a single pfifo qdisc on kernel 4+? Why is
it not using mq?
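
A rough sketch of one way to check which root qdisc each interface
actually has, assuming you paste in the text printed by `tc qdisc show`.
The helper name `root_qdiscs` and the sample text (including the interface
names) are made up for illustration; they are not output from the routers
in this thread.

```python
def root_qdiscs(tc_output):
    """Map device name (None if the line names no device) to its root qdisc kind.

    Expects text in the style printed by `tc qdisc show`, one qdisc per line.
    """
    roots = {}
    for line in tc_output.splitlines():
        tokens = line.split()
        # Only lines of the form "qdisc <kind> <handle> ... root ..." matter here.
        if len(tokens) < 3 or tokens[0] != "qdisc" or "root" not in tokens:
            continue
        dev = None
        if "dev" in tokens:
            idx = tokens.index("dev")
            if idx + 1 < len(tokens):
                dev = tokens[idx + 1]
        roots[dev] = tokens[1]
    return roots


# Illustrative sample only (hypothetical interfaces eth0 and eth1).
sample = """\
qdisc mq 0: dev eth0 root
qdisc pfifo_fast 0: dev eth0 parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo 8001: dev eth1 root refcnt 2 limit 1000p
"""

for dev, kind in root_qdiscs(sample).items():
    note = "" if kind == "mq" else "  <- single root qdisc, not mq"
    print(f"{dev}: {kind}{note}")
```

On a multi-queue NIC one would normally expect `mq` at the root with one
child qdisc per TX queue, so a lone `pfifo` at the root stands out.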

Was there anything in the ethtool stats?
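
If it helps, here is a minimal sketch for spotting which counters move
between two `ethtool -S` snapshots taken a few seconds apart. It assumes
the usual "name: value" layout. Counter names differ per driver, so the
names in the sample and the keyword filter are only assumptions, and the
numbers are invented.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S`-style "name: value" lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def changed_counters(before, after, keywords=("drop", "err", "miss", "discard")):
    """Return {counter: delta} for suspicious-looking counters that changed."""
    old, new = parse_ethtool_stats(before), parse_ethtool_stats(after)
    return {
        name: new[name] - old[name]
        for name in new
        if name in old
        and new[name] != old[name]
        and any(word in name.lower() for word in keywords)
    }


# Invented snapshots; real counter names depend on the NIC driver.
snap_before = """\
NIC statistics:
     rx_packets: 100
     rx_dropped: 0
     rx_missed_errors: 5
"""
snap_after = """\
NIC statistics:
     rx_packets: 900
     rx_dropped: 3
     rx_missed_errors: 5
"""

print(changed_counters(snap_before, snap_after))
```

Taking one snapshot during a good stretch and one during a latency spike
would show whether any drop or error counter lines up with the spikes.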

-Toke

