[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Jesper Dangaard Brouer brouer at redhat.com
Fri Nov 6 07:53:58 EST 2020


On Fri, 06 Nov 2020 12:45:31 +0100
Toke Høiland-Jørgensen <toke at toke.dk> wrote:

> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> writes:
> 
> > On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
> >  
> >> On Fri, 06 Nov 2020 10:18:10 +0100
> >> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
> >>  
> >>>>> I just tested 5.9.4; it also seems to fix it partly. I have long
> >>>>> stretches where it looks good, and then some increases again. (3.10
> >>>>> stock has them too, but not so high, rather 1-3 ms)
> >>>>>  
> >>
> >> That you have long stretches where latency looks good is interesting
> >> information.   My theory is that your system has a periodic userspace
> >> process that does a kernel syscall that takes too long, blocking the
> >> network card from processing packets. (Note it can also be a kernel
> >> thread).  
> >
[...]
> >
> > Could this be related to netlink? I have gobgpd running on these 
> > routers, which injects routes via netlink.
> > But the churn rate during the tests is very minimal, maybe 30 - 40 
> > routes every second.

Yes, this could be related.  The internal data structure for FIB
lookups is fib_trie, which is a compressed Patricia tree, related to
the radix tree idea.  Thus, I can imagine that the kernel has to
rebuild/rebalance the trie with all these updates.
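
To make the data-structure point concrete, here is a minimal sketch of
longest-prefix matching in a plain binary trie.  This is NOT the kernel
implementation: it does no path or level compression and no rebalancing,
and the names (Node, BinaryTrie, nodes_created) are hypothetical and
IPv4-only.  It only illustrates that every route insert walks and may
extend the structure that lookups depend on.

```python
import ipaddress


class Node:
    """One trie node: two children (bit 0 / bit 1) and an optional route."""
    __slots__ = ("children", "nexthop")

    def __init__(self):
        self.children = [None, None]
        self.nexthop = None


class BinaryTrie:
    """Uncompressed binary trie for IPv4 prefixes (illustration only)."""

    def __init__(self):
        self.root = Node()
        self.nodes_created = 1  # rough measure of structural work done

    def insert(self, prefix, nexthop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        # Walk one bit per prefix-length step, creating nodes as needed.
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = Node()
                self.nodes_created += 1
            node = node.children[bit]
        node.nexthop = nexthop

    def lookup(self, addr):
        bits = int(ipaddress.ip_address(addr))
        node = self.root
        best = node.nexthop  # a default route, if one was inserted
        for i in range(32):
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
            if node.nexthop is not None:
                best = node.nexthop  # remember the longest match so far
        return best


if __name__ == "__main__":
    fib = BinaryTrie()
    fib.insert("10.0.0.0/8", "A")
    fib.insert("10.1.0.0/16", "B")
    print(fib.lookup("10.1.2.3"))  # B (the /16 is the longest match)
    print(fib.lookup("10.2.0.1"))  # A (only the /8 matches)
```

A compressed trie avoids the long single-child chains this sketch
creates, at the cost of having to restructure nodes on updates, which is
the kind of work I am speculating about above.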

> >
> > Otherwise we got: salt-minion, collectd, node_exporter, sshd  
> 
> collectd may be polling the interface stats; try turning that off?

It should be fairly easy for you to test whether any of these
services (except sshd) is causing this, by turning them off
one at a time.
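
To compare runs, something like the following rough sketch could help.
It assumes you save the output of a Linux iputils-style "ping" run once
as a baseline and once with a single service stopped, then compare the
summaries.  The function names parse_rtts and summarize are made up for
this example, and the regex assumes the usual "time=0.321 ms" per-reply
format.

```python
import re
import statistics

# Matches the per-reply RTT, e.g. "time=0.321 ms" or "time<1 ms".
# The summary line ("... packet loss, time 2003ms") is deliberately
# not matched because it has no "=" or "<" after "time".
RTT_RE = re.compile(r"time[=<]([0-9.]+) ?ms")


def parse_rtts(ping_output):
    """Return the list of per-reply RTTs (in ms) found in ping output."""
    return [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]


def summarize(rtts):
    """Return count, median and max; the max is what exposes the spikes."""
    ordered = sorted(rtts)
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python3 rtt.py < ping_baseline.txt
    print(summarize(parse_rtts(sys.stdin.read())))
```

If the max drops back towards the median with one particular service
stopped, that service is a good suspect.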


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


