[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Thomas Rosenstein thomas.rosenstein at creamfinance.com
Sat Nov 7 07:37:01 EST 2020



On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote:

> On Fri, 06 Nov 2020 18:04:49 +0100
> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>
>> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:
>>
>>
>> I'm using ping on IPv4, but I'll try to see if IPv6 makes any
>> difference!
>
> I think you misunderstand me.  I'm not asking you to use ping6. The
> gobgpd daemon updates will update both IPv4 and IPv6 routes, right?
> Updating IPv6 routes is more problematic than updating IPv4 routes.
> The IPv6 route table update can potentially stall softirq from running,
> which is what the latency tool was measuring... and it did show some
> outliers.

Yes, I did misunderstand. I assumed the latency would be introduced in
the traffic path by the lock.
Nonetheless, I tested it and saw no difference :)

>
>
>>> Have you tried to use 'perf record' to observe what is happening on
>>> the system while these latency incidents happen?  (let me know if you
>>> want some cmdline hints)
>>
>> Haven't tried this yet. If you have some hints what events to monitor
>> I'll take them!
>
> Okay, to record everything (-a) on the system and save the call-graph
> (-g), running for 5 seconds (by profiling the sleep command):
>
>  # perf record -g -a  sleep 5
>
> To view the result, simply use 'perf report', but you likely
> want to use the option --no-children as you are profiling the kernel (and
> not a userspace program you want to have grouped 'children' by).  I
> also include the CPU column via '--sort cpu,comm,dso,symbol' and you
> can select/zoom-in-on a specific CPU via '-C zero-indexed-cpu-num'.
>
>  # perf report --sort cpu,comm,dso,symbol --no-children
>
> When we ask you to provide the output, you can use the --stdio option,
> and provide txt-info via a pastebin link as it is very long.

Here is the output from kernel 3.10_1127 (I updated to the very latest
in that branch):  https://pastebin.com/5mxirXPw
Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F

I have noticed that the delays are directly related to the traffic 
flows, see below.

These tests are WITHOUT gobgpd running, so no updates to the route 
table, but the route tables are fully populated.
Also, it's ONLY outgoing traffic; the return packets come in via
another router.

I then cleared the routing tables, and the issue persists even though
the table now has only 78 entries.

40 threads -> sometimes higher RTTs: https://pastebin.com/Y9nd0h4h
60 threads -> always high RTTs: https://pastebin.com/JFvhtLrH

So it definitely gets worse the more connections there are.
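
To compare runs like these more objectively, the RTTs from saved ping
output can be summarized with a small script. This is only a minimal
sketch: it assumes Linux iputils-style reply lines containing
"time=<value> ms", and the function name, the 1 ms "slow" threshold and
the sample lines are made up for illustration (they are not taken from
the pastebins above).

```python
import re
import statistics

# Matches the per-reply RTT that iputils ping prints, e.g. "time=0.215 ms".
RTT_RE = re.compile(r"time=([0-9.]+) ms")


def summarize_rtts(ping_output, threshold_ms=1.0):
    """Return basic RTT statistics for one saved ping run.

    threshold_ms is an arbitrary, hypothetical cut-off for counting a
    reply as "slow"; choose a value that fits the link being tested.
    """
    rtts = [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]
    if not rtts:
        raise ValueError("no RTT samples found in ping output")
    return {
        "count": len(rtts),
        "min": min(rtts),
        "median": statistics.median(rtts),
        "max": max(rtts),
        "slow": sum(1 for r in rtts if r > threshold_ms),
    }


# Made-up sample lines for illustration only.
sample = """\
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.198 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=12.4 ms
"""
print(summarize_rtts(sample))
```

Running it once per saved ping log (for example one log per thread
count) gives numbers that can be compared side by side instead of
eyeballing the raw output.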

I have also tried to reproduce the issue with the kernel on a Hyper-V
virtual machine; there I don't see any adverse effects.
But the setup is not 100% the same, since MASQ happens on it. I will
restructure it a bit to get a comparable configuration.

I also suspected that -j NOTRACK might be an issue and removed that
too: no change. (The routing is asymmetric anyway.)

Additionally, I quit all applications except sshd: no change!



>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

