[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
Thomas Rosenstein
thomas.rosenstein at creamfinance.com
Wed Nov 4 10:23:12 EST 2020
Hi all,
I'm coming from the lartc mailing list, here's the original text:
=====
I have multiple routers which connect to multiple upstream providers. I
have noticed a high latency shift in ICMP (and generally all connections)
if I run b2 upload-file --threads 40, and I can reproduce this.
What options do I have to analyze why this happens?
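For reproduction I do roughly the following; the neighbour address, bucket
and file names here are placeholders, not the real values:

# terminal 1: watch latency to a neighbouring router while the upload runs
ping <neighbour-router>

# terminal 2: start the upload that triggers the latency shift
b2 upload-file --threads 40 <bucket> <local-file> <remote-name>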
General Info:
Routers are connected to each other with 10G Mellanox Connect-X cards via
10G SFP+ DAC cables through a 10G switch from fs.com
Latency is generally around 0.18 ms between all four routers.
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate
port, same network card)
All routers have the full internet routing tables, i.e. 80k entries for
IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz
96 GB RAM
No Swap
CentOS 7
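The baseline numbers above come from straightforward tests along these
lines (the neighbour address is a placeholder):

# latency between directly connected routers (~0.18 ms)
ping -c 100 <neighbour-router>

# throughput between routers (~9.4 Gbit/s, 0 retransmissions)
iperf3 -c <neighbour-router> -t 30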
During high latency:
Latency on the routers carrying the traffic flow increases to 12 - 20 ms
on all interfaces; moving the stream (by disabling the BGP session) also
moves the high latency.
iperf3 performance plummets to 300 - 400 Mbit/s.
CPU load (user / system) is around 0.1%.
RAM usage is around 3 - 4 GB.
The if_packets count is stable (around 8000 pkt/s more).
For b2 upload-file with 10 threads I can achieve 60 MB/s consistently;
with 40 threads the performance drops to 8 MB/s.
I do not believe that 40 TCP streams should be any problem for a machine
of that size.
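The CPU figure above is an aggregate; to rule out a single core being
saturated by softirq work, I can sample per-CPU utilisation while the
latency is high, e.g. (mpstat comes from the sysstat package):

# per-CPU utilisation incl. %soft (softirq time), one-second samples
mpstat -P ALL 1

# softirq counts per CPU, highlighting changes
watch -d cat /proc/softirqs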
Thanks for any ideas, help, pointers, or additional things I can verify /
check / provide!
=======
So far I have tested:
1) Using the stock kernel 3.10.0-541 -> the issue does not happen
2) Set up fq_codel on the interfaces (see the sketch of the tc invocation
after the output below):
Here is the tc -s qdisc output:
qdisc fq_codel 8005: dev eth4 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8374229144 bytes 10936167 pkt (dropped 0, overlimits 0 requeues 6127)
 backlog 0b 0p requeues 6127
 maxpacket 25398 drop_overlimit 0 new_flow_count 15441 ecn_mark 0 new_flows_len 0 old_flows_len 0
qdisc fq_codel 8008: dev eth5 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1072480080 bytes 1012973 pkt (dropped 0, overlimits 0 requeues 735)
 backlog 0b 0p requeues 735
 maxpacket 19682 drop_overlimit 0 new_flow_count 15963 ecn_mark 0 new_flows_len 0 old_flows_len 0
qdisc fq_codel 8004: dev eth4.2300 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8441021899 bytes 11021070 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 maxpacket 68130 drop_overlimit 0 new_flow_count 257055 ecn_mark 0 new_flows_len 0 old_flows_len 0
qdisc fq_codel 8006: dev eth5.2501 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 571984459 bytes 2148377 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 maxpacket 7570 drop_overlimit 0 new_flow_count 11300 ecn_mark 0 new_flows_len 0 old_flows_len 0
qdisc fq_codel 8007: dev eth5.2502 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1401322222 bytes 1966724 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 maxpacket 19682 drop_overlimit 0 new_flow_count 76653 ecn_mark 0 new_flows_len 0 old_flows_len 0
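The qdiscs above use the fq_codel defaults and were attached along these
lines (a sketch of the invocation, repeated per physical and VLAN
interface; the exact commands may have differed):

# replace the root qdisc with fq_codel using default parameters
tc qdisc replace dev eth4 root fq_codel
tc qdisc replace dev eth5 root fq_codel
tc qdisc replace dev eth4.2300 root fq_codel
tc qdisc replace dev eth5.2501 root fq_codel
tc qdisc replace dev eth5.2502 root fq_codel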
I have no statistics / metrics that would point to a slowdown on the
server; CPU / load / network / packets / memory all show normal, very low
load.
Are there other (hidden) metrics I can collect to analyze this issue
further? A few candidates I can already think of are sketched below.
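Lower-level counters that might show something while the latency is high
(the exact NIC statistic names depend on the driver):

# per-queue / driver statistics of the Mellanox NICs
ethtool -S eth4

# drops and time squeezes in the softirq receive path, one line per CPU
cat /proc/net/softnet_stat

# where the kernel is actually spending cycles
perf top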
Thanks
Thomas