General list for discussing Bufferbloat
* [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
@ 2020-11-04 15:23 Thomas Rosenstein
  2020-11-04 16:10 ` Toke Høiland-Jørgensen
  2020-11-16 12:34 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-04 15:23 UTC (permalink / raw)
  To: bloat

Hi all,

I'm coming from the lartc mailing list, here's the original text:

=====

I have multiple routers which connect to multiple upstream providers. I
have noticed a high latency shift in ICMP (and generally all connections)
when I run b2 upload-file --threads 40 (and I can reproduce this).

What options do I have to analyze why this happens?

General Info:

Routers are connected to each other with 10G Mellanox ConnectX cards via
10G SFP+ DAC cables through a 10G switch from fs.com.
Latency is generally around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate 
port, same network card)
All routers have the full internet routing tables, i.e. 80k entries for 
IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz
96 GB RAM
No Swap
CentOS 7

During high latency:

Latency on the routers carrying the traffic flow increases to 12 - 20 ms
on all interfaces; moving the stream (via a BGP session disable) also
moves the high latency.
iperf3 performance plummets to 300 - 400 Mbit/s.
CPU load (user / system) is around 0.1%.
RAM usage is around 3 - 4 GB.
if_packets count is stable (around 8000 pkt/s more)


For b2 upload-file with 10 threads I can achieve 60 MB/s consistently;
with 40 threads the performance drops to 8 MB/s.

I do not believe that 40 tcp streams should be any problem for a machine 
of that size.

Thanks for any ideas, help, pointers, or additional things I can
verify / check / provide!

=======


So far I have tested:

1) Use Stock Kernel 3.10.0-541 -> issue does not happen
2) Set up fq_codel on the interfaces:
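
The exact commands aren't included here; a minimal sketch, using the
interface names from the output below, would be:

  tc qdisc replace dev eth4 root fq_codel
  tc qdisc replace dev eth5 root fq_codel
  tc qdisc replace dev eth4.2300 root fq_codel
  tc qdisc replace dev eth5.2501 root fq_codel
  tc qdisc replace dev eth5.2502 root fq_codel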

Here is the tc -s qdisc output:

qdisc fq_codel 8005: dev eth4 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
  Sent 8374229144 bytes 10936167 pkt (dropped 0, overlimits 0 requeues 6127)
  backlog 0b 0p requeues 6127
   maxpacket 25398 drop_overlimit 0 new_flow_count 15441 ecn_mark 0
   new_flows_len 0 old_flows_len 0
qdisc fq_codel 8008: dev eth5 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
  Sent 1072480080 bytes 1012973 pkt (dropped 0, overlimits 0 requeues 735)
  backlog 0b 0p requeues 735
   maxpacket 19682 drop_overlimit 0 new_flow_count 15963 ecn_mark 0
   new_flows_len 0 old_flows_len 0
qdisc fq_codel 8004: dev eth4.2300 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
  Sent 8441021899 bytes 11021070 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
   maxpacket 68130 drop_overlimit 0 new_flow_count 257055 ecn_mark 0
   new_flows_len 0 old_flows_len 0
qdisc fq_codel 8006: dev eth5.2501 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
  Sent 571984459 bytes 2148377 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
   maxpacket 7570 drop_overlimit 0 new_flow_count 11300 ecn_mark 0
   new_flows_len 0 old_flows_len 0
qdisc fq_codel 8007: dev eth5.2502 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
  Sent 1401322222 bytes 1966724 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
   maxpacket 19682 drop_overlimit 0 new_flow_count 76653 ecn_mark 0
   new_flows_len 0 old_flows_len 0


I have no statistics / metrics that would point to a slowdown on the
server; CPU / load / network / packets / memory all show normal, very low
load.
Are there other (hidden) metrics I can collect to analyze this issue
further?

Thanks
Thomas






* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-04 15:23 [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60 Thomas Rosenstein
@ 2020-11-04 16:10 ` Toke Høiland-Jørgensen
  2020-11-04 16:24   ` Thomas Rosenstein
  2020-11-16 12:34 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-04 16:10 UTC (permalink / raw)
  To: Thomas Rosenstein, bloat

Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> writes:

> Hi all,
>
> I'm coming from the lartc mailing list, here's the original text:
>
> =====
>
> I have multiple routers which connect to multiple upstream providers, I 
> have noticed a high latency shift in icmp (and generally all connection) 
> if I run b2 upload-file --threads 40 (and I can reproduce this)
>
> What options do I have to analyze why this happens?
>
> General Info:
>
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 Ghz
> 96 GB RAM
> No Swap
> CentOs 7
>
> During high latency:
>
> Latency on routers which have the traffic flow increases to 12 - 20 ms, 
> for all interfaces, moving of the stream (via bgp disable session) moves 
> also the high latency
> iperf3 performance plumets to 300 - 400 MBits
> CPU load (user / system) are around 0.1%
> Ram Usage is around 3 - 4 GB
> if_packets count is stable (around 8000 pkt/s more)

I'm not sure I get your topology. Packets are going from where to where,
and what link is the bottleneck for the transfer you're doing? Are you
measuring the latency along the same path?

Have you tried running 'mtr' to figure out which hop the latency is at?
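
A report-mode run while the upload is going should be enough to show
where the delay is added; something like (destination is a placeholder):

  mtr -rwc 100 <destination>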

> Here is the tc -s qdisc output:

This indicates ("dropped 0" and "ecn_mark 0") that there's no
backpressure on the qdisc, so something else is going on.

Also, you said the issue goes away if you downgrade the kernel? That
does sound odd...

-Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-04 16:10 ` Toke Høiland-Jørgensen
@ 2020-11-04 16:24   ` Thomas Rosenstein
  2020-11-05  0:10     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-04 16:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: bloat



On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:

> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> writes:
>
>> Hi all,
>>
>> I'm coming from the lartc mailing list, here's the original text:
>>
>> =====
>>
>> I have multiple routers which connect to multiple upstream providers, 
>> I
>> have noticed a high latency shift in icmp (and generally all 
>> connection)
>> if I run b2 upload-file --threads 40 (and I can reproduce this)
>>
>> What options do I have to analyze why this happens?
>>
>> General Info:
>>
>> Routers are connected between each other with 10G Mellanox Connect-X
>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>> Latency generally is around 0.18 ms between all routers (4).
>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with 
>> iperf3.
>> 2 of the 4 routers are connected upstream with a 1G connection 
>> (separate
>> port, same network card)
>> All routers have the full internet routing tables, i.e. 80k entries 
>> for
>> IPv6 and 830k entries for IPv4
>> Conntrack is disabled (-j NOTRACK)
>> Kernel 5.4.60 (custom)
>> 2x Xeon X5670 @ 2.93 Ghz
>> 96 GB RAM
>> No Swap
>> CentOs 7
>>
>> During high latency:
>>
>> Latency on routers which have the traffic flow increases to 12 - 20 
>> ms,
>> for all interfaces, moving of the stream (via bgp disable session) 
>> moves
>> also the high latency
>> iperf3 performance plumets to 300 - 400 MBits
>> CPU load (user / system) are around 0.1%
>> Ram Usage is around 3 - 4 GB
>> if_packets count is stable (around 8000 pkt/s more)
>
> I'm not sure I get you topology. Packets are going from where to 
> where,
> and what link is the bottleneck for the transfer you're doing? Are you
> measuring the latency along the same path?
>
> Have you tried running 'mtr' to figure out which hop the latency is 
> at?

I tried to draw the topology, I hope this is okay and explains better
what's happening:

https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

There is definitely no bottleneck in any of the links; the maximum on any
link is 16k packets/sec and around 300 Mbit/s.
In the iperf3 tests I can easily get up to 9.4 Gbit/s.

So it must be something in the kernel tacking on a delay; I could try to
do a bisect and build like 10 kernels :)

>
>> Here is the tc -s qdisc output:
>
> This indicates ("dropped 0" and "ecn_mark 0") that there's no
> backpressure on the qdisc, so something else is going on.
>
> Also, you said the issue goes away if you downgrade the kernel? That
> does sound odd...

Yes, indeed. I have only recently upgraded the kernel to 5.4.60 and
hadn't had the issue before.

>
> -Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-04 16:24   ` Thomas Rosenstein
@ 2020-11-05  0:10     ` Toke Høiland-Jørgensen
  2020-11-05  8:48       ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-05  0:10 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: bloat

"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:

> On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:
>
>> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> writes:
>>
>>> Hi all,
>>>
>>> I'm coming from the lartc mailing list, here's the original text:
>>>
>>> =====
>>>
>>> I have multiple routers which connect to multiple upstream providers, 
>>> I
>>> have noticed a high latency shift in icmp (and generally all 
>>> connection)
>>> if I run b2 upload-file --threads 40 (and I can reproduce this)
>>>
>>> What options do I have to analyze why this happens?
>>>
>>> General Info:
>>>
>>> Routers are connected between each other with 10G Mellanox Connect-X
>>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>>> Latency generally is around 0.18 ms between all routers (4).
>>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with 
>>> iperf3.
>>> 2 of the 4 routers are connected upstream with a 1G connection 
>>> (separate
>>> port, same network card)
>>> All routers have the full internet routing tables, i.e. 80k entries 
>>> for
>>> IPv6 and 830k entries for IPv4
>>> Conntrack is disabled (-j NOTRACK)
>>> Kernel 5.4.60 (custom)
>>> 2x Xeon X5670 @ 2.93 Ghz
>>> 96 GB RAM
>>> No Swap
>>> CentOs 7
>>>
>>> During high latency:
>>>
>>> Latency on routers which have the traffic flow increases to 12 - 20 
>>> ms,
>>> for all interfaces, moving of the stream (via bgp disable session) 
>>> moves
>>> also the high latency
>>> iperf3 performance plumets to 300 - 400 MBits
>>> CPU load (user / system) are around 0.1%
>>> Ram Usage is around 3 - 4 GB
>>> if_packets count is stable (around 8000 pkt/s more)
>>
>> I'm not sure I get you topology. Packets are going from where to 
>> where,
>> and what link is the bottleneck for the transfer you're doing? Are you
>> measuring the latency along the same path?
>>
>> Have you tried running 'mtr' to figure out which hop the latency is 
>> at?
>
> I tried to draw the topology, I hope this is okay and explains betters 
> what's happening:
>
> https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

Ohh, right, you're pinging between two of the routers across a 10 Gbps
link with plenty of capacity to spare, and *that* goes up by two orders
of magnitude when you start the transfer, even though the transfer
itself is <1Gbps? Am I understanding you correctly now?

If so, this sounds more like a driver issue, or maybe something to do
with scheduling. Does it only happen with ICMP? You could try this tool
for a userspace UDP measurement:

https://github.com/heistp/irtt/
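
A minimal sketch (interval and duration are just examples):

  irtt server                              # on the far router
  irtt client -i 10ms -d 30s <far-router>  # from the near side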

Also, what happens if you ping a host on the internet (*through* the
router instead of *to* it)?

And which version of the Connect-X cards are you using (or rather, which
driver? mlx4?)

> So it must be something in the kernel tacking on a delay, I could try to 
> do a bisect and build like 10 kernels :)

That may ultimately end up being necessary. However, when you say 'stock
kernel' you mean what CentOS ships, right? If so, that's not really a
3.10 kernel - the RHEL kernels (that CentOS is based on) are... somewhat
creative... about their versioning. So if you've switched to a vanilla
upstream kernel you may find bisecting difficult :/

How did you configure the new kernel? Did you start from scratch, or is
it based on the old centos config?

-Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05  0:10     ` Toke Høiland-Jørgensen
@ 2020-11-05  8:48       ` Thomas Rosenstein
  2020-11-05 11:21         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-05  8:48 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: bloat



On 5 Nov 2020, at 1:10, Toke Høiland-Jørgensen wrote:

> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>
>> On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:
>>
>>> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> writes:
>>>
>>>> Hi all,
>>>>
>>>> I'm coming from the lartc mailing list, here's the original text:
>>>>
>>>> =====
>>>>
>>>> I have multiple routers which connect to multiple upstream 
>>>> providers,
>>>> I
>>>> have noticed a high latency shift in icmp (and generally all
>>>> connection)
>>>> if I run b2 upload-file --threads 40 (and I can reproduce this)
>>>>
>>>> What options do I have to analyze why this happens?
>>>>
>>>> General Info:
>>>>
>>>> Routers are connected between each other with 10G Mellanox 
>>>> Connect-X
>>>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>>>> Latency generally is around 0.18 ms between all routers (4).
>>>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with
>>>> iperf3.
>>>> 2 of the 4 routers are connected upstream with a 1G connection
>>>> (separate
>>>> port, same network card)
>>>> All routers have the full internet routing tables, i.e. 80k entries
>>>> for
>>>> IPv6 and 830k entries for IPv4
>>>> Conntrack is disabled (-j NOTRACK)
>>>> Kernel 5.4.60 (custom)
>>>> 2x Xeon X5670 @ 2.93 Ghz
>>>> 96 GB RAM
>>>> No Swap
>>>> CentOs 7
>>>>
>>>> During high latency:
>>>>
>>>> Latency on routers which have the traffic flow increases to 12 - 20
>>>> ms,
>>>> for all interfaces, moving of the stream (via bgp disable session)
>>>> moves
>>>> also the high latency
>>>> iperf3 performance plumets to 300 - 400 MBits
>>>> CPU load (user / system) are around 0.1%
>>>> Ram Usage is around 3 - 4 GB
>>>> if_packets count is stable (around 8000 pkt/s more)
>>>
>>> I'm not sure I get you topology. Packets are going from where to
>>> where,
>>> and what link is the bottleneck for the transfer you're doing? Are 
>>> you
>>> measuring the latency along the same path?
>>>
>>> Have you tried running 'mtr' to figure out which hop the latency is
>>> at?
>>
>> I tried to draw the topology, I hope this is okay and explains 
>> betters
>> what's happening:
>>
>> https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing
>
> Ohh, right, you're pinging between two of the routers across a 10 Gbps
> link with plenty of capacity to spare, and *that* goes up by two 
> orders
> of magnitude when you start the transfer, even though the transfer
> itself is <1Gbps? Am I understanding you correctly now?

Exactly :)

>
> If so, this sounds more like a driver issue, or maybe something to do
> with scheduling. Does it only happen with ICMP? You could try this 
> tool
> for a userspace UDP measurement:

It happens with all packets, therefore the transfer to backblaze with 40 
threads goes down to ~8MB/s instead of >60MB/s

>
> https://github.com/heistp/irtt/
>

I'll try what that reports!

> Also, what happens if you ping a host on the internet (*through* the
> router instead of *to* it)?

Same issue, but twice as pronounced, as it seems all interfaces are
affected.
So, ping on one interface and the second has the issue.
Also, all traffic across the host has the issue, but on both sides, so
ping to the internet increased by 2x.

>
> And which version of the Connect-X cards are you using (or rather, 
> which
> driver? mlx4?)
>

It's ConnectX-4 Lx cards, specifically: MCX4121A-ACAT
Driver is mlx5_core
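
For reference, a quick way to double-check which driver is bound (the
interface name is a placeholder):

  ethtool -i eth4                # prints driver, version, firmware-version
  lspci -k | grep -A3 Mellanox   # kernel driver in use, per PCI device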

>> So it must be something in the kernel tacking on a delay, I could try 
>> to
>> do a bisect and build like 10 kernels :)
>
> That may ultimately end up being necessary. However, when you say 
> 'stock
> kernel' you mean what CentOS ships, right? If so, that's not really a
> 3.10 kernel - the RHEL kernels (that centos is based on) are... 
> somewhat
> creative... about their versioning. So if you're switched to a vanilla
> upstream kernel you may find bisecting difficult :/

Yep, the default that CentOS ships. I just tested 4.12.5; there the issue
also does not happen. So I guess I can bisect it then... (really don't
want to 😃)

>
> How did you configure the new kernel? Did you start from scratch, or 
> is
> it based on the old centos config?

First oldconfig, and from there I then added additional options for IB,
NVMe, etc. (which I don't really need on the routers)
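
Roughly this, as a sketch (the config path is the usual location for the
running CentOS kernel and may differ):

  cp /boot/config-$(uname -r) .config
  make olddefconfig   # take defaults for new options ('make oldconfig' to be prompted)
  make menuconfig     # then enable IB, NVMe, ...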

>
> -Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05  8:48       ` Thomas Rosenstein
@ 2020-11-05 11:21         ` Toke Høiland-Jørgensen
  2020-11-05 12:22           ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-05 11:21 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: bloat

"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:

>> If so, this sounds more like a driver issue, or maybe something to do
>> with scheduling. Does it only happen with ICMP? You could try this 
>> tool
>> for a userspace UDP measurement:
>
> It happens with all packets, therefore the transfer to backblaze with 40 
> threads goes down to ~8MB/s instead of >60MB/s

Huh, right, definitely sounds like a kernel bug; or maybe the new kernel
is getting the hardware into a state where it bugs out when there are
lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while running
the test and see if any error counters go up. Here's a handy script to
monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
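
A rough manual alternative, if you don't want to run the script (the
interface name is a placeholder):

  watch -d -n1 "ethtool -S eth4 | egrep -i 'err|drop|disc|fifo'"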

> I'll try what that reports!
>
>> Also, what happens if you ping a host on the internet (*through* the
>> router instead of *to* it)?
>
> Same issue, but twice pronounced, as it seems all interfaces are 
> affected.
> So, ping on one interface and the second has the issue.
> Also all traffic across the host has the issue, but on both sides, so 
> ping to the internet increased by 2x

Right, so even an unloaded interface suffers? But this is the same NIC,
right? So it could still be a hardware issue...

> Yep default that CentOS ships, I just tested 4.12.5 there the issue also 
> does not happen. So I guess I can bisect it then...(really don't want to 
> 😃)

Well that at least narrows it down :)

>>
>> How did you configure the new kernel? Did you start from scratch, or 
>> is
>> it based on the old centos config?
>
> first oldconfig and from there then added additional options for IB, 
> NVMe, etc (which I don't really need on the routers)

OK, so you're probably building with roughly the same options in terms
of scheduling granularity etc. That's good. Did you enable spectre
mitigations etc on the new kernel? What's the output of
`tail /sys/devices/system/cpu/vulnerabilities/*` ?

-Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05 11:21         ` Toke Høiland-Jørgensen
@ 2020-11-05 12:22           ` Thomas Rosenstein
  2020-11-05 12:38             ` Toke Høiland-Jørgensen
  2020-11-05 13:33             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-05 12:22 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: bloat



On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:

> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>
>>> If so, this sounds more like a driver issue, or maybe something to 
>>> do
>>> with scheduling. Does it only happen with ICMP? You could try this
>>> tool
>>> for a userspace UDP measurement:
>>
>> It happens with all packets, therefore the transfer to backblaze with 
>> 40
>> threads goes down to ~8MB/s instead of >60MB/s
>
> Huh, right, definitely sounds like a kernel bug; or maybe the new 
> kernel
> is getting the hardware into a state where it bugs out when there are
> lots of flows or something.
>
> You could try looking at the ethtool stats (ethtool -S) while running
> the test and see if any error counters go up. Here's a handy script to
> monitor changes in the counters:
>
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>
>> I'll try what that reports!
>>
>>> Also, what happens if you ping a host on the internet (*through* the
>>> router instead of *to* it)?
>>
>> Same issue, but twice pronounced, as it seems all interfaces are
>> affected.
>> So, ping on one interface and the second has the issue.
>> Also all traffic across the host has the issue, but on both sides, so
>> ping to the internet increased by 2x
>
> Right, so even an unloaded interface suffers? But this is the same 
> NIC,
> right? So it could still be a hardware issue...
>
>> Yep default that CentOS ships, I just tested 4.12.5 there the issue 
>> also
>> does not happen. So I guess I can bisect it then...(really don't want 
>> to
>> 😃)
>
> Well that at least narrows it down :)

I just tested 5.9.4; it seems to also fix it partly. I have long
stretches where it looks good, and then some increases again. (3.10 stock
has them too, but not as high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms


To me it now looks like there was some fix between 5.4.60 and 5.9.4 ...
can anyone pinpoint it?

>
>>>
>>> How did you configure the new kernel? Did you start from scratch, or
>>> is
>>> it based on the old centos config?
>>
>> first oldconfig and from there then added additional options for IB,
>> NVMe, etc (which I don't really need on the routers)
>
> OK, so you're probably building with roughly the same options in terms
> of scheduling granularity etc. That's good. Did you enable spectre
> mitigations etc on the new kernel? What's the output of
> `tail /sys/devices/system/cpu/vulnerabilities/*` ?

mitigations are off

==> /sys/devices/system/cpu/vulnerabilities/itlb_multihit <==
KVM: Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/l1tf <==
Mitigation: PTE Inversion; VMX: vulnerable

==> /sys/devices/system/cpu/vulnerabilities/mds <==
Vulnerable; SMT vulnerable

==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spec_store_bypass <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Vulnerable: __user pointer sanitization and usercopy barriers only; no 
swapgs barriers

==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Vulnerable, STIBP: disabled

==> /sys/devices/system/cpu/vulnerabilities/srbds <==
Not affected

==> /sys/devices/system/cpu/vulnerabilities/tsx_async_abort <==
Not affected


Grub Boot options are: crashkernel=896M rd.lvm.lv=cl/root net.ifnames=0 
biosdevname=0 scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=y mitigations=off 
console=tty0 console=ttyS1,115200

>
> -Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05 12:22           ` Thomas Rosenstein
@ 2020-11-05 12:38             ` Toke Høiland-Jørgensen
  2020-11-05 12:41               ` Thomas Rosenstein
  2020-11-05 13:33             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-05 12:38 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: bloat

"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:

> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>
>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>>
>>>> If so, this sounds more like a driver issue, or maybe something to 
>>>> do
>>>> with scheduling. Does it only happen with ICMP? You could try this
>>>> tool
>>>> for a userspace UDP measurement:
>>>
>>> It happens with all packets, therefore the transfer to backblaze with 
>>> 40
>>> threads goes down to ~8MB/s instead of >60MB/s
>>
>> Huh, right, definitely sounds like a kernel bug; or maybe the new 
>> kernel
>> is getting the hardware into a state where it bugs out when there are
>> lots of flows or something.
>>
>> You could try looking at the ethtool stats (ethtool -S) while running
>> the test and see if any error counters go up. Here's a handy script to
>> monitor changes in the counters:
>>
>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>
>>> I'll try what that reports!
>>>
>>>> Also, what happens if you ping a host on the internet (*through* the
>>>> router instead of *to* it)?
>>>
>>> Same issue, but twice pronounced, as it seems all interfaces are
>>> affected.
>>> So, ping on one interface and the second has the issue.
>>> Also all traffic across the host has the issue, but on both sides, so
>>> ping to the internet increased by 2x
>>
>> Right, so even an unloaded interface suffers? But this is the same 
>> NIC,
>> right? So it could still be a hardware issue...
>>
>>> Yep default that CentOS ships, I just tested 4.12.5 there the issue 
>>> also
>>> does not happen. So I guess I can bisect it then...(really don't want 
>>> to
>>> 😃)
>>
>> Well that at least narrows it down :)
>
> I just tested 5.9.4 seems to also fix it partly, I have long stretches 
> where it looks good, and then some increases again. (3.10 Stock has them 
> too, but not so high, rather 1-3 ms)
>
> for example:
>
> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>
> and then again:
>
> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>
>
> For me it looks now that there was some fix between 5.4.60 and 5.9.4 ... 
> anyone can pinpoint it?

$ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
72932

Only 73k commits; should be easy, right? :)

(In other words no, I have no idea; I'd suggest either (a) asking on
netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
knowing).
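
If you do end up bisecting, a rough sketch for hunting a fix rather than a
regression (assuming the behaviour reproduces on the mainline tags; the
stable point releases live on separate branches, so bisecting between
mainline v5.4 and v5.9 is simpler):

  git bisect start --term-old=broken --term-new=fixed
  git bisect broken v5.4
  git bisect fixed v5.9
  # build + boot each suggested commit, re-run the ping test, then mark it:
  git bisect broken    # spikes still there
  git bisect fixed     # latency back to ~0.2 ms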

>>>> How did you configure the new kernel? Did you start from scratch, or
>>>> is
>>>> it based on the old centos config?
>>>
>>> first oldconfig and from there then added additional options for IB,
>>> NVMe, etc (which I don't really need on the routers)
>>
>> OK, so you're probably building with roughly the same options in terms
>> of scheduling granularity etc. That's good. Did you enable spectre
>> mitigations etc on the new kernel? What's the output of
>> `tail /sys/devices/system/cpu/vulnerabilities/*` ?
>
> mitigations are off

Right, I just figured maybe you were hitting some threshold that
involved a lot of indirect calls which slowed things down due to
mitigations. Guess not, then...

-Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05 12:38             ` Toke Høiland-Jørgensen
@ 2020-11-05 12:41               ` Thomas Rosenstein
  2020-11-05 12:47                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-05 12:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: bloat



On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:

> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>
>> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>>
>>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>>>
>>>>> If so, this sounds more like a driver issue, or maybe something to
>>>>> do
>>>>> with scheduling. Does it only happen with ICMP? You could try this
>>>>> tool
>>>>> for a userspace UDP measurement:
>>>>
>>>> It happens with all packets, therefore the transfer to backblaze 
>>>> with
>>>> 40
>>>> threads goes down to ~8MB/s instead of >60MB/s
>>>
>>> Huh, right, definitely sounds like a kernel bug; or maybe the new
>>> kernel
>>> is getting the hardware into a state where it bugs out when there 
>>> are
>>> lots of flows or something.
>>>
>>> You could try looking at the ethtool stats (ethtool -S) while 
>>> running
>>> the test and see if any error counters go up. Here's a handy script 
>>> to
>>> monitor changes in the counters:
>>>
>>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>
>>>> I'll try what that reports!
>>>>
>>>>> Also, what happens if you ping a host on the internet (*through* 
>>>>> the
>>>>> router instead of *to* it)?
>>>>
>>>> Same issue, but twice pronounced, as it seems all interfaces are
>>>> affected.
>>>> So, ping on one interface and the second has the issue.
>>>> Also all traffic across the host has the issue, but on both sides, 
>>>> so
>>>> ping to the internet increased by 2x
>>>
>>> Right, so even an unloaded interface suffers? But this is the same
>>> NIC,
>>> right? So it could still be a hardware issue...
>>>
>>>> Yep default that CentOS ships, I just tested 4.12.5 there the issue
>>>> also
>>>> does not happen. So I guess I can bisect it then...(really don't 
>>>> want
>>>> to
>>>> 😃)
>>>
>>> Well that at least narrows it down :)
>>
>> I just tested 5.9.4 seems to also fix it partly, I have long 
>> stretches
>> where it looks good, and then some increases again. (3.10 Stock has 
>> them
>> too, but not so high, rather 1-3 ms)
>>
>> for example:
>>
>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>
>> and then again:
>>
>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
>> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
>> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
>> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
>> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
>> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
>> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
>> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
>> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
>> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
>> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
>> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
>> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
>> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
>> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>>
>>
>> For me it looks now that there was some fix between 5.4.60 and 5.9.4 
>> ...
>> anyone can pinpoint it?
>
> $ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
> 72932
>
> Only 73k commits; should be easy, right? :)
>
> (In other words no, I have no idea; I'd suggest either (a) asking on
> netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
> knowing).

Guess I'll go the easy route and let it be ...

I'll update all routers to 5.9.4 and see if it fixes the traffic flow -
will report back once more after that.

>
>>>>> How did you configure the new kernel? Did you start from scratch, 
>>>>> or
>>>>> is
>>>>> it based on the old centos config?
>>>>
>>>> first oldconfig and from there then added additional options for 
>>>> IB,
>>>> NVMe, etc (which I don't really need on the routers)
>>>
>>> OK, so you're probably building with roughly the same options in 
>>> terms
>>> of scheduling granularity etc. That's good. Did you enable spectre
>>> mitigations etc on the new kernel? What's the output of
>>> `tail /sys/devices/system/cpu/vulnerabilities/*` ?
>>
>> mitigations are off
>
> Right, I just figured maybe you were hitting some threshold that
> involved a lot of indirect calls which slowed things down due to
> mitigations. Guess not, then...
>

Thanks for the support :)

> -Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05 12:41               ` Thomas Rosenstein
@ 2020-11-05 12:47                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-05 12:47 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: bloat

"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:

> On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:
>
>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>>
>>> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>>>
>>>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>>>>
>>>>>> If so, this sounds more like a driver issue, or maybe something to
>>>>>> do
>>>>>> with scheduling. Does it only happen with ICMP? You could try this
>>>>>> tool
>>>>>> for a userspace UDP measurement:
>>>>>
>>>>> It happens with all packets, therefore the transfer to backblaze 
>>>>> with
>>>>> 40
>>>>> threads goes down to ~8MB/s instead of >60MB/s
>>>>
>>>> Huh, right, definitely sounds like a kernel bug; or maybe the new
>>>> kernel
>>>> is getting the hardware into a state where it bugs out when there 
>>>> are
>>>> lots of flows or something.
>>>>
>>>> You could try looking at the ethtool stats (ethtool -S) while 
>>>> running
>>>> the test and see if any error counters go up. Here's a handy script 
>>>> to
>>>> monitor changes in the counters:
>>>>
>>>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>>
>>>>> I'll try what that reports!
>>>>>
>>>>>> Also, what happens if you ping a host on the internet (*through* 
>>>>>> the
>>>>>> router instead of *to* it)?
>>>>>
>>>>> Same issue, but twice pronounced, as it seems all interfaces are
>>>>> affected.
>>>>> So, ping on one interface and the second has the issue.
>>>>> Also all traffic across the host has the issue, but on both sides, 
>>>>> so
>>>>> ping to the internet increased by 2x
>>>>
>>>> Right, so even an unloaded interface suffers? But this is the same
>>>> NIC,
>>>> right? So it could still be a hardware issue...
>>>>
>>>>> Yep default that CentOS ships, I just tested 4.12.5 there the issue
>>>>> also
>>>>> does not happen. So I guess I can bisect it then...(really don't 
>>>>> want
>>>>> to
>>>>> 😃)
>>>>
>>>> Well that at least narrows it down :)
>>>
>>> I just tested 5.9.4 seems to also fix it partly, I have long 
>>> stretches
>>> where it looks good, and then some increases again. (3.10 Stock has 
>>> them
>>> too, but not so high, rather 1-3 ms)
>>>
>>> for example:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>>
>>> and then again:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>>> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
>>> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
>>> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
>>> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
>>> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
>>> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
>>> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
>>> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
>>> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
>>> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
>>> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
>>> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
>>> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>>>
>>>
>>> For me it looks now that there was some fix between 5.4.60 and 5.9.4 
>>> ...
>>> anyone can pinpoint it?
>>
>> $ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
>> 72932
>>
>> Only 73k commits; should be easy, right? :)
>>
>> (In other words no, I have no idea; I'd suggest either (a) asking on
>> netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
>> knowing).
>
> Guess I'll go the easy route and let it be ...
>
> I'll update all routers to the 5.9.4 and see if it fixes the traffic 
> flow - will report back once more after that.

Sounds like a plan :)

>>
>>>>>> How did you configure the new kernel? Did you start from scratch, 
>>>>>> or
>>>>>> is
>>>>>> it based on the old centos config?
>>>>>
>>>>> first oldconfig and from there then added additional options for 
>>>>> IB,
>>>>> NVMe, etc (which I don't really need on the routers)
>>>>
>>>> OK, so you're probably building with roughly the same options in 
>>>> terms
>>>> of scheduling granularity etc. That's good. Did you enable spectre
>>>> mitigations etc on the new kernel? What's the output of
>>>> `tail /sys/devices/system/cpu/vulnerabilities/*` ?
>>>
>>> mitigations are off
>>
>> Right, I just figured maybe you were hitting some threshold that
>> involved a lot of indirect calls which slowed things down due to
>> mitigations. Guess not, then...
>>
>
> Thanks for the support :)

You're welcome!

-Toke


* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05 12:22           ` Thomas Rosenstein
  2020-11-05 12:38             ` Toke Høiland-Jørgensen
@ 2020-11-05 13:33             ` Jesper Dangaard Brouer
  2020-11-06  8:48               ` Thomas Rosenstein
  2020-11-06  9:18               ` Thomas Rosenstein
  1 sibling, 2 replies; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-05 13:33 UTC (permalink / raw)
  To: Bloat

On Thu, 05 Nov 2020 13:22:10 +0100
Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:

> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
> 
> > "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
> >  
> >>> If so, this sounds more like a driver issue, or maybe something to 
> >>> do with scheduling. Does it only happen with ICMP? You could try this
> >>> tool for a userspace UDP measurement:  
> >>
> >> It happens with all packets, therefore the transfer to backblaze with 
> >> 40 threads goes down to ~8MB/s instead of >60MB/s  
> >
> > Huh, right, definitely sounds like a kernel bug; or maybe the new 
> > kernel is getting the hardware into a state where it bugs out when
> > there are lots of flows or something.
> >
> > You could try looking at the ethtool stats (ethtool -S) while
> > running the test and see if any error counters go up. Here's a
> > handy script to monitor changes in the counters:
> >
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >  
> >> I'll try what that reports!
> >>  
> >>> Also, what happens if you ping a host on the internet (*through*
> >>> the router instead of *to* it)?  
> >>
> >> Same issue, but twice pronounced, as it seems all interfaces are
> >> affected.
> >> So, ping on one interface and the second has the issue.
> >> Also all traffic across the host has the issue, but on both sides,
> >> so ping to the internet increased by 2x  
> >
> > Right, so even an unloaded interface suffers? But this is the same 
> > NIC, right? So it could still be a hardware issue...
> >  
> >> Yep default that CentOS ships, I just tested 4.12.5 there the
> >> issue also does not happen. So I guess I can bisect it
> >> then...(really don't want to 😃)  
> >
> > Well that at least narrows it down :)  
> 
> I just tested 5.9.4 seems to also fix it partly, I have long
> stretches where it looks good, and then some increases again. (3.10
> Stock has them too, but not so high, rather 1-3 ms)
> 
> for example:
> 
> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> 
> and then again:
> 
> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
> 
> 
> For me it looks now that there was some fix between 5.4.60 and 5.9.4
> ... anyone can pinpoint it?

I have some bpftrace tools to measure these kind of latency spikes here:
 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/

The tool you want is: softirq_net_latency.bt
 [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
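
Running it is just a matter of executing the script under bpftrace while
reproducing the load, e.g.:

  sudo bpftrace ./softirq_net_latency.bt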

Example output see[3]:
 [3] https://bugzilla.redhat.com/show_bug.cgi?id=1795049#c8

Based on the kernel versions, I don't expect this to be the same latency
issue as described in the bugzilla[3] case (as IIRC it was fixed in
4.19).  It can still be a similar issue, where some userspace process is
reading information from the kernel (/sys/fs/cgroup/memory/memory.stat
in the BZ case) that blocks softirq from running and results in these
latency spikes.



Install guide to bpftrace[4]:
 [4] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/bpftrace/INSTALL.org
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05 13:33             ` Jesper Dangaard Brouer
@ 2020-11-06  8:48               ` Thomas Rosenstein
  2020-11-06 10:53                 ` Jesper Dangaard Brouer
  2020-11-06  9:18               ` Thomas Rosenstein
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-06  8:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bloat, Toke Høiland-Jørgensen



On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:

> On Thu, 05 Nov 2020 13:22:10 +0100
> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>
>> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>>
>>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>>>
>>>>> If so, this sounds more like a driver issue, or maybe something to
>>>>> do with scheduling. Does it only happen with ICMP? You could try 
>>>>> this
>>>>> tool for a userspace UDP measurement:
>>>>
>>>> It happens with all packets, therefore the transfer to backblaze 
>>>> with
>>>> 40 threads goes down to ~8MB/s instead of >60MB/s
>>>
>>> Huh, right, definitely sounds like a kernel bug; or maybe the new
>>> kernel is getting the hardware into a state where it bugs out when
>>> there are lots of flows or something.
>>>
>>> You could try looking at the ethtool stats (ethtool -S) while
>>> running the test and see if any error counters go up. Here's a
>>> handy script to monitor changes in the counters:
>>>
>>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>
>>>> I'll try what that reports!
>>>>
>>>>> Also, what happens if you ping a host on the internet (*through*
>>>>> the router instead of *to* it)?
>>>>
>>>> Same issue, but twice pronounced, as it seems all interfaces are
>>>> affected.
>>>> So, ping on one interface and the second has the issue.
>>>> Also all traffic across the host has the issue, but on both sides,
>>>> so ping to the internet increased by 2x
>>>
>>> Right, so even an unloaded interface suffers? But this is the same
>>> NIC, right? So it could still be a hardware issue...
>>>
>>>> Yep default that CentOS ships, I just tested 4.12.5 there the
>>>> issue also does not happen. So I guess I can bisect it
>>>> then...(really don't want to 😃)
>>>
>>> Well that at least narrows it down :)
>>
>> I just tested 5.9.4 seems to also fix it partly, I have long
>> stretches where it looks good, and then some increases again. (3.10
>> Stock has them too, but not so high, rather 1-3 ms)
>>
>> for example:
>>
>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>
>> and then again:
>>
>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
>> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
>> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
>> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
>> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
>> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
>> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
>> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
>> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
>> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
>> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
>> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
>> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
>> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
>> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>>
>>
>> For me it looks now that there was some fix between 5.4.60 and 5.9.4
>> ... anyone can pinpoint it?

So, new day, same issue!

I have now upgraded all routers to 5.9.4, and the issue is back ...

Here, when I stop it, it goes immediately down to 0.xx ms:

64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms
64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms
64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms
64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms
64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms
64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms
64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms
64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms
64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms
64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms
64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms
64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms
64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms
64 bytes from x.x.x.x: icmp_seq=64 ttl=64 time=9.69 ms
64 bytes from x.x.x.x: icmp_seq=65 ttl=64 time=7.52 ms
64 bytes from x.x.x.x: icmp_seq=66 ttl=64 time=14.9 ms
64 bytes from x.x.x.x: icmp_seq=67 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=68 ttl=64 time=2.34 ms
64 bytes from x.x.x.x: icmp_seq=69 ttl=64 time=8.97 ms
64 bytes from x.x.x.x: icmp_seq=70 ttl=64 time=0.203 ms
64 bytes from x.x.x.x: icmp_seq=71 ttl=64 time=9.10 ms
64 bytes from x.x.x.x: icmp_seq=72 ttl=64 time=3.16 ms
64 bytes from x.x.x.x: icmp_seq=73 ttl=64 time=1.88 ms
64 bytes from x.x.x.x: icmp_seq=74 ttl=64 time=11.5 ms
64 bytes from x.x.x.x: icmp_seq=75 ttl=64 time=11.9 ms
64 bytes from x.x.x.x: icmp_seq=76 ttl=64 time=3.64 ms
64 bytes from x.x.x.x: icmp_seq=77 ttl=64 time=5.82 ms
64 bytes from x.x.x.x: icmp_seq=78 ttl=64 time=0.121 ms
64 bytes from x.x.x.x: icmp_seq=79 ttl=64 time=0.212 ms
64 bytes from x.x.x.x: icmp_seq=80 ttl=64 time=0.158 ms
64 bytes from x.x.x.x: icmp_seq=81 ttl=64 time=0.174 ms
64 bytes from x.x.x.x: icmp_seq=82 ttl=64 time=0.160 ms
64 bytes from x.x.x.x: icmp_seq=83 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=84 ttl=64 time=0.211 ms
64 bytes from x.x.x.x: icmp_seq=85 ttl=64 time=0.182 ms
64 bytes from x.x.x.x: icmp_seq=86 ttl=64 time=0.161 ms
64 bytes from x.x.x.x: icmp_seq=87 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=88 ttl=64 time=0.143 ms
64 bytes from x.x.x.x: icmp_seq=89 ttl=64 time=0.186 ms
64 bytes from x.x.x.x: icmp_seq=90 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=91 ttl=64 time=0.187 ms
64 bytes from x.x.x.x: icmp_seq=92 ttl=64 time=0.174 ms
64 bytes from x.x.x.x: icmp_seq=93 ttl=64 time=0.167 ms


I have also tried the bpftrace tool; there were no high IRQ latencies
reported during the test. Here's the latency output:

@latency_usecs[21]:
[0]                   17 |                                                    |
[1]                  463 |@@@                                                 |
[2, 4)              6814 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)              5401 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           |
[8, 16)              136 |@                                                   |
[16, 32)              42 |                                                    |
[32, 64)               2 |                                                    |
[64, 128)              1 |                                                    |

@latency_usecs[13]:
[0]                  108 |                                                    |
[1]                 2089 |@@@                                                 |
[2, 4)             32112 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)              7642 |@@@@@@@@@@@@                                        |
[8, 16)              462 |                                                    |
[16, 32)             159 |                                                    |
[32, 64)              32 |                                                    |
[64, 128)             34 |                                                    |
[128, 256)            10 |                                                    |
[256, 512)             1 |                                                    |

@latency_usecs[22]:
[0]                  244 |                                                    |
[1]                 3481 |@@@@@                                               |
[2, 4)             33795 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             13848 |@@@@@@@@@@@@@@@@@@@@@                               |
[8, 16)              607 |                                                    |
[16, 32)             214 |                                                    |
[32, 64)              58 |                                                    |
[64, 128)             42 |                                                    |
[128, 256)            12 |                                                    |

@latency_usecs[16]:
[0]                  539 |                                                    |
[1]                 7601 |@@@@@@                                              |
[2, 4)             62482 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             22456 |@@@@@@@@@@@@@@@@@@                                  |
[8, 16)             1252 |@                                                   |
[16, 32)             480 |                                                    |
[32, 64)             211 |                                                    |
[64, 128)            134 |                                                    |
[128, 256)            17 |                                                    |

@latency_usecs[2]:
[0]                  352 |                                                    |
[1]                18810 |@@@@@@@@@@@@@                                       |
[2, 4)             74915 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             11504 |@@@@@@@                                             |
[8, 16)             1077 |                                                    |
[16, 32)             479 |                                                    |
[32, 64)             329 |                                                    |
[64, 128)            359 |                                                    |
[128, 256)            79 |                                                    |
[256, 512)             2 |                                                    |

@latency_usecs[23]:
[0]                 1214 |                                                    |
[1]                 9483 |@@@@@@@                                             |
[2, 4)             64949 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             34220 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[8, 16)             1940 |@                                                   |
[16, 32)             775 |                                                    |
[32, 64)             224 |                                                    |
[64, 128)            220 |                                                    |
[128, 256)            51 |                                                    |

@latency_usecs[9]:
[0]                  407 |                                                    |
[1]                 8040 |@@@@@                                               |
[2, 4)             80780 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             46867 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[8, 16)             3232 |@@                                                  |
[16, 32)             614 |                                                    |
[32, 64)             337 |                                                    |
[64, 128)            231 |                                                    |
[128, 256)            33 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              1 |                                                    |

@latency_usecs[4]:
[0]                  513 |                                                    |
[1]                10016 |@@@@@                                               |
[2, 4)            100149 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             32956 |@@@@@@@@@@@@@@@@@                                   |
[8, 16)             1436 |                                                    |
[16, 32)             731 |                                                    |
[32, 64)             285 |                                                    |
[64, 128)            302 |                                                    |
[128, 256)            64 |                                                    |
[256, 512)             1 |                                                    |

@latency_usecs[18]:
[0]                  679 |                                                    |
[1]                10962 |@@@@@@                                              |
[2, 4)             93837 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             36900 |@@@@@@@@@@@@@@@@@@@@                                |
[8, 16)             2050 |@                                                   |
[16, 32)            1283 |                                                    |
[32, 64)             390 |                                                    |
[64, 128)            419 |                                                    |
[128, 256)            91 |                                                    |
[256, 512)             1 |                                                    |

@latency_usecs[3]:
[0]                  792 |                                                    |
[1]                12109 |@@@@@@@                                             |
[2, 4)             83144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             43316 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[8, 16)             4688 |@@                                                  |
[16, 32)            2087 |@                                                   |
[32, 64)             786 |                                                    |
[64, 128)            791 |                                                    |
[128, 256)           316 |                                                    |
[256, 512)            43 |                                                    |
[512, 1K)              5 |                                                    |
[1K, 2K)               1 |                                                    |

@latency_usecs[5]:
[0]                 1990 |@                                                   |
[1]                14018 |@@@@@@@@                                            |
[2, 4)             85797 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             56720 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[8, 16)             2993 |@                                                   |
[16, 32)            1239 |                                                    |
[32, 64)             536 |                                                    |
[64, 128)            567 |                                                    |
[128, 256)           106 |                                                    |
[256, 512)             5 |                                                    |
[512, 1K)             11 |                                                    |
[1K, 2K)               4 |                                                    |

@latency_usecs[15]:
[0]                  719 |                                                    |
[1]                12086 |@@@@@@                                              |
[2, 4)             95664 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             57490 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[8, 16)             3632 |@                                                   |
[16, 32)            1417 |                                                    |
[32, 64)             695 |                                                    |
[64, 128)            696 |                                                    |
[128, 256)           182 |                                                    |
[256, 512)             2 |                                                    |
[512, 1K)              7 |                                                    |
[1K, 2K)               6 |                                                    |

@latency_usecs[6]:
[0]                  585 |                                                    |
[1]                11880 |@@@@@@@                                             |
[2, 4)             87736 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             72264 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |
[8, 16)             2554 |@                                                   |
[16, 32)            1403 |                                                    |
[32, 64)             813 |                                                    |
[64, 128)            624 |                                                    |
[128, 256)           168 |                                                    |
[256, 512)            23 |                                                    |
[512, 1K)              3 |                                                    |

@latency_usecs[11]:
[0]                  768 |                                                    |
[1]                19378 |@@@@@@@@@                                           |
[2, 4)            104935 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             81555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            |
[8, 16)             3645 |@                                                   |
[16, 32)            1535 |                                                    |
[32, 64)             837 |                                                    |
[64, 128)            913 |                                                    |
[128, 256)           207 |                                                    |
[256, 512)             2 |                                                    |

@latency_usecs[1]:
[0]                 1511 |                                                    |
[1]                23327 |@@@@@@@@@@                                          |
[2, 4)            107938 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[4, 8)            118722 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)             5364 |@@                                                  |
[16, 32)            2551 |@                                                   |
[32, 64)             874 |                                                    |
[64, 128)            809 |                                                    |
[128, 256)           192 |                                                    |
[256, 512)             3 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               1 |                                                    |

@latency_usecs[17]:
[0]                 1268 |                                                    |
[1]                16486 |@@@@@                                               |
[2, 4)            146148 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             92483 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[8, 16)            12484 |@@@@                                                |
[16, 32)            2330 |                                                    |
[32, 64)            1040 |                                                    |
[64, 128)           1054 |                                                    |
[128, 256)           234 |                                                    |
[256, 512)             6 |                                                    |
[512, 1K)              2 |                                                    |
[1K, 2K)               1 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               1 |                                                    |

@latency_usecs[14]:
[0]                 2685 |@                                                   |
[1]               119278 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)            113518 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |
[4, 8)             33067 |@@@@@@@@@@@@@@                                      |
[8, 16)             9740 |@@@@                                                |
[16, 32)            3348 |@                                                   |
[32, 64)            2245 |                                                    |
[64, 128)           1515 |                                                    |
[128, 256)           466 |                                                    |
[256, 512)             8 |                                                    |

@latency_usecs[10]:
[0]                 2013 |                                                    |
[1]                28900 |@@@@@@@@                                            |
[2, 4)            175104 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             71562 |@@@@@@@@@@@@@@@@@@@@@                               |
[8, 16)             6094 |@                                                   |
[16, 32)            2884 |                                                    |
[32, 64)            1481 |                                                    |
[64, 128)           1450 |                                                    |
[128, 256)           311 |                                                    |
[256, 512)             6 |                                                    |
[512, 1K)              1 |                                                    |

@latency_usecs[7]:
[0]                 4790 |@                                                   |
[1]                94276 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[2, 4)            186012 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             44289 |@@@@@@@@@@@@                                        |
[8, 16)             6822 |@                                                   |
[16, 32)            2842 |                                                    |
[32, 64)            2361 |                                                    |
[64, 128)           3178 |                                                    |
[128, 256)           910 |                                                    |
[256, 512)            19 |                                                    |
[512, 1K)              1 |                                                    |

@latency_usecs[12]:
[0]                 2360 |                                                    |
[1]                33684 |@@@@@@                                              |
[2, 4)            280464 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             51498 |@@@@@@@@@                                           |
[8, 16)            10636 |@                                                   |
[16, 32)            4374 |                                                    |
[32, 64)            2077 |                                                    |
[64, 128)           2414 |                                                    |
[128, 256)           596 |                                                    |
[256, 512)             8 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               1 |                                                    |

@latency_usecs[0]:
[0]                 3559 |                                                    |
[1]                41168 |@@@@@@@@                                            |
[2, 4)            251224 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            126271 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[8, 16)            15213 |@@@                                                 |
[16, 32)            6788 |@                                                   |
[32, 64)            4227 |                                                    |
[64, 128)           4231 |                                                    |
[128, 256)          1669 |                                                    |
[256, 512)           395 |                                                    |
[512, 1K)             54 |                                                    |
[1K, 2K)               1 |                                                    |

@latency_usecs[8]:
[0]                 5280 |                                                    |
[1]                43115 |@@@@@@@                                             |
[2, 4)            319098 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)             98723 |@@@@@@@@@@@@@@@@                                    |
[8, 16)            16048 |@@                                                  |
[16, 32)            9455 |@                                                   |
[32, 64)            3755 |                                                    |
[64, 128)           4650 |                                                    |
[128, 256)          1284 |                                                    |
[256, 512)            61 |                                                    |
[512, 1K)              2 |                                                    |
[1K, 2K)               1 |                                                    |
[2K, 4K)               1 |                                                    |
[4K, 8K)               2 |                                                    |

@latency_usecs[19]:
[0]                49889 |@@@@@@@@@@@@                                        |
[1]               199560 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)            185771 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[4, 8)             59143 |@@@@@@@@@@@@@@@                                     |
[8, 16)            24402 |@@@@@@                                              |
[16, 32)           13158 |@@@                                                 |
[32, 64)           12889 |@@@                                                 |
[64, 128)          12269 |@@@                                                 |
[128, 256)          6322 |@                                                   |
[256, 512)           299 |                                                    |
[512, 1K)              8 |                                                    |
[1K, 2K)               1 |                                                    |

@latency_usecs[20]:
[0]                 6314 |                                                    |
[1]                51792 |@@@@@@@                                             |
[2, 4)            353847 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            154730 |@@@@@@@@@@@@@@@@@@@@@@                              |
[8, 16)            23434 |@@@                                                 |
[16, 32)            8441 |@                                                   |
[32, 64)            3300 |                                                    |
[64, 128)           3058 |                                                    |
[128, 256)          1017 |                                                    |
[256, 512)           124 |                                                    |
[512, 1K)             12 |                                                    |
[1K, 2K)               1 |                                                    |

@Jesper: how should that be interpreted?

>
> I have some bpftrace tools to measure these kind of latency spikes 
> here:
>  [1] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/
>
> The tool you want is: softirq_net_latency.bt
>  [2] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>
> Example output see[3]:
>  [3] https://bugzilla.redhat.com/show_bug.cgi?id=1795049#c8
>
> Based on the kernel versions, I don't expect this to be same latency
> issue as described in the bugzilla[3] case (as IIRC it was fixed in
> 4.19).  It can still be similar issue, where some userspace process is
> reading information from the kernel (/sys/fs/cgroup/memory/memory.stat
> in BZ case) that blocks softirq from running, and result in these
> latency spikes.
>




>
>
> Install guide to bpftrace[4]:
>  [4] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/mem/bpftrace/INSTALL.org
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-05 13:33             ` Jesper Dangaard Brouer
  2020-11-06  8:48               ` Thomas Rosenstein
@ 2020-11-06  9:18               ` Thomas Rosenstein
  2020-11-06 11:18                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-06  9:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bloat, Toke Høiland-Jørgensen



On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:

> On Thu, 05 Nov 2020 13:22:10 +0100
> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>
>> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>>
>>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>>>
>>>>> If so, this sounds more like a driver issue, or maybe something to
>>>>> do with scheduling. Does it only happen with ICMP? You could try 
>>>>> this
>>>>> tool for a userspace UDP measurement:
>>>>
>>>> It happens with all packets, therefore the transfer to backblaze 
>>>> with
>>>> 40 threads goes down to ~8MB/s instead of >60MB/s
>>>
>>> Huh, right, definitely sounds like a kernel bug; or maybe the new
>>> kernel is getting the hardware into a state where it bugs out when
>>> there are lots of flows or something.
>>>
>>> You could try looking at the ethtool stats (ethtool -S) while
>>> running the test and see if any error counters go up. Here's a
>>> handy script to monitor changes in the counters:
>>>
>>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>
>>>> I'll try what that reports!
>>>>
>>>>> Also, what happens if you ping a host on the internet (*through*
>>>>> the router instead of *to* it)?
>>>>
>>>> Same issue, but twice pronounced, as it seems all interfaces are
>>>> affected.
>>>> So, ping on one interface and the second has the issue.
>>>> Also all traffic across the host has the issue, but on both sides,
>>>> so ping to the internet increased by 2x
>>>
>>> Right, so even an unloaded interface suffers? But this is the same
>>> NIC, right? So it could still be a hardware issue...
>>>
>>>> Yep default that CentOS ships, I just tested 4.12.5 there the
>>>> issue also does not happen. So I guess I can bisect it
>>>> then...(really don't want to 😃)
>>>
>>> Well that at least narrows it down :)
>>
>> I just tested 5.9.4 seems to also fix it partly, I have long
>> stretches where it looks good, and then some increases again. (3.10
>> Stock has them too, but not so high, rather 1-3 ms)
>>
>> for example:
>>
>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>
>> and then again:
>>
>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
>> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
>> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
>> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
>> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
>> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
>> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
>> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
>> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
>> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
>> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
>> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
>> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
>> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
>> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>>
>>
>> For me it looks now that there was some fix between 5.4.60 and 5.9.4
>> ... anyone can pinpoint it?

I have now retried with 3.10; here are the results:

I'm pinging between the hosts carrying the traffic flow; the upload is
running at 55 MB/s (compared to ~20 MB/s on 5.9.4, and 8 MB/s on 5.4.60):

64 bytes from x.x.x.x: icmp_seq=159 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=160 ttl=64 time=0.056 ms
64 bytes from x.x.x.x: icmp_seq=161 ttl=64 time=0.061 ms
64 bytes from x.x.x.x: icmp_seq=162 ttl=64 time=0.072 ms
64 bytes from x.x.x.x: icmp_seq=163 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=164 ttl=64 time=0.053 ms
64 bytes from x.x.x.x: icmp_seq=165 ttl=64 time=0.068 ms
64 bytes from x.x.x.x: icmp_seq=166 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=167 ttl=64 time=0.057 ms
64 bytes from x.x.x.x: icmp_seq=168 ttl=64 time=0.051 ms
64 bytes from x.x.x.x: icmp_seq=169 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=170 ttl=64 time=0.138 ms
64 bytes from x.x.x.x: icmp_seq=171 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=172 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=173 ttl=64 time=0.094 ms
64 bytes from x.x.x.x: icmp_seq=174 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=175 ttl=64 time=0.810 ms
64 bytes from x.x.x.x: icmp_seq=176 ttl=64 time=0.077 ms
64 bytes from x.x.x.x: icmp_seq=177 ttl=64 time=0.055 ms
64 bytes from x.x.x.x: icmp_seq=178 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=179 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=180 ttl=64 time=0.073 ms
64 bytes from x.x.x.x: icmp_seq=181 ttl=64 time=0.065 ms
64 bytes from x.x.x.x: icmp_seq=182 ttl=64 time=0.123 ms
64 bytes from x.x.x.x: icmp_seq=183 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=184 ttl=64 time=0.054 ms
64 bytes from x.x.x.x: icmp_seq=185 ttl=64 time=0.063 ms
64 bytes from x.x.x.x: icmp_seq=186 ttl=64 time=0.044 ms
64 bytes from x.x.x.x: icmp_seq=187 ttl=64 time=0.075 ms
64 bytes from x.x.x.x: icmp_seq=188 ttl=64 time=0.076 ms
64 bytes from x.x.x.x: icmp_seq=189 ttl=64 time=0.065 ms
64 bytes from x.x.x.x: icmp_seq=190 ttl=64 time=0.049 ms

The response times under the traffic flow are around 3 times better than
on the newer kernels without any traffic flow.
I have a feeling that maybe not all config options were correctly moved
to the newer kernel.
Or there's a big bug somewhere ... (though it would seem rather weird for
me to be the first one to discover it)

I'll rebuild the 5.9 kernel on one of the 3.10 kernel and see if it 
makes a difference ...

>
> I have some bpftrace tools to measure these kind of latency spikes 
> here:
>  [1] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/
>
> The tool you want is: softirq_net_latency.bt
>  [2] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>
> Example output see[3]:
>  [3] https://bugzilla.redhat.com/show_bug.cgi?id=1795049#c8
>
> Based on the kernel versions, I don't expect this to be same latency
> issue as described in the bugzilla[3] case (as IIRC it was fixed in
> 4.19).  It can still be similar issue, where some userspace process is
> reading information from the kernel (/sys/fs/cgroup/memory/memory.stat
> in BZ case) that blocks softirq from running, and result in these
> latency spikes.
>
>
>
> Install guide to bpftrace[4]:
>  [4] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/mem/bpftrace/INSTALL.org
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06  8:48               ` Thomas Rosenstein
@ 2020-11-06 10:53                 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-06 10:53 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bloat, Toke Høiland-Jørgensen, brouer

On Fri, 06 Nov 2020 09:48:46 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:
> 
> > On Thu, 05 Nov 2020 13:22:10 +0100
> > Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
> >  
> >> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
> >>  
> >>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
> >>>  
> >>>>> If so, this sounds more like a driver issue, or maybe something to
> >>>>> do with scheduling. Does it only happen with ICMP? You could try 
> >>>>> this
> >>>>> tool for a userspace UDP measurement:  
> >>>>
> >>>> It happens with all packets, therefore the transfer to backblaze 
> >>>> with
> >>>> 40 threads goes down to ~8MB/s instead of >60MB/s  
> >>>
> >>> Huh, right, definitely sounds like a kernel bug; or maybe the new
> >>> kernel is getting the hardware into a state where it bugs out when
> >>> there are lots of flows or something.
> >>>
> >>> You could try looking at the ethtool stats (ethtool -S) while
> >>> running the test and see if any error counters go up. Here's a
> >>> handy script to monitor changes in the counters:
> >>>
> >>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >>>  
> >>>> I'll try what that reports!
> >>>>  
> >>>>> Also, what happens if you ping a host on the internet (*through*
> >>>>> the router instead of *to* it)?  
> >>>>
> >>>> Same issue, but twice pronounced, as it seems all interfaces are
> >>>> affected.
> >>>> So, ping on one interface and the second has the issue.
> >>>> Also all traffic across the host has the issue, but on both sides,
> >>>> so ping to the internet increased by 2x  
> >>>
> >>> Right, so even an unloaded interface suffers? But this is the same
> >>> NIC, right? So it could still be a hardware issue...
> >>>  
> >>>> Yep default that CentOS ships, I just tested 4.12.5 there the
> >>>> issue also does not happen. So I guess I can bisect it
> >>>> then...(really don't want to 😃)  
> >>>
> >>> Well that at least narrows it down :)  
> >>
> >> I just tested 5.9.4 seems to also fix it partly, I have long
> >> stretches where it looks good, and then some increases again. (3.10
> >> Stock has them too, but not so high, rather 1-3 ms)
> >>
> >> for example:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> >>
> >> and then again:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> >> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> >> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> >> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> >> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> >> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> >> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> >> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> >> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> >> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> >> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> >> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> >> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> >> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> >> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> >> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
> >>
> >>
> >> For me it looks now that there was some fix between 5.4.60 and 5.9.4
> >> ... anyone can pinpoint it?  
> 
> So, new day, same issue!
> 
> I upgraded now all routers to 5.9.4, and the issue is back ...
> 
> here, when I stop it, it goes immediatly down to 0.xx ms
> 
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms
> 64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms
> 64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms
> 64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms
> 64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms
> 64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms
> 64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms
> 64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms
> 64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms
> 64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms
> 64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms
> 64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms
> 64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms
> 64 bytes from x.x.x.x: icmp_seq=64 ttl=64 time=9.69 ms
> 64 bytes from x.x.x.x: icmp_seq=65 ttl=64 time=7.52 ms
> 64 bytes from x.x.x.x: icmp_seq=66 ttl=64 time=14.9 ms
> 64 bytes from x.x.x.x: icmp_seq=67 ttl=64 time=12.6 ms
> 64 bytes from x.x.x.x: icmp_seq=68 ttl=64 time=2.34 ms
> 64 bytes from x.x.x.x: icmp_seq=69 ttl=64 time=8.97 ms
> 64 bytes from x.x.x.x: icmp_seq=70 ttl=64 time=0.203 ms
> 64 bytes from x.x.x.x: icmp_seq=71 ttl=64 time=9.10 ms
> 64 bytes from x.x.x.x: icmp_seq=72 ttl=64 time=3.16 ms
> 64 bytes from x.x.x.x: icmp_seq=73 ttl=64 time=1.88 ms
> 64 bytes from x.x.x.x: icmp_seq=74 ttl=64 time=11.5 ms
> 64 bytes from x.x.x.x: icmp_seq=75 ttl=64 time=11.9 ms
> 64 bytes from x.x.x.x: icmp_seq=76 ttl=64 time=3.64 ms
> 64 bytes from x.x.x.x: icmp_seq=77 ttl=64 time=5.82 ms
> 64 bytes from x.x.x.x: icmp_seq=78 ttl=64 time=0.121 ms
> 64 bytes from x.x.x.x: icmp_seq=79 ttl=64 time=0.212 ms
> 64 bytes from x.x.x.x: icmp_seq=80 ttl=64 time=0.158 ms
> 
> 
> I have also tried the bpftrace, there were no high IRQ reported during 
> the test, here's the latency output:

So, you didn't see any reports of "High IRQ-to-softirq latency" ?

> 
[...cut...]

> @Jesper: how should that be interpreted?

The tool looks for outliers; by default, anything above 2000 us (2 ms)
will make it print a statement about finding "High IRQ-to-softirq
latency".  In the output below there are 3 cases that should have
triggered such a printout line...
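
If it is easier, the full output can also be captured and grepped for
that warning line afterwards, e.g. (just a sketch; adjust the path to
wherever the script lives):

  sudo bpftrace softirq_net_latency.bt | tee /tmp/softirq_lat.log
  grep -i "high irq-to-softirq" /tmp/softirq_lat.log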


> @latency_usecs[9]:
> [0]                  407 |                                                  |
> [1]                 8040 |@@@@@                                             |
> [2, 4)             80780 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4, 8)             46867 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
> [8, 16)             3232 |@@                                                |
> [16, 32)             614 |                                                  |
> [32, 64)             337 |                                                  |
> [64, 128)            231 |                                                  |
> [128, 256)            33 |                                                  |
> [256, 512)             0 |                                                  |
> [512, 1K)              0 |                                                  |
> [1K, 2K)               0 |                                                  |
> [2K, 4K)               0 |                                                  |
> [4K, 8K)               0 |                                                  |
> [8K, 16K)              1 |                                                  |
> 

The majority of the IRQ-to-softirq latency is within 2-8 usecs
(0.002-0.008 ms).

The interesting part is the outlier: a single sample in the [8K, 16K)
bucket, i.e. between 8 ms and 16 ms.

[... cut ...]

> @latency_usecs[17]:
> [0]                 1268 |                                                  |
> [1]                16486 |@@@@@                                             |
> [2, 4)            146148 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4, 8)             92483 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
> [8, 16)            12484 |@@@@                                              |
> [16, 32)            2330 |                                                  |
> [32, 64)            1040 |                                                  |
> [64, 128)           1054 |                                                  |
> [128, 256)           234 |                                                  |
> [256, 512)             6 |                                                  |
> [512, 1K)              2 |                                                  |
> [1K, 2K)               1 |                                                  |
> [2K, 4K)               0 |                                                  |
> [4K, 8K)               1 |                                                  |

[...cut...]

Here the outlier is a single sample in the [4K, 8K) bucket, i.e. between
4 ms and 8 ms.

> 
> @latency_usecs[12]:
> [0]                 2360 |                                                  |
> [1]                33684 |@@@@@@                                            |
> [2, 4)            280464 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4, 8)             51498 |@@@@@@@@@                                         |
> [8, 16)            10636 |@                                                 |
> [16, 32)            4374 |                                                  |
> [32, 64)            2077 |                                                  |
> [64, 128)           2414 |                                                  |
> [128, 256)           596 |                                                  |
> [256, 512)             8 |                                                  |
> [512, 1K)              0 |                                                  |
> [1K, 2K)               0 |                                                  |
> [2K, 4K)               0 |                                                  |
> [4K, 8K)               1 |                                                  |
> 
[...cut...]

The tool measures hardirq-to-softirq delays, which can grow if softirq
is blocked from running, meaning the kernel was busy doing other stuff
(especially on non-preempt kernels).

Even though there were only a few outliers in the above, your ping shows
a much higher count of high-latency events, although each histogram
count can cover multiple packets at this level.  Thus, the issue might
lie in this specific step.

I will think of other ways to catch these latency outliers.

We can likely find inspiration in the GitHub SRE blogpost on how they
troubleshot their latency issue (which my tool is related to).
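
In the meantime, another angle worth a quick look (a sketch, separate
from softirq_net_latency.bt): measure how long each NET_RX softirq
actually runs, since a long-running softirq on the forwarding CPU can
also delay packets without showing up as IRQ-to-softirq delay:

  sudo bpftrace -e '
    tracepoint:irq:softirq_entry /args->vec == 3/ { @t[cpu] = nsecs; }
    tracepoint:irq:softirq_exit  /args->vec == 3 && @t[cpu]/ {
      @net_rx_run_usecs = hist((nsecs - @t[cpu]) / 1000);
      delete(@t[cpu]);
    }'

(vec == 3 is NET_RX_SOFTIRQ; stop it with Ctrl-C to print the histogram.)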

> >
> > I have some bpftrace tools to measure these kind of latency spikes 
> > here:
> >  [1] 
> > https://github.com/xdp-project/xdp-project/blob/master/areas/latency/
> >
> > The tool you want is: softirq_net_latency.bt
> >  [2] 
> > https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
> >
> > Example output see[3]:
> >  [3] https://bugzilla.redhat.com/show_bug.cgi?id=1795049#c8
> >
> > Based on the kernel versions, I don't expect this to be same latency
> > issue as described in the bugzilla[3] case (as IIRC it was fixed in
> > 4.19).  It can still be similar issue, where some userspace process is
> > reading information from the kernel (/sys/fs/cgroup/memory/memory.stat
> > in BZ case) that blocks softirq from running, and result in these
> > latency spikes.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06  9:18               ` Thomas Rosenstein
@ 2020-11-06 11:18                 ` Jesper Dangaard Brouer
  2020-11-06 11:37                   ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-06 11:18 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bloat, Toke Høiland-Jørgensen, brouer

On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> >> I just tested 5.9.4 seems to also fix it partly, I have long
> >> stretches where it looks good, and then some increases again. (3.10
> >> Stock has them too, but not so high, rather 1-3 ms)
> >>

That you have long stretches where latency looks good is interesting
information.  My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
kernel from processing packets from the network card. (Note it can also
be a kernel thread.)

Another theory is that the NIC hardware does strange things, but it is
not very likely: e.g. delaying the packets before generating the IRQ,
which would hide the delay from my IRQ-to-softirq latency tool.
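
If you want to rule that out, it is easy to see what interrupt
coalescing the NIC is doing, and (if the mlx driver supports it)
temporarily turn off the adaptive scheme.  Something like this, with
eth4 only as an example and the numbers purely illustrative:

  ethtool -c eth4                      # show current coalescing settings
  ethtool -C eth4 adaptive-rx off rx-usecs 8 rx-frames 16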

A question: what traffic control qdisc are you using on your system?

Have you looked at the obvious case: do any of your qdiscs report a
large backlog during the incidents?
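
For example, something along these lines while the pings are high (eth4
as an example; pick the interfaces that carry the flow):

  watch -n 0.5 'tc -s qdisc show dev eth4'

and keep an eye on the "backlog" and "requeues" lines.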


> >> for example:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> >>
> >> and then again:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms

These very low ping times tell me that you are measuring very close to
the target machine, which is good.  Here on the bufferbloat list, we are
always suspicious of the network equipment used in these kinds of
setups, as experience tells us that it can be the cause of bufferbloat
latency.

You mention some fs.com switches (in your description below the
signature); can you tell us more?


[...]
> I have a feeling that maybe not all config options were correctly moved 
> to the newer kernel.
>
> Or there's a big bug somewhere ... (which would seem rather weird for me 
> to be the first one to discover this)

I really appreciate that you report this.  This is a periodic issue,
which often results in people not reporting it at all.

Even if we find this to be caused by some process running on your
system, or by a bad config, it is really important that we find the
root cause.

> I'll rebuild the 5.9 kernel on one of the 3.10 kernel and see if it 
> makes a difference ...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:

> General Info:
> 
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 Ghz
> 96 GB RAM



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 11:18                 ` Jesper Dangaard Brouer
@ 2020-11-06 11:37                   ` Thomas Rosenstein
  2020-11-06 11:45                     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-06 11:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bloat, Toke Høiland-Jørgensen



On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:

> On Fri, 06 Nov 2020 10:18:10 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>>>> I just tested 5.9.4 seems to also fix it partly, I have long
>>>> stretches where it looks good, and then some increases again. (3.10
>>>> Stock has them too, but not so high, rather 1-3 ms)
>>>>
>
> That you have long stretches where latency looks good is interesting
> information.   My theory is that your system have a periodic userspace
> process that does a kernel syscall that takes too long, blocking
> network card from processing packets. (Note it can also be a kernel
> thread).

The weird part is: at first I only updated router-02 and pinged
router-04 (out of the traffic flow), and there I noticed these long
stretches of OK ping.

When I then also updated router-03 and router-04, the old behaviour was
kind of back, which confused me.

Could this be related to netlink? I have gobgpd running on these 
routers, which injects routes via netlink.
But the churn rate during the tests is very minimal, maybe 30 - 40 
routes every second.
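
One way to double-check the actual churn during a test would be
something like the following (log path arbitrary), and then correlating
the timestamps with the ping spikes:

  ip -timestamp monitor route > /tmp/route-churn.log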

Otherwise we got: salt-minion, collectd, node_exporter, sshd

>
> Another theory is the NIC HW does strange things, but it is not very
> likely.  E.g. delaying the packets before generating the IRQ 
> interrupt,
> which hide it from my IRQ-to-softirq latency tool.
>
> A question: What traffic control qdisc are you using on your system?

kernel 4+ uses pfifo, but there are no dropped packets
I have also tested with fq_codel: same behaviour, and also no weirdness
in the packet queues themselves

kernel 3.10 uses mq, and noqueue for the vlan interfaces


Here's the mail archive link for the question on lartc :

https://www.spinics.net/lists/lartc/msg23774.html

>
> What you looked at the obvious case if any of your qdisc report a 
> large
> backlog? (during the incidents)

As said above, nothing shows up in the qdiscs and nothing is reported.

>
>
>>>> for example:
>>>>
>>>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>>>
>>>> and then again:
>>>>
>>>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>>>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>
> These very low ping times tell me that you are measuring very close to
> the target machine, which is good.  Here on the bufferbloat list, we 
> are
> always suspicious of network equipment being use in these kind of
> setups.  As experience tells us that this can be the cause of
> bufferbloat latency.
Yes, I'm just testing across two machines connected directly to the same
switch; basically that's the best-case scenario apart from a direct
connection.

I do also use a VLAN on this interface, so the pings go through the vlan
stack!

>
> You mention some fs.com switches (your desc below signature), can you
> tell us more?

It's an fs.com N5850-48S6Q: 48 ports of 10 Gbit + 6 ports of 40 Gbit.

Only 6 ports are in use at 10 G, and 2 at 1 G; basically no traffic.

>
>
> [...]
>> I have a feeling that maybe not all config options were correctly 
>> moved
>> to the newer kernel.
>>
>> Or there's a big bug somewhere ... (which would seem rather weird for 
>> me
>> to be the first one to discover this)
>
> I really appreciate that you report this.  This is a periodic issue,
> that often result in people not reporting this.
>
> Even if we find this to be caused by some process running on your
> system, or a bad config, this it is really important that we find the
> root-cause.
>
>> I'll rebuild the 5.9 kernel on one of the 3.10 kernel and see if it
>> makes a difference ...
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> On Wed, 04 Nov 2020 16:23:12 +0100
> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>
>> General Info:
>>
>> Routers are connected between each other with 10G Mellanox Connect-X
>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>> Latency generally is around 0.18 ms between all routers (4).
>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with 
>> iperf3.
>> 2 of the 4 routers are connected upstream with a 1G connection 
>> (separate
>> port, same network card)
>> All routers have the full internet routing tables, i.e. 80k entries 
>> for
>> IPv6 and 830k entries for IPv4
>> Conntrack is disabled (-j NOTRACK)
>> Kernel 5.4.60 (custom)
>> 2x Xeon X5670 @ 2.93 Ghz
>> 96 GB RAM

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 11:37                   ` Thomas Rosenstein
@ 2020-11-06 11:45                     ` Toke Høiland-Jørgensen
  2020-11-06 12:01                       ` Thomas Rosenstein
  2020-11-06 12:53                       ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-06 11:45 UTC (permalink / raw)
  To: Thomas Rosenstein, Jesper Dangaard Brouer; +Cc: Bloat

"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:

> On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
>
>> On Fri, 06 Nov 2020 10:18:10 +0100
>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>>
>>>>> I just tested 5.9.4 seems to also fix it partly, I have long
>>>>> stretches where it looks good, and then some increases again. (3.10
>>>>> Stock has them too, but not so high, rather 1-3 ms)
>>>>>
>>
>> That you have long stretches where latency looks good is interesting
>> information.   My theory is that your system have a periodic userspace
>> process that does a kernel syscall that takes too long, blocking
>> network card from processing packets. (Note it can also be a kernel
>> thread).
>
> The weird part is, I first only updated router-02 and pinged to 
> router-04 (out of traffic flow), there I noticed these long stretches of 
> ok ping.
>
> When I updated also router-03 and router-04, the old behaviour kind of 
> was back, this confused me.
>
> Could this be related to netlink? I have gobgpd running on these 
> routers, which injects routes via netlink.
> But the churn rate during the tests is very minimal, maybe 30 - 40 
> routes every second.
>
> Otherwise we got: salt-minion, collectd, node_exporter, sshd

collectd may be polling the interface stats; try turning that off?
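
For example (paths assuming the stock CentOS collectd package):

  grep -nE '^LoadPlugin (interface|ethstat|netlink)' /etc/collectd.conf /etc/collectd.d/*.conf
  systemctl stop collectd    # just for the duration of one test run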

>>
>> Another theory is the NIC HW does strange things, but it is not very
>> likely.  E.g. delaying the packets before generating the IRQ 
>> interrupt,
>> which hide it from my IRQ-to-softirq latency tool.
>>
>> A question: What traffic control qdisc are you using on your system?
>
> kernel 4+ uses pfifo, but there's no dropped packets
> I have also tested with fq_codel, same behaviour and also no weirdness 
> in the packets queue itself
>
> kernel 3.10 uses mq, and for the vlan interfaces noqueue

Do you mean that you only have a single pfifo qdisc on kernel 4+? Why is
it not using mq?

Was there anything in the ethtool stats?
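
Even just watching the raw counter diffs would do, e.g. (counter names
vary by driver, so the grep pattern is only a starting point):

  watch -d -n 1 'ethtool -S eth4 | egrep -i "err|drop|disc|fifo|pause"'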

-Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 11:45                     ` Toke Høiland-Jørgensen
@ 2020-11-06 12:01                       ` Thomas Rosenstein
  2020-11-06 12:53                       ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-06 12:01 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Jesper Dangaard Brouer, Bloat



On 6 Nov 2020, at 12:45, Toke Høiland-Jørgensen wrote:

> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
>
>> On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
>>
>>> On Fri, 06 Nov 2020 10:18:10 +0100
>>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>>>
>>>>>> I just tested 5.9.4 seems to also fix it partly, I have long
>>>>>> stretches where it looks good, and then some increases again. 
>>>>>> (3.10
>>>>>> Stock has them too, but not so high, rather 1-3 ms)
>>>>>>
>>>
>>> That you have long stretches where latency looks good is interesting
>>> information.   My theory is that your system have a periodic 
>>> userspace
>>> process that does a kernel syscall that takes too long, blocking
>>> network card from processing packets. (Note it can also be a kernel
>>> thread).
>>
>> The weird part is, I first only updated router-02 and pinged to
>> router-04 (out of traffic flow), there I noticed these long stretches 
>> of
>> ok ping.
>>
>> When I updated also router-03 and router-04, the old behaviour kind 
>> of
>> was back, this confused me.
>>
>> Could this be related to netlink? I have gobgpd running on these
>> routers, which injects routes via netlink.
>> But the churn rate during the tests is very minimal, maybe 30 - 40
>> routes every second.
>>
>> Otherwise we got: salt-minion, collectd, node_exporter, sshd
>
> collectd may be polling the interface stats; try turning that off?

I can, but shouldn't that also influence iperf3 performance then?

>
>>>
>>> Another theory is the NIC HW does strange things, but it is not very
>>> likely.  E.g. delaying the packets before generating the IRQ
>>> interrupt,
>>> which hide it from my IRQ-to-softirq latency tool.
>>>
>>> A question: What traffic control qdisc are you using on your system?
>>
>> kernel 4+ uses pfifo, but there's no dropped packets
>> I have also tested with fq_codel, same behaviour and also no 
>> weirdness
>> in the packets queue itself
>>
>> kernel 3.10 uses mq, and for the vlan interfaces noqueue
>
> Do you mean that you only have a single pfifo qdisc on kernel 4+? Why 
> is
> it not using mq?

oh, actually, I just noticed that's a remnant of the previous tests, I 
had

net.core.default_qdisc = fq_codel

in the sysctl.conf... so disregard my previous wrong info



so by default all kernels look like this, mq + pfifo_fast:

qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc pfifo_fast 0: dev eth0 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev eth1 root
qdisc pfifo_fast 0: dev eth1 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev eth2 root
qdisc pfifo_fast 0: dev eth2 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev eth3 root
qdisc pfifo_fast 0: dev eth3 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev eth4 root
qdisc pfifo_fast 0: dev eth4 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :9 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :e bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :10 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :11 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :12 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :13 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :14 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :15 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :16 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :17 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 parent :18 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc noqueue 0: dev eth4.2300 root refcnt 2
qdisc noqueue 0: dev eth4.2503 root refcnt 2


>
> Was there anything in the ethtool stats?

No, just the rx/tx counters increasing, not a single error.

>
> -Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 11:45                     ` Toke Høiland-Jørgensen
  2020-11-06 12:01                       ` Thomas Rosenstein
@ 2020-11-06 12:53                       ` Jesper Dangaard Brouer
  2020-11-06 14:13                         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-06 12:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Thomas Rosenstein, Bloat, brouer

On Fri, 06 Nov 2020 12:45:31 +0100
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> writes:
> 
> > On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
> >  
> >> On Fri, 06 Nov 2020 10:18:10 +0100
> >> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
> >>  
> >>>>> I just tested 5.9.4 seems to also fix it partly, I have long
> >>>>> stretches where it looks good, and then some increases again. (3.10
> >>>>> Stock has them too, but not so high, rather 1-3 ms)
> >>>>>  
> >>
> >> That you have long stretches where latency looks good is interesting
> >> information.   My theory is that your system have a periodic userspace
> >> process that does a kernel syscall that takes too long, blocking
> >> network card from processing packets. (Note it can also be a kernel
> >> thread).  
> >
[...]
> >
> > Could this be related to netlink? I have gobgpd running on these 
> > routers, which injects routes via netlink.
> > But the churn rate during the tests is very minimal, maybe 30 - 40 
> > routes every second.

Yes, this could be related.  The internal data structure for FIB
lookups is a fib_trie, which is a compressed patricia tree (related to
the radix tree idea).  Thus, I can imagine that the kernel has to
rebuild/rebalance the tree with all these updates.
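
(A rough way to sanity-check how much netlink route churn actually hits
the kernel - assumes iproute2 and coreutils; only a sketch:)

 # watch route add/del events live (with timestamps) while gobgpd runs
 ip -timestamp monitor route
 # or roughly count events over a minute
 timeout 60 ip monitor route | wc -l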

> >
> > Otherwise we got: salt-minion, collectd, node_exporter, sshd  
> 
> collectd may be polling the interface stats; try turning that off?

It should be fairly easy for you to test whether any of these
services (except sshd) is causing this, by turning them off
individually.
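
(A minimal sketch, using the service names mentioned above; run the
b2/ping test after each stop:)

 for svc in collectd node_exporter salt-minion; do
     systemctl stop $svc
     # ... reproduce the latency test here, note the result ...
     systemctl start $svc
 done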


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 12:53                       ` Jesper Dangaard Brouer
@ 2020-11-06 14:13                         ` Jesper Dangaard Brouer
  2020-11-06 17:04                           ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-06 14:13 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Thomas Rosenstein, Bloat, brouer, David Ahern

On Fri, 6 Nov 2020 13:53:58 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> [...]
> > >
> > > Could this be related to netlink? I have gobgpd running on these 
> > > routers, which injects routes via netlink.
> > > But the churn rate during the tests is very minimal, maybe 30 - 40 
> > > routes every second.  
> 
> Yes, this could be related.  The internal data-structure for FIB
> lookups is a fibtrie which is a compressed patricia tree, related to
> radix tree idea.  Thus, I can imagine that the kernel have to
> rebuild/rebalance the tree with all these updates.

Reading the kernel code: the IPv4 fib_trie code is very well tuned,
fully RCU-ified, meaning the read side is lock-free.  The resize()
function in net/ipv4/fib_trie.c has a max_work limiter to keep it from
using too much time, and the update path looks lock-free.

The IPv6 update looks more scary, as it seems to take a "bh" spinlock
that can block softirq from running code in net/ipv6/ip6_fib.c
(spin_lock_bh(&f6i->fib6_table->tb6_lock)).

Have you tried to use 'perf record' to observe what is happening on the
system while these latency incidents happen?  (Let me know if you want
some cmdline hints.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 14:13                         ` Jesper Dangaard Brouer
@ 2020-11-06 17:04                           ` Thomas Rosenstein
  2020-11-06 20:19                             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-06 17:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bloat



On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:

> On Fri, 6 Nov 2020 13:53:58 +0100
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
>> [...]
>>>>
>>>> Could this be related to netlink? I have gobgpd running on these
>>>> routers, which injects routes via netlink.
>>>> But the churn rate during the tests is very minimal, maybe 30 - 40
>>>> routes every second.
>>
>> Yes, this could be related.  The internal data-structure for FIB
>> lookups is a fibtrie which is a compressed patricia tree, related to
>> radix tree idea.  Thus, I can imagine that the kernel have to
>> rebuild/rebalance the tree with all these updates.
>
> Reading the kernel code. The IPv4 fib_trie code is very well tuned,
> fully RCU-ified, meaning read-side is lock-free.  The resize() 
> function
> code in net//ipv4/fib_trie.c have max_work limiter to avoid it uses 
> too
> much time.  And the update looks lockfree.
>
> The IPv6 update looks more scary, as it seems to take a "bh" spinlock
> that can block softirq from running code in net/ipv6/ip6_fib.c
> (spin_lock_bh(&f6i->fib6_table->tb6_lock).

I'm using ping on IPv4, but I'll try to see if IPv6 makes any 
difference!

>
> Have you tried to use 'perf record' to observe that is happening on 
> the system while these latency incidents happen?  (let me know if you 
> want some cmdline hints)

Haven't tried this yet. If you have some hints on which events to
monitor, I'll take them!

>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 17:04                           ` Thomas Rosenstein
@ 2020-11-06 20:19                             ` Jesper Dangaard Brouer
  2020-11-07 12:37                               ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-06 20:19 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bloat, brouer

On Fri, 06 Nov 2020 18:04:49 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:
> 
> > On Fri, 6 Nov 2020 13:53:58 +0100
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >  
> >> [...]  
> >>>>
> >>>> Could this be related to netlink? I have gobgpd running on these
> >>>> routers, which injects routes via netlink.
> >>>> But the churn rate during the tests is very minimal, maybe 30 - 40
> >>>> routes every second.  
> >>
> >> Yes, this could be related.  The internal data-structure for FIB
> >> lookups is a fibtrie which is a compressed patricia tree, related to
> >> radix tree idea.  Thus, I can imagine that the kernel have to
> >> rebuild/rebalance the tree with all these updates.  
> >
> > Reading the kernel code. The IPv4 fib_trie code is very well tuned,
> > fully RCU-ified, meaning read-side is lock-free.  The resize() 
> > function code in net//ipv4/fib_trie.c have max_work limiter to avoid it uses 
> > too much time.  And the update looks lockfree.
> >
> > The IPv6 update looks more scary, as it seems to take a "bh" spinlock
> > that can block softirq from running code in net/ipv6/ip6_fib.c
> > (spin_lock_bh(&f6i->fib6_table->tb6_lock).  
> 
> I'm using ping on IPv4, but I'll try to see if IPv6 makes any 
> difference!

I think you misunderstand me.  I'm not asking you to use ping6.  The
gobgpd daemon will update both IPv4 and IPv6 routes, right?
Updating IPv6 routes is more problematic than updating IPv4 routes.
The IPv6 route table update can potentially stall softirq from running,
which is what the latency tool was measuring... and it did show some outliers.


> > Have you tried to use 'perf record' to observe that is happening on 
> > the system while these latency incidents happen?  (let me know if you 
> > want some cmdline hints)  
> 
> Haven't tried this yet. If you have some hints what events to monitor 
> I'll take them!

Okay, to record everything on the system (-a) and save call-graphs (-g),
running for 5 seconds (by profiling the sleep command):

 # perf record -g -a  sleep 5

To view the result, simply use 'perf report'. You likely want the
option --no-children, as you are profiling the kernel (and not a
userspace program whose 'children' you want grouped).  I also include
the CPU column via '--sort cpu,comm,dso,symbol', and you can
select/zoom in on a specific CPU via '-C zero-indexed-cpu-num':

 # perf report --sort cpu,comm,dso,symbol --no-children

When we ask you to provide the output, you can use the --stdio option
and provide the text via a pastebin link, as it is very long.
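
(Putting that together, something like the following - the output file
name is just an example:)

 # perf report --sort cpu,comm,dso,symbol --no-children --stdio > report.txt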

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-06 20:19                             ` Jesper Dangaard Brouer
@ 2020-11-07 12:37                               ` Thomas Rosenstein
  2020-11-07 12:40                                 ` Jan Ceuleers
                                                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-07 12:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bloat



On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote:

> On Fri, 06 Nov 2020 18:04:49 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:
>>
>>
>> I'm using ping on IPv4, but I'll try to see if IPv6 makes any
>> difference!
>
> I think you misunderstand me.  I'm not asking you to use ping6. The
> gobgpd daemon updates will both update IPv4 and IPv6 routes, right.
> Updating IPv6 routes are more problematic than IPv4 routes.  The IPv6
> route tables update can potentially stall softirq from running, which
> was the latency tool was measuring... and it did show some outliers.

Yes I did; I assumed the latency would be introduced into the traffic
path by the lock.
Nonetheless, I tested it and there is no difference :)

>
>
>>> Have you tried to use 'perf record' to observe that is happening on
>>> the system while these latency incidents happen?  (let me know if 
>>> you
>>> want some cmdline hints)
>>
>> Haven't tried this yet. If you have some hints what events to monitor
>> I'll take them!
>
> Okay to record everything (-a) on the system and save call-graph (-g),
> and run for 5 seconds (via profiling the sleep function).
>
>  # perf record -g -a  sleep 5
>
> To view the result the simply use the 'perf report', but likely you
> want to use option --no-children as you are profiling the kernel (and
> not a userspace program you want to have grouped 'children' by).  I
> also include the CPU column via '--sort cpu,comm,dso,symbol' and you
> can select/zoom-in-on a specific CPU via '-C zero-indexed-cpu-num'.
>
>  # perf report --sort cpu,comm,dso,symbol --no-children
>
> When we ask you to provide the output, you can use the --stdio option,
> and provide txt-info via a pastebin link as it is very long.

Here is the output from kernel 3.10_1127 (I updated to the latest in
that branch):  https://pastebin.com/5mxirXPw
Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F

I have noticed that the delays are directly related to the traffic 
flows, see below.

These tests are WITHOUT gobgpd running, so no updates to the route 
table, but the route tables are fully populated.
Also, it's ONLY outgoing traffic, the return packets are coming in on 
another router.

I have since cleared the routing tables, and the issue persists; the
table has only 78 entries.

40 threads -> sometimes higher rtt times: https://pastebin.com/Y9nd0h4h
60 threads -> always high rtt times: https://pastebin.com/JFvhtLrH

So it definitely gets worse the more connections there are.

I have also tried to reproduce the issue with the same kernel on a
virtual Hyper-V machine; there I don't see any adverse effects.
But it's not 100% the same setup, since MASQUERADE happens on it .. I
will restructure a bit to get a more similar representation.

I also suspected that -j NOTRACK could be an issue; removed that too,
no change. (It's asymmetric routing anyway.)

Additionally I have stopped all applications except for sshd; no change!



>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 12:37                               ` Thomas Rosenstein
@ 2020-11-07 12:40                                 ` Jan Ceuleers
  2020-11-07 12:43                                   ` Thomas Rosenstein
  2020-11-07 13:00                                   ` Thomas Rosenstein
  2020-11-07 13:33                                 ` Thomas Rosenstein
  2020-11-07 16:46                                 ` Jesper Dangaard Brouer
  2 siblings, 2 replies; 47+ messages in thread
From: Jan Ceuleers @ 2020-11-07 12:40 UTC (permalink / raw)
  To: bloat

On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote:
> Nonetheless, I tested it and no difference :)

The relevant test would be to run your experiment while IPv6 is
completely disabled in the kernel.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 12:40                                 ` Jan Ceuleers
@ 2020-11-07 12:43                                   ` Thomas Rosenstein
  2020-11-07 13:00                                   ` Thomas Rosenstein
  1 sibling, 0 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-07 12:43 UTC (permalink / raw)
  To: Jan Ceuleers; +Cc: bloat

On 7 Nov 2020, at 13:40, Jan Ceuleers wrote:

> On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote:
>> Nonetheless, I tested it and no difference :)
>
> The relevant test would be to run your experiment while IPv6 is
> completely disabled in the kernel.

sure, I'll try!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 12:40                                 ` Jan Ceuleers
  2020-11-07 12:43                                   ` Thomas Rosenstein
@ 2020-11-07 13:00                                   ` Thomas Rosenstein
  2020-11-09  8:24                                     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-07 13:00 UTC (permalink / raw)
  To: Jan Ceuleers; +Cc: bloat



On 7 Nov 2020, at 13:40, Jan Ceuleers wrote:

> On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote:
>> Nonetheless, I tested it and no difference :)
>
> The relevant test would be to run your experiment while IPv6 is
> completely disabled in the kernel.

Tested with IPv6 fully off (via ipv6.disable=1), no difference!

Here's an extract from the ethtool stats https://pastebin.com/cabpWGFz
just in case there's something hidden.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 12:37                               ` Thomas Rosenstein
  2020-11-07 12:40                                 ` Jan Ceuleers
@ 2020-11-07 13:33                                 ` Thomas Rosenstein
  2020-11-07 16:46                                 ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-07 13:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bloat


On 7 Nov 2020, at 13:37, Thomas Rosenstein wrote:
>
> I have also tried to reproduce the issue with the kernel on a virtual 
> hyper-v machine, there I don't have any adverse effects.
> But it's not 100% the same, since MASQ happens on it .. will 
> restructure a bit to get a similar representation
>

I have now reproduced something similar: for the VM it only happens at
120 threads, but every 5 seconds or so there is one packet that is
delayed heavily:

PING 10.10.0.69 (10.10.0.69) 56(84) bytes of data.
64 bytes from 10.10.0.69: icmp_seq=1 ttl=63 time=0.273 ms
64 bytes from 10.10.0.69: icmp_seq=2 ttl=63 time=0.260 ms
64 bytes from 10.10.0.69: icmp_seq=3 ttl=63 time=0.621 ms
64 bytes from 10.10.0.69: icmp_seq=4 ttl=63 time=0.218 ms
64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=52.7 ms
64 bytes from 10.10.0.69: icmp_seq=6 ttl=63 time=0.261 ms
64 bytes from 10.10.0.69: icmp_seq=7 ttl=63 time=0.381 ms
64 bytes from 10.10.0.69: icmp_seq=8 ttl=63 time=0.230 ms
64 bytes from 10.10.0.69: icmp_seq=9 ttl=63 time=0.378 ms
64 bytes from 10.10.0.69: icmp_seq=10 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=11 ttl=63 time=0.313 ms
64 bytes from 10.10.0.69: icmp_seq=12 ttl=63 time=0.395 ms
64 bytes from 10.10.0.69: icmp_seq=13 ttl=63 time=0.269 ms
64 bytes from 10.10.0.69: icmp_seq=14 ttl=63 time=44.1 ms
64 bytes from 10.10.0.69: icmp_seq=15 ttl=63 time=0.279 ms
64 bytes from 10.10.0.69: icmp_seq=16 ttl=63 time=0.406 ms
64 bytes from 10.10.0.69: icmp_seq=17 ttl=63 time=0.245 ms
64 bytes from 10.10.0.69: icmp_seq=18 ttl=63 time=0.202 ms
64 bytes from 10.10.0.69: icmp_seq=19 ttl=63 time=0.229 ms
64 bytes from 10.10.0.69: icmp_seq=20 ttl=63 time=0.357 ms
64 bytes from 10.10.0.69: icmp_seq=21 ttl=63 time=0.462 ms
64 bytes from 10.10.0.69: icmp_seq=22 ttl=63 time=0.300 ms
64 bytes from 10.10.0.69: icmp_seq=23 ttl=63 time=60.3 ms
64 bytes from 10.10.0.69: icmp_seq=24 ttl=63 time=0.200 ms
64 bytes from 10.10.0.69: icmp_seq=25 ttl=63 time=15.8 ms
64 bytes from 10.10.0.69: icmp_seq=26 ttl=63 time=0.296 ms
64 bytes from 10.10.0.69: icmp_seq=27 ttl=63 time=0.273 ms
64 bytes from 10.10.0.69: icmp_seq=28 ttl=63 time=0.591 ms
64 bytes from 10.10.0.69: icmp_seq=29 ttl=63 time=0.583 ms

During that time the interface is loaded with 40 Mbit/s.

This does not appear with the 3.10 kernel:

PING 10.10.0.69 (10.10.0.69) 56(84) bytes of data.
64 bytes from 10.10.0.69: icmp_seq=1 ttl=63 time=0.322 ms
64 bytes from 10.10.0.69: icmp_seq=2 ttl=63 time=0.242 ms
64 bytes from 10.10.0.69: icmp_seq=3 ttl=63 time=0.461 ms
64 bytes from 10.10.0.69: icmp_seq=4 ttl=63 time=0.244 ms
64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=0.429 ms
64 bytes from 10.10.0.69: icmp_seq=6 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=7 ttl=63 time=0.540 ms
64 bytes from 10.10.0.69: icmp_seq=8 ttl=63 time=0.465 ms
64 bytes from 10.10.0.69: icmp_seq=9 ttl=63 time=0.237 ms
64 bytes from 10.10.0.69: icmp_seq=10 ttl=63 time=0.216 ms
64 bytes from 10.10.0.69: icmp_seq=11 ttl=63 time=0.248 ms
64 bytes from 10.10.0.69: icmp_seq=12 ttl=63 time=0.247 ms
64 bytes from 10.10.0.69: icmp_seq=13 ttl=63 time=0.282 ms
64 bytes from 10.10.0.69: icmp_seq=14 ttl=63 time=0.256 ms
64 bytes from 10.10.0.69: icmp_seq=15 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=16 ttl=63 time=0.272 ms
64 bytes from 10.10.0.69: icmp_seq=17 ttl=63 time=0.236 ms
64 bytes from 10.10.0.69: icmp_seq=18 ttl=63 time=0.287 ms
64 bytes from 10.10.0.69: icmp_seq=19 ttl=63 time=0.288 ms
64 bytes from 10.10.0.69: icmp_seq=20 ttl=63 time=0.297 ms
64 bytes from 10.10.0.69: icmp_seq=21 ttl=63 time=0.319 ms
64 bytes from 10.10.0.69: icmp_seq=22 ttl=63 time=0.294 ms
64 bytes from 10.10.0.69: icmp_seq=23 ttl=63 time=0.313 ms
64 bytes from 10.10.0.69: icmp_seq=24 ttl=63 time=0.208 ms
64 bytes from 10.10.0.69: icmp_seq=25 ttl=63 time=0.222 ms
64 bytes from 10.10.0.69: icmp_seq=26 ttl=63 time=0.279 ms
64 bytes from 10.10.0.69: icmp_seq=27 ttl=63 time=0.257 ms
64 bytes from 10.10.0.69: icmp_seq=28 ttl=63 time=0.266 ms
64 bytes from 10.10.0.69: icmp_seq=29 ttl=63 time=0.398 ms
64 bytes from 10.10.0.69: icmp_seq=30 ttl=63 time=0.281 ms
64 bytes from 10.10.0.69: icmp_seq=31 ttl=63 time=0.263 ms
64 bytes from 10.10.0.69: icmp_seq=32 ttl=63 time=0.276 ms

Might be a hint that it's actually reproducible.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 12:37                               ` Thomas Rosenstein
  2020-11-07 12:40                                 ` Jan Ceuleers
  2020-11-07 13:33                                 ` Thomas Rosenstein
@ 2020-11-07 16:46                                 ` Jesper Dangaard Brouer
  2020-11-07 17:01                                   ` Thomas Rosenstein
  2 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-07 16:46 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bloat, brouer

On Sat, 07 Nov 2020 13:37:01 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote:
> 
> > On Fri, 06 Nov 2020 18:04:49 +0100
> > "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
> >  
> >> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:
> >>
> >>
[...]
> >  
> >>> Have you tried to use 'perf record' to observe that is happening on
> >>> the system while these latency incidents happen?  (let me know if 
> >>> you want some cmdline hints)  
> >>
> >> Haven't tried this yet. If you have some hints what events to monitor
> >> I'll take them!  
> >
> > Okay to record everything (-a) on the system and save call-graph (-g),
> > and run for 5 seconds (via profiling the sleep function).
> >
> >  # perf record -g -a  sleep 5
> >
> > To view the result the simply use the 'perf report', but likely you
> > want to use option --no-children as you are profiling the kernel (and
> > not a userspace program you want to have grouped 'children' by).  I
> > also include the CPU column via '--sort cpu,comm,dso,symbol' and you
> > can select/zoom-in-on a specific CPU via '-C zero-indexed-cpu-num'.
> >
> >  # perf report --sort cpu,comm,dso,symbol --no-children
> >
> > When we ask you to provide the output, you can use the --stdio option,
> > and provide txt-info via a pastebin link as it is very long.  
> 
> Here is the output from kernel 3.10_1127 (I updated to the really newest 
> in that branch):  https://pastebin.com/5mxirXPw
> Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F

Did the latency issue happen during this perf recording?
(It is obviously important that you record during the issue.)

Finding a single latency event/issue in perf.data is like looking for a
needle in a haystack.  It does give hints, though, and e.g. tells me
details like which parts of the kernel are used/activated.

I can see that you have likely configured XPS (Transmit Packet
Steering). Is this on purpose?
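
(XPS is visible in sysfs if you want to check - a non-zero mask means
it is configured for that TX queue; eth4 used as an example:)

 grep . /sys/class/net/eth4/queues/tx-*/xps_cpus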

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 16:46                                 ` Jesper Dangaard Brouer
@ 2020-11-07 17:01                                   ` Thomas Rosenstein
  2020-11-07 17:26                                     ` Sebastian Moeller
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-07 17:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bloat



On 7 Nov 2020, at 17:46, Jesper Dangaard Brouer wrote:

>
> Did the latency issue happen during this this perf recording?
> (it is obviously important that you record during the issue)

Yes, it was recorded during the transfer, while it exhibited the issue.

I let the transfer run for some time, then did the perf capture, and
stopped the transfer afterwards.

>
> Finding a single latency event/issue in perf.data is like looking in
> for needle in a haystack.  It does give hints and e.g. tell me details
> like what part of the kernel are used/activated.
>
> I can see that you have likely configured XPS (Transmit Packet
> Steering). Is this on purpose?

Not on purpose, I assume it comes from the Mellanox Card.

>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 17:01                                   ` Thomas Rosenstein
@ 2020-11-07 17:26                                     ` Sebastian Moeller
  0 siblings, 0 replies; 47+ messages in thread
From: Sebastian Moeller @ 2020-11-07 17:26 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Jesper Dangaard Brouer, Bloat

Hi Thomas,

Probably not very helpful, but iputils' ping has the -D option that is described as:
      -D
           Print timestamp (unix time + microseconds as in gettimeofday) before each line.

This could potentially be translated into kernel time to pinpoint events in time? Not that I would know how to do the translation here, or whether perf and friends actually store timestamps...
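
A made-up example of what that looks like (timestamps and addresses are
purely illustrative):

 $ ping -D 10.10.0.69
 [1604771465.123456] 64 bytes from 10.10.0.69: icmp_seq=1 ttl=63 time=0.273 ms
 [1604771466.124789] 64 bytes from 10.10.0.69: icmp_seq=2 ttl=63 time=52.7 ms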

Best Regards
	Sebastian


> On Nov 7, 2020, at 18:01, Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
> 
> 
> 
> On 7 Nov 2020, at 17:46, Jesper Dangaard Brouer wrote:
> 
>> 
>> Did the latency issue happen during this this perf recording?
>> (it is obviously important that you record during the issue)
> 
> yes it was recorded during the transfer, while it exhibited the issue
> 
> I let the transfer run some time, then did the perf extract, stopped the transfer afterwards.
> 
>> 
>> Finding a single latency event/issue in perf.data is like looking in
>> for needle in a haystack.  It does give hints and e.g. tell me details
>> like what part of the kernel are used/activated.
>> 
>> I can see that you have likely configured XPS (Transmit Packet
>> Steering). Is this on purpose?
> 
> Not on purpose, I assume it comes from the Mellanox Card.
> 
>> 
>> -- 
>> Best regards,
>>  Jesper Dangaard Brouer
>>  MSc.CS, Principal Kernel Engineer at Red Hat
>>  LinkedIn: http://www.linkedin.com/in/brouer
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-07 13:00                                   ` Thomas Rosenstein
@ 2020-11-09  8:24                                     ` Jesper Dangaard Brouer
  2020-11-09 10:09                                       ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-09  8:24 UTC (permalink / raw)
  To: Thomas Rosenstein via Bloat
  Cc: brouer, Thomas Rosenstein, Jan Ceuleers, Saeed Mahameed,
	Tariq Toukan, kheib

On Sat, 07 Nov 2020 14:00:04 +0100
Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
 
> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in 
> case there's something hidden.

Yes, there is something hiding in the data from ethtool_stats.pl[1]:
(10G Mellanox Connect-X cards via 10G SPF+ DAC)

 stat:            1 (          1) <= outbound_pci_stalled_wr_events /sec
 stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec

I've not seen the counter 'rx_buffer_passed_thres_phy' before; looking
at the kernel driver code, it is related to "rx_buffer_almost_full".
The number per second is excessive (but it may be related to a driver
bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
extended counters).

 stat:     29583661 ( 29,583,661) <= rx_bytes /sec
 stat:     30343677 ( 30,343,677) <= rx_bytes_phy /sec

You are receiving at 236 Mbit/s on a 10 Gbit/s link.  There is a
difference between what the OS sees (rx_bytes) and what the NIC
hardware sees (rx_bytes_phy) (diff approx. 6 Mbit/s).

 stat:        19552 (     19,552) <= rx_packets /sec
 stat:        19950 (     19,950) <= rx_packets_phy /sec

The RX packet counters above also indicate that the HW is seeing more
packets than the OS is receiving.

The next counters are likely your problem:

 stat:          718 (        718) <= tx_global_pause /sec
 stat:       954035 (    954,035) <= tx_global_pause_duration /sec
 stat:          714 (        714) <= tx_pause_ctrl_phy /sec

It looks like you have enabled Ethernet flow control, and something is
causing pause frames to be generated.  It seems strange that this
happens on a 10 Gbit/s link carrying only 236 Mbit/s.

The TX byte counters are also very strange:

 stat:        26063 (     26,063) <= tx_bytes /sec
 stat:        71950 (     71,950) <= tx_bytes_phy /sec

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

Strange size distribution:
 stat:     19922 (     19,922) <= rx_1519_to_2047_bytes_phy /sec
 stat:        14 (         14) <= rx_65_to_127_bytes_phy /sec


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-09  8:24                                     ` Jesper Dangaard Brouer
@ 2020-11-09 10:09                                       ` Thomas Rosenstein
  2020-11-09 11:40                                         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-09 10:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Thomas Rosenstein via Bloat



On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:

> On Sat, 07 Nov 2020 14:00:04 +0100
> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>
>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just 
>> in
>> case there's something hidden.
>
> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
> (10G Mellanox Connect-X cards via 10G SPF+ DAC)
>
>  stat:            1 (          1) <= outbound_pci_stalled_wr_events 
> /sec
>  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>
> I've not seen this counter 'rx_buffer_passed_thres_phy' before, 
> looking
> in the kernel driver code it is related to "rx_buffer_almost_full".
> The numbers per second is excessive (but it be related to a driver bug
> as it ends up reading "high" -> rx_buffer_almost_full_high in the
> extended counters).
>
>  stat:     29583661 ( 29,583,661) <= rx_bytes /sec
>  stat:     30343677 ( 30,343,677) <= rx_bytes_phy /sec
>
> You are receiving with 236 Mbit/s in 10Gbit/s link.  There is a
> difference between what the OS sees (rx_bytes) and what the NIC
> hardware sees (rx_bytes_phy) (diff approx 6Mbit/s).
>
>  stat:        19552 (     19,552) <= rx_packets /sec
>  stat:        19950 (     19,950) <= rx_packets_phy /sec

Could these packets be from VLAN interfaces that are not used in the OS?

>
> Above RX packet counters also indicated HW is seeing more packets that
> OS is receiving.
>
> Next counters is likely your problem:
>
>  stat:          718 (        718) <= tx_global_pause /sec
>  stat:       954035 (    954,035) <= tx_global_pause_duration /sec
>  stat:          714 (        714) <= tx_pause_ctrl_phy /sec

As far as I can see that's only the TX, and we are only doing RX on this 
interface - so maybe that's irrelevant?

>
> It looks like you have enabled Ethernet Flow-Control, and something is
> causing pause frames to be generated.  It seem strange that this 
> happen
> on a 10Gbit/s link with only 236 Mbit/s.
>
> The TX byte counters are also very strange:
>
>  stat:        26063 (     26,063) <= tx_bytes /sec
>  stat:        71950 (     71,950) <= tx_bytes_phy /sec

Also, that's TX, and we are only doing RX; as I said somewhere already,
it's asymmetric routing, so the TX data comes back via another router.

>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [1] 
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>
> Strange size distribution:
>  stat:     19922 (     19,922) <= rx_1519_to_2047_bytes_phy /sec
>  stat:        14 (         14) <= rx_65_to_127_bytes_phy /sec

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-09 10:09                                       ` Thomas Rosenstein
@ 2020-11-09 11:40                                         ` Jesper Dangaard Brouer
  2020-11-09 11:51                                           ` Toke Høiland-Jørgensen
                                                             ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-09 11:40 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Thomas Rosenstein via Bloat, brouer

On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
> 
> > On Sat, 07 Nov 2020 14:00:04 +0100
> > Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
> >  
> >> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just 
> >> in
> >> case there's something hidden.  
> >
> > Yes, there is something hiding in the data from ethtool_stats.pl[1]:
> > (10G Mellanox Connect-X cards via 10G SPF+ DAC)
> >
> >  stat:            1 (          1) <= outbound_pci_stalled_wr_events /sec
> >  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
> >
> > I've not seen this counter 'rx_buffer_passed_thres_phy' before, looking
> > in the kernel driver code it is related to "rx_buffer_almost_full".
> > The numbers per second is excessive (but it be related to a driver bug
> > as it ends up reading "high" -> rx_buffer_almost_full_high in the
> > extended counters).

Notice this indication is a strong red flag that something is wrong.

> >  stat:     29583661 ( 29,583,661) <= rx_bytes /sec
> >  stat:     30343677 ( 30,343,677) <= rx_bytes_phy /sec
> >
> > You are receiving with 236 Mbit/s in 10Gbit/s link.  There is a
> > difference between what the OS sees (rx_bytes) and what the NIC
> > hardware sees (rx_bytes_phy) (diff approx 6Mbit/s).
> >
> >  stat:        19552 (     19,552) <= rx_packets /sec
> >  stat:        19950 (     19,950) <= rx_packets_phy /sec  
> 
> Could these packets be from VLAN interfaces that are not used in the OS?
> 
> >
> > Above RX packet counters also indicated HW is seeing more packets that
> > OS is receiving.
> >
> > Next counters is likely your problem:
> >
> >  stat:          718 (        718) <= tx_global_pause /sec
> >  stat:       954035 (    954,035) <= tx_global_pause_duration /sec
> >  stat:          714 (        714) <= tx_pause_ctrl_phy /sec  
> 
> As far as I can see that's only the TX, and we are only doing RX on this 
> interface - so maybe that's irrelevant?
> 
> >
> > It looks like you have enabled Ethernet Flow-Control, and something is
> > causing pause frames to be generated.  It seem strange that this 
> > happen on a 10Gbit/s link with only 236 Mbit/s.
> >
> > The TX byte counters are also very strange:
> >
> >  stat:        26063 (     26,063) <= tx_bytes /sec
> >  stat:        71950 (     71,950) <= tx_bytes_phy /sec  
> 
> Also, it's TX, and we are only doing RX, as I said already somewhere, 
> it's async routing, so the TX data comes via another router back.

Okay, but as this is a router you also need to transmit this
(asymmetric) traffic out another interface, right?

Could you also provide ethtool_stats for the TX interface?

Notice that the tool[1] ethtool_stats.pl supports monitoring several
interfaces at the same time, e.g. run:

 ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX

and provide the output as a pastebin.


> > [1] 
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >
> > Strange size distribution:
> >  stat:     19922 (     19,922) <= rx_1519_to_2047_bytes_phy /sec
> >  stat:        14 (         14) <= rx_65_to_127_bytes_phy /sec  
> 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-09 11:40                                         ` Jesper Dangaard Brouer
@ 2020-11-09 11:51                                           ` Toke Høiland-Jørgensen
  2020-11-09 12:25                                           ` Thomas Rosenstein
                                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-09 11:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Thomas Rosenstein; +Cc: Thomas Rosenstein via Bloat

>> Also, it's TX, and we are only doing RX, as I said already somewhere, 
>> it's async routing, so the TX data comes via another router back.
>
> Okay, but as this is a router you also need to transmit this
> (asymmetric) traffic out another interface right.
>
> Could you also provide ethtool_stats for the TX interface?
>
> Notice that the tool[1] ethtool_stats.pl support monitoring several
> interfaces at the same time, e.g. run:
>
>  ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX
>
> And provide output as pastebin.

Also, from your drawing, everything goes through the same switch, right?
And since pause frames are involved... Maybe it's the switch being
overwhelmed?

In general, I find that pause frames give more problems than they solve,
so I always turn them off...
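
(For reference, a quick way to check and turn them off - eth4/eth5 as
used in this thread; only a sketch:)

 # current flow-control settings and pause counters
 ethtool -a eth4
 ethtool -S eth4 | grep -i pause
 # disable pause frames
 ethtool -A eth4 rx off tx off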

-Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-09 11:40                                         ` Jesper Dangaard Brouer
  2020-11-09 11:51                                           ` Toke Høiland-Jørgensen
@ 2020-11-09 12:25                                           ` Thomas Rosenstein
  2020-11-09 14:33                                           ` Thomas Rosenstein
  2020-11-09 16:39                                           ` Thomas Rosenstein
  3 siblings, 0 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-09 12:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Thomas Rosenstein via Bloat



On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 11:09:33 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>
>>> On Sat, 07 Nov 2020 14:00:04 +0100
>>> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>>>
>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz 
>>>> just
>>>> in
>>>> case there's something hidden.
>>>
>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>> (10G Mellanox Connect-X cards via 10G SPF+ DAC)
>>>
>>>  stat:            1 (          1) <= outbound_pci_stalled_wr_events 
>>> /sec
>>>  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>
>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before, 
>>> looking
>>> in the kernel driver code it is related to "rx_buffer_almost_full".
>>> The numbers per second is excessive (but it be related to a driver 
>>> bug
>>> as it ends up reading "high" -> rx_buffer_almost_full_high in the
>>> extended counters).
>
> Notice this indication is a strong red-flag that something is wrong.
>
>
> Okay, but as this is a router you also need to transmit this
> (asymmetric) traffic out another interface right.

The asymmetric traffic comes back on another router: this is router-02,
and traffic from the internet comes back on router-01.
I have also added the interface names.

See the updated diagram: 
https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

>
> Could you also provide ethtool_stats for the TX interface?
>
> Notice that the tool[1] ethtool_stats.pl support monitoring several
> interfaces at the same time, e.g. run:
>
>  ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX
>
> And provide output as pastebin.

I have disabled pause control, as Toke suggested, via:

ethtool -A eth4 autoneg off rx off tx off
ethtool -A eth5 autoneg off rx off tx off

Afterwards, an ethtool capture: first "without" traffic for a few
seconds, then with the problematic flow.

Since the output is > 512 KB I had to upload it to Google Drive:

https://drive.google.com/file/d/1EVKt1LseaBuD40QE-SqFvqYSeWUEcGA_/view?usp=sharing

>
>
>>> [1]
>>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>
>>> Strange size distribution:
>>>  stat:     19922 (     19,922) <= rx_1519_to_2047_bytes_phy /sec
>>>  stat:        14 (         14) <= rx_65_to_127_bytes_phy /sec

I assume that's because of the VLAN tagging, and therefore 1522 bytes
per packet with an MTU of 1500?


>>
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-09 11:40                                         ` Jesper Dangaard Brouer
  2020-11-09 11:51                                           ` Toke Høiland-Jørgensen
  2020-11-09 12:25                                           ` Thomas Rosenstein
@ 2020-11-09 14:33                                           ` Thomas Rosenstein
  2020-11-12 10:05                                             ` Jesper Dangaard Brouer
  2020-11-09 16:39                                           ` Thomas Rosenstein
  3 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-09 14:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Thomas Rosenstein via Bloat



On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 11:09:33 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>
>>> On Sat, 07 Nov 2020 14:00:04 +0100
>>> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>>>
>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz 
>>>> just
>>>> in
>>>> case there's something hidden.
>>>
>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>> (10G Mellanox Connect-X cards via 10G SPF+ DAC)
>>>
>>>  stat:            1 (          1) <= outbound_pci_stalled_wr_events 
>>> /sec
>>>  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>
>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before, 
>>> looking
>>> in the kernel driver code it is related to "rx_buffer_almost_full".
>>> The numbers per second is excessive (but it be related to a driver 
>>> bug
>>> as it ends up reading "high" -> rx_buffer_almost_full_high in the
>>> extended counters).

I have now tested with a new kernel 5.9.4 build, configured from the
3.10 config with make oldconfig, and I noticed an interesting effect.

For the first ca. 2 minutes the router behaves completely normally, as
with 3.10; after that the ping times go crazy.

I have recorded this with ethtool, and also the ping times.

Ethtool: (13 MB)
https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing

The transfer was first doing around 50 - 70 MB/s; once the ping times
got worse it dropped to ~12 MB/s.
At ca. line 74324 the transfer speed drops to 12 MB/s.

Seems you are right about rx_buffer_passed_thres_phy: if you check
just those lines, they appear more often once the speed dropped.
Not sure if that's the cause or an effect of the underlying problem!

Pings:
https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing

Pause frames were activated again after the restart.

(Here is a link for reference on the ethtool variables: 
https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-09 11:40                                         ` Jesper Dangaard Brouer
                                                             ` (2 preceding siblings ...)
  2020-11-09 14:33                                           ` Thomas Rosenstein
@ 2020-11-09 16:39                                           ` Thomas Rosenstein
  3 siblings, 0 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-09 16:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Thomas Rosenstein via Bloat




On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 11:09:33 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
> Could you also provide ethtool_stats for the TX interface?
>
> Notice that the tool[1] ethtool_stats.pl support monitoring several
> interfaces at the same time, e.g. run:
>
>  ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX
>
> And provide output as pastebin.

I have now also repeated the same test with 3.10, here are the ethtool 
outputs:

https://drive.google.com/file/d/1c98MVV0JYl6Su6xZTpqwS7m-6OlbmAFp/view?usp=sharing

and the ping times:

https://drive.google.com/file/d/1xhbGJHb5jUbPsee4frbx-c-uqh-7orXY/view?usp=sharing

Sadly the parameters we were looking at are not supported below 4.14.

But I immediately saw one thing that is very different:

ethtool --statistics eth4 | grep discards
      rx_discards_phy: 0
      tx_discards_phy: 0

If we check the ethtool output from 5.9.4, we have:

      rx_discards_phy: 151793

The outbound_pci_stalled_wr_events also get more frequent the lower the
total bandwidth / the higher the ping is.
Logically there must be something blocking the buffers: either they are
not getting freed, not rotated correctly, or processing is too slow.
I would exclude processing, simply based on the 0% CPU load, and also
because it doesn't happen on 3.10.
It is also suspicious that the issue only appears after a certain time
of activity (maybe total traffic?!).




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-09 14:33                                           ` Thomas Rosenstein
@ 2020-11-12 10:05                                             ` Jesper Dangaard Brouer
  2020-11-12 11:26                                               ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-12 10:05 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bufferbloat, brouer, Saeed Mahameed, Tariq Toukan

On Mon, 09 Nov 2020 15:33:46 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:
> 
> > On Mon, 09 Nov 2020 11:09:33 +0100
> > "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
> >  
> >> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
> >>  
> >>> On Sat, 07 Nov 2020 14:00:04 +0100
> >>> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
> >>>  
> >>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz 
> >>>> just
> >>>> in
> >>>> case there's something hidden.  
> >>>
> >>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
> >>> (10G Mellanox Connect-X cards via 10G SPF+ DAC)
> >>>
> >>>  stat:            1 (          1) <= outbound_pci_stalled_wr_events /sec
> >>>  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
> >>>
> >>> I've not seen this counter 'rx_buffer_passed_thres_phy' before, 
> >>> looking
> >>> in the kernel driver code it is related to "rx_buffer_almost_full".
> >>> The numbers per second is excessive (but it be related to a driver 
> >>> bug
> >>> as it ends up reading "high" -> rx_buffer_almost_full_high in the
> >>> extended counters).  
> 
> I have now tested with a new kernel 5.9.4 build made from 3.10 with make 
> oldconfig and I noticed an interesting effect.
> 
> The first ca. 2 minutes the router behaves completely normal as with 
> 3.10, after that the ping times go crazy.
> 
> I have recorded this with ethtool, and also the ping times.
> 
> Ethtool: (13 MB)
> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing
> 
> The transfer first was doing around 50 - 70 MB/s then once the ping 
> times go worse it dropped to ~12 MB/s.
> ca. Line 74324 the transfer speed drops to 12 MB/s

Let's add them here so we can see and talk about them:

 74324 Ethtool(eth4 ) stat:    144584002 (    144,584,002)  <= rx_buffer_passed_thres_phy /sec 
 74325 Ethtool(eth4 ) stat:     17243186 (     17,243,186) <= rx_bytes /sec 
 74326 Ethtool(eth4 ) stat:     19032079 (     19,032,079) <= rx_bytes_phy /sec 

Significant difference between RX-bytes Phy-level and OS-level.

17243186*8/10^6 = 138 Mbit/s

 74327 Ethtool(eth4 ) stat:          321 (            321) <= rx_cache_busy /sec 
 74328 Ethtool(eth4 ) stat:          321 (            321) <= rx_cache_full /sec 
 74329 Ethtool(eth4 ) stat:         7323 (          7,323) <= rx_cache_reuse /sec 
 74330 Ethtool(eth4 ) stat:        15188 (         15,188) <= rx_csum_complete /sec 
 74331 Ethtool(eth4 ) stat:            1 (              1) <= rx_csum_none /sec 
 74332 Ethtool(eth4 ) stat:           96 (             96) <= rx_csum_unnecessary /sec 
 74333 Ethtool(eth4 ) stat:         1242 (          1,242) <= rx_discards_phy /sec 

Packets are getting rx_discards

 74334 Ethtool(eth4 ) stat:        15285 (         15,285) <= rx_packets /sec 
 74335 Ethtool(eth4 ) stat:        16655 (         16,655) <= rx_packets_phy /sec 

(16655-15285 = 1370)
Again difference between RX-packets Phy-level and OS-level

 74336 Ethtool(eth4 ) stat:     19132497 (     19,132,497) <= rx_prio0_bytes /sec 
 74337 Ethtool(eth4 ) stat:         1264 (          1,264) <= rx_prio0_discards /sec 
 74338 Ethtool(eth4 ) stat:        15474 (         15,474) <= rx_prio0_packets /sec 



> Seems you are right about the rx_buffer_passed_thres_phy if you check 
> just those lines they appear more often once the speed dropped.
> Not sure if that's the cause or an effect of the underlying problem!

Thanks for the link to Mellanox doc:
 https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters

The counter: 'rx_buffer_passed_thres_phy'
 Description: 
  'The number of events where the port receive buffer was over 85% full.
   Supported from kernel 4.14'

The counter: 'tx_pause_ctrl_phy'
 Description:
 'The number of link layer pause packets transmitted on a physical
  port. If this counter is increasing, it implies that the NIC is
  congested and cannot absorb the traffic coming from the network.'


So, the NIC is clearly telling us that something is wrong. You are
nowhere near link speed (12 MB/s => 96 Mbit/s), and still the port
receive buffer is over 85% full most of the time.  There are
indications that the NIC is congested and cannot absorb the traffic
coming from the network.

Long-shot theory (1): Could this be caused by CPUs going into too-deep
sleep states, with the NIC not sending the wake-up signal early enough,
e.g. because it holds on to packets to generate a bulk before raising
the IRQ?

Long-shot theory (2): The NIC driver's adaptive IRQ coalescing
(adjustable via ethtool -C) is doing the wrong thing.  The reason would
be TSO frames, because the packet-to-byte ratio isn't taken into
account.

You can test theory (2) in two ways: by disabling TSO, and by adjusting
ethtool --show-coalesce|--coalesce (-c|-C).  The --coalesce change also
affects part of theory (1), the NIC wake-up/IRQ interval.  For CPU
sleep modes, see the software package 'tuned'.
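
(A rough sketch of those tests - the -C values are only illustrative,
check the 'ethtool -c' output first; eth4 used as an example:)

 # theory(2): take TSO (and GSO) out of the picture
 ethtool -K eth4 tso off gso off
 # inspect current coalescing, then pin it instead of adaptive mode
 ethtool -c eth4
 ethtool -C eth4 adaptive-rx off rx-usecs 8 rx-frames 32
 # theory(1): keep the CPUs out of deep C-states
 tuned-adm profile network-latency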


> Pings:
> https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing
> 
> Pause frames were activated again after the restart.
> 
> (Here a link for rerefence for the ethtool variables: 
> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

To make this easier to digest, here are some of the interesting counters.

  74171 Ethtool(eth4    ) stat:           24 (             24) <= outbound_pci_stalled_wr /sec 
  74172 Ethtool(eth4    ) stat:            1 (              1) <= outbound_pci_stalled_wr_events /sec 
  [...]
  74628 Ethtool(eth5    ) stat:           40 (             40) <= outbound_pci_stalled_wr /sec 
  74629 Ethtool(eth5    ) stat:            1 (              1) <= outbound_pci_stalled_wr_events /sec 
 

  74539 Ethtool(eth4    ) stat:          432 (            432) <= tx_global_pause /sec 
  74540 Ethtool(eth4    ) stat:       551264 (        551,264) <= tx_global_pause_duration /sec 
  74541 Ethtool(eth4    ) stat:          434 (            434) <= tx_mac_control_phy /sec 
  74542 Ethtool(eth4    ) stat:           31 (             31) <= tx_nop /sec 
  74543 Ethtool(eth4    ) stat:        24888 (         24,888) <= tx_packets /sec 
  74544 Ethtool(eth4    ) stat:        25187 (         25,187) <= tx_packets_phy /sec 
  74545 Ethtool(eth4    ) stat:          434 (            434) <= tx_pause_ctrl_phy /sec 
  74546 Ethtool(eth4    ) stat:     15190865 (     15,190,865) <= tx_prio0_bytes /sec 
  74547 Ethtool(eth4    ) stat:        24907 (         24,907) <= tx_prio0_packets /sec 
  74548 Ethtool(eth4    ) stat:       647795 (        647,795) <= tx_tso_bytes /sec 
  74549 Ethtool(eth4    ) stat:          196 (            196) <= tx_tso_packets /sec 
  74550 Ethtool(eth4    ) stat:     15005949 (     15,005,949) <= tx_vport_unicast_bytes /sec 
  74551 Ethtool(eth4    ) stat:        24780 (         24,780) <= tx_vport_unicast_packets /sec 
  74552 Ethtool(eth4    ) stat:           57 (             57) <= tx_xmit_more /sec 
  [...]
  75030 Ethtool(eth5    ) stat:           39 (             39) <= tx_global_pause /sec 
  75031 Ethtool(eth5    ) stat:       242374 (        242,374) <= tx_global_pause_duration /sec 
  75032 Ethtool(eth5    ) stat:           39 (             39) <= tx_mac_control_phy /sec 
  75033 Ethtool(eth5    ) stat:          166 (            166) <= tx_nop /sec 
  75034 Ethtool(eth5    ) stat:        14837 (         14,837) <= tx_packets /sec 
  75035 Ethtool(eth5    ) stat:        14816 (         14,816) <= tx_packets_phy /sec 
  75036 Ethtool(eth5    ) stat:           39 (             39) <= tx_pause_ctrl_phy /sec 
  75037 Ethtool(eth5    ) stat:     16748584 (     16,748,584) <= tx_prio0_bytes /sec 
  75038 Ethtool(eth5    ) stat:        14708 (         14,708) <= tx_prio0_packets /sec 
  75039 Ethtool(eth5    ) stat:      2477442 (      2,477,442) <= tx_tso_bytes /sec 
  75040 Ethtool(eth5    ) stat:          733 (            733) <= tx_tso_packets /sec 
  75041 Ethtool(eth5    ) stat:     16765330 (     16,765,330) <= tx_vport_unicast_bytes /sec 
  75042 Ethtool(eth5    ) stat:        14780 (         14,780) <= tx_vport_unicast_packets /sec 
  75043 Ethtool(eth5    ) stat:            2 (              2) <= tx_xmit_more /sec 


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-12 10:05                                             ` Jesper Dangaard Brouer
@ 2020-11-12 11:26                                               ` Thomas Rosenstein
  2020-11-12 13:31                                                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-12 11:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bufferbloat, Saeed Mahameed, Tariq Toukan



On 12 Nov 2020, at 11:05, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 15:33:46 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:
>>
>>> On Mon, 09 Nov 2020 11:09:33 +0100
>>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>>>
>>>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>>>
>>>>> On Sat, 07 Nov 2020 14:00:04 +0100
>>>>> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>>>>>
>>>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz
>>>>>> just
>>>>>> in
>>>>>> case there's something hidden.
>>>>>
>>>>> Yes, there is something hiding in the data from 
>>>>> ethtool_stats.pl[1]:
>>>>> (10G Mellanox Connect-X cards via 10G SPF+ DAC)
>>>>>
>>>>>  stat:            1 (          1) <= 
>>>>> outbound_pci_stalled_wr_events /sec
>>>>>  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy 
>>>>> /sec
>>>>>
>>>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before,
>>>>> looking
>>>>> in the kernel driver code it is related to 
>>>>> "rx_buffer_almost_full".
>>>>> The numbers per second is excessive (but it be related to a driver
>>>>> bug
>>>>> as it ends up reading "high" -> rx_buffer_almost_full_high in the
>>>>> extended counters).
>>
>> I have now tested with a new kernel 5.9.4 build made from 3.10 with 
>> make
>> oldconfig and I noticed an interesting effect.
>>
>> The first ca. 2 minutes the router behaves completely normal as with
>> 3.10, after that the ping times go crazy.
>>
>> I have recorded this with ethtool, and also the ping times.
>>
>> Ethtool: (13 MB)
>> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing
>>
>> The transfer first was doing around 50 - 70 MB/s then once the ping
>> times go worse it dropped to ~12 MB/s.
>> ca. Line 74324 the transfer speed drops to 12 MB/s
>
> Lets add them here so we can see and talk about them:
>
>  74324 Ethtool(eth4 ) stat:    144584002 (    144,584,002)  <= 
> rx_buffer_passed_thres_phy /sec
>  74325 Ethtool(eth4 ) stat:     17243186 (     17,243,186) <= rx_bytes 
> /sec
>  74326 Ethtool(eth4 ) stat:     19032079 (     19,032,079) <= 
> rx_bytes_phy /sec
>
> Significant difference between RX-bytes Phy-level and OS-level.
>
> 17243186*8/10^6 = 138 Mbit/s
>
>  74327 Ethtool(eth4 ) stat:          321 (            321) <= 
> rx_cache_busy /sec
>  74328 Ethtool(eth4 ) stat:          321 (            321) <= 
> rx_cache_full /sec
>  74329 Ethtool(eth4 ) stat:         7323 (          7,323) <= 
> rx_cache_reuse /sec
>  74330 Ethtool(eth4 ) stat:        15188 (         15,188) <= 
> rx_csum_complete /sec
>  74331 Ethtool(eth4 ) stat:            1 (              1) <= 
> rx_csum_none /sec
>  74332 Ethtool(eth4 ) stat:           96 (             96) <= 
> rx_csum_unnecessary /sec
>  74333 Ethtool(eth4 ) stat:         1242 (          1,242) <= 
> rx_discards_phy /sec
>
> Packets are getting rx_discards
>
>  74334 Ethtool(eth4 ) stat:        15285 (         15,285) <= 
> rx_packets /sec
>  74335 Ethtool(eth4 ) stat:        16655 (         16,655) <= 
> rx_packets_phy /sec
>
> (16655-15285 = 1370)
> Again difference between RX-packets Phy-level and OS-level
>
>  74336 Ethtool(eth4 ) stat:     19132497 (     19,132,497) <= 
> rx_prio0_bytes /sec
>  74337 Ethtool(eth4 ) stat:         1264 (          1,264) <= 
> rx_prio0_discards /sec
>  74338 Ethtool(eth4 ) stat:        15474 (         15,474) <= 
> rx_prio0_packets /sec

I have now also updated OFED on the routers (new firmware for the 
adapters) and checked with 3.10 and 5.4.60.
That changed absolutely nothing!

I still noticed that the discards only happen under kernel 5.4.60; on 
3.10 they don't.

Based on that I would blame the kernel rather than the hardware.
Do you agree?

I wouldn't rule out any of the settings as problematic for now; maybe 
some defaults changed.


>
>
>
>> Seems you are right about the rx_buffer_passed_thres_phy if you check
>> just those lines they appear more often once the speed dropped.
>> Not sure if that's the cause or an effect of the underlying problem!
>
> Thanks for the link to Mellanox doc:
>  https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters
>
> The counter: 'rx_buffer_passed_thres_phy'
>  Description:
>   'The number of events where the port receive buffer was over 85% 
> full.
>    Supported from kernel 4.14'
>
> The counter: 'tx_pause_ctrl_phy'
>  Description:
>  'The number of link layer pause packets transmitted on a physical
>   port. If this counter is increasing, it implies that the NIC is
>   congested and cannot absorb the traffic coming from the network.'
>
>
> So, the NIC is clearly telling us that something is wrong. You are not
> near link speed (12MB/s => 96Mbit/s).   And still port receive buffer
> is over 85% full most of the time.  There are indications that, the 
> NIC
> is congested and cannot absorb the traffic coming from the network.

I have opened a support case with Mellanox; maybe they have an idea 
directly.

Their first response, based on outbound_pci_stalled_wr, was that the 
PCI Express credits for the card run out, which means (if I interpret it 
correctly) that the buffer is no longer emptied quickly enough.
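
For what it is worth, one way to double-check the PCIe link the card
actually negotiated is via lspci; a sketch, where the PCI address is
only a placeholder for the Mellanox device:

  lspci -s 03:00.0 -vv | grep -E 'LnkCap|LnkSta'
  # LnkSta should report the expected speed (5GT/s for gen2) and width (x8)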

>
> Long-shot theory(1):  Could this be caused by CPUs going into too deep
> sleep states? And NIC not sending wake-up signal early enough, e.g.
> because it holds on to packets for generating a bulk at IRQ raise.
>
> Long-shot theory(2): The NIC driver IRQ interrupt coalesce
> adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is
> because of TSO frames, due packet-to-byte ratio isn't taken into
> account.

Here are the coalesce parameters (they are the same for 3.10 and 5.4.60):

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: on  TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 32
rx-frames: 64
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

>
> You can test theory(2) in two ways, via disabling TSO and via 
> adjusting
> ethtool --show-coalesce|--coalesce (-c|-C).  The --coalesce change can
> change part of theory(1), the NIC wake-up/IRQ interval.  For CPU sleep
> modes see software package 'tuned'.

CPU mode is set to performance; I thought that would stop any deep sleep 
states.
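
A minimal sketch of how the available idle states can be inspected,
assuming the standard cpuidle sysfs layout (the performance cpufreq
governor by itself does not restrict C-states):

  cat /sys/devices/system/cpu/cpuidle/current_driver
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency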

>
>
>> Pings:
>> https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing
>>
>> Pause frames were activated again after the restart.
>>
>> (Here is a link for reference for the ethtool variables:
>> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)
>
> To easier digest, here are some of the interesting counters.
>
>   74171 Ethtool(eth4    ) stat:           24 (             24) <= 
> outbound_pci_stalled_wr /sec
>   74172 Ethtool(eth4    ) stat:            1 (              1) <= 
> outbound_pci_stalled_wr_events /sec
>   [...]
>   74628 Ethtool(eth5    ) stat:           40 (             40) <= 
> outbound_pci_stalled_wr /sec
>   74629 Ethtool(eth5    ) stat:            1 (              1) <= 
> outbound_pci_stalled_wr_events /sec
>
>
>   74539 Ethtool(eth4    ) stat:          432 (            432) <= 
> tx_global_pause /sec
>   74540 Ethtool(eth4    ) stat:       551264 (        551,264) <= 
> tx_global_pause_duration /sec
>   74541 Ethtool(eth4    ) stat:          434 (            434) <= 
> tx_mac_control_phy /sec
>   74542 Ethtool(eth4    ) stat:           31 (             31) <= 
> tx_nop /sec
>   74543 Ethtool(eth4    ) stat:        24888 (         24,888) <= 
> tx_packets /sec
>   74544 Ethtool(eth4    ) stat:        25187 (         25,187) <= 
> tx_packets_phy /sec
>   74545 Ethtool(eth4    ) stat:          434 (            434) <= 
> tx_pause_ctrl_phy /sec
>   74546 Ethtool(eth4    ) stat:     15190865 (     15,190,865) <= 
> tx_prio0_bytes /sec
>   74547 Ethtool(eth4    ) stat:        24907 (         24,907) <= 
> tx_prio0_packets /sec
>   74548 Ethtool(eth4    ) stat:       647795 (        647,795) <= 
> tx_tso_bytes /sec
>   74549 Ethtool(eth4    ) stat:          196 (            196) <= 
> tx_tso_packets /sec
>   74550 Ethtool(eth4    ) stat:     15005949 (     15,005,949) <= 
> tx_vport_unicast_bytes /sec
>   74551 Ethtool(eth4    ) stat:        24780 (         24,780) <= 
> tx_vport_unicast_packets /sec
>   74552 Ethtool(eth4    ) stat:           57 (             57) <= 
> tx_xmit_more /sec
>   [...]
>   75030 Ethtool(eth5    ) stat:           39 (             39) <= 
> tx_global_pause /sec
>   75031 Ethtool(eth5    ) stat:       242374 (        242,374) <= 
> tx_global_pause_duration /sec
>   75032 Ethtool(eth5    ) stat:           39 (             39) <= 
> tx_mac_control_phy /sec
>   75033 Ethtool(eth5    ) stat:          166 (            166) <= 
> tx_nop /sec
>   75034 Ethtool(eth5    ) stat:        14837 (         14,837) <= 
> tx_packets /sec
>   75035 Ethtool(eth5    ) stat:        14816 (         14,816) <= 
> tx_packets_phy /sec
>   75036 Ethtool(eth5    ) stat:           39 (             39) <= 
> tx_pause_ctrl_phy /sec
>   75037 Ethtool(eth5    ) stat:     16748584 (     16,748,584) <= 
> tx_prio0_bytes /sec
>   75038 Ethtool(eth5    ) stat:        14708 (         14,708) <= 
> tx_prio0_packets /sec
>   75039 Ethtool(eth5    ) stat:      2477442 (      2,477,442) <= 
> tx_tso_bytes /sec
>   75040 Ethtool(eth5    ) stat:          733 (            733) <= 
> tx_tso_packets /sec
>   75041 Ethtool(eth5    ) stat:     16765330 (     16,765,330) <= 
> tx_vport_unicast_bytes /sec
>   75042 Ethtool(eth5    ) stat:        14780 (         14,780) <= 
> tx_vport_unicast_packets /sec
>   75043 Ethtool(eth5    ) stat:            2 (              2) <= 
> tx_xmit_more /sec
>
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Is it fine to always reply-all here, or should I just keep it to the list?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-12 11:26                                               ` Thomas Rosenstein
@ 2020-11-12 13:31                                                 ` Jesper Dangaard Brouer
  2020-11-12 13:42                                                   ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-12 13:31 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bufferbloat, Saeed Mahameed, Tariq Toukan, brouer

On Thu, 12 Nov 2020 12:26:20 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> > Long-shot theory(2): The NIC driver IRQ interrupt coalesce
> > adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is
> > because of TSO frames, due packet-to-byte ratio isn't taken into
> > account.  
> 
> here are the coalese parameters: (are the same for 3.10 and 5.4.60)
> 
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: on  TX: on

Notice the "Adaptive" setting is on.  My long-shot theory(2) is that this
adaptive algorithm in the driver code can guess wrong (due to not
taking TSO into account) and cause these issues.

Try to turn this adaptive algorithm off:

  ethtool -C eth4 adaptive-rx off adaptive-tx off

> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> 
> rx-usecs: 32

When you turn off "adaptive-rx" you will get 31250 interrupts/sec
(calc: 1/(32/10^6) = 31250).

> rx-frames: 64

You likely want to reduce this.

> rx-usecs-irq: 0
> rx-frames-irq: 0
> 
> tx-usecs: 8
> tx-frames: 128

These 128 frames also seem too large.

> tx-usecs-irq: 0
> tx-frames-irq: 0
> 
[...]
> >
> > You can test theory(2) in two ways, via disabling TSO and via 
> > adjusting
> > ethtool --show-coalesce|--coalesce (-c|-C).  The --coalesce change can
> > change part of theory(1), the NIC wake-up/IRQ interval.  

What are the NIC ring size settings?

 ethtool -g eth4

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-12 13:31                                                 ` Jesper Dangaard Brouer
@ 2020-11-12 13:42                                                   ` Thomas Rosenstein
  2020-11-12 15:42                                                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-12 13:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bufferbloat, Saeed Mahameed, Tariq Toukan



On 12 Nov 2020, at 14:31, Jesper Dangaard Brouer wrote:

> On Thu, 12 Nov 2020 12:26:20 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>>> Long-shot theory(2): The NIC driver IRQ interrupt coalesce
>>> adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is
>>> because of TSO frames, due packet-to-byte ratio isn't taken into
>>> account.
>>
>> here are the coalese parameters: (are the same for 3.10 and 5.4.60)
>>
>> ethtool -c eth4
>> Coalesce parameters for eth4:
>> Adaptive RX: on  TX: on
>
> Notice "Adaptive" setting is on.  My long-shot theory(2) is that this
> adaptive algorithm in the driver code can guess wrong (due to not
> taking TSO into account) and cause issues for
>
> Try to turn this adaptive algorithm off:
>
>   ethtool -C eth4 adaptive-rx off adaptive-tx off
>
>> stats-block-usecs: 0
>> sample-interval: 0
>> pkt-rate-low: 0
>> pkt-rate-high: 0
>>
>> rx-usecs: 32
>
> When you run off "adaptive-rx" you will get 31250 interrupts/sec
> (calc 1/(32/10^6) = 31250).
>
>> rx-frames: 64
>
> You likely want to reduce this.
>
>> rx-usecs-irq: 0
>> rx-frames-irq: 0
>>
>> tx-usecs: 8
>> tx-frames: 128
>
> These 128 frames also seems too large.
>
>> tx-usecs-irq: 0
>> tx-frames-irq: 0
>>
> [...]

I have now updated the settings to:

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0
rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 32
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
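
For reference, settings like the above would typically come from a
single invocation along these lines (the exact command used here is an
assumption):

  ethtool -C eth4 adaptive-rx off adaptive-tx off rx-usecs 0 rx-frames 32 tx-usecs 16 tx-frames 32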

but it makes absolutely no difference; the discards also keep coming as before.

>>>
>>> You can test theory(2) in two ways, via disabling TSO and via
>>> adjusting
>>> ethtool --show-coalesce|--coalesce (-c|-C).  The --coalesce change can
>>> change part of theory(1), the NIC wake-up/IRQ interval.
Here are the current NIC settings:

ethtool -k eth4
Features for eth4:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
	tx-tcp-segmentation: off
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

>
> What is the NIC ring size settings?
>
>  ethtool -g eth4
>

ethtool -g eth4
Ring parameters for eth4:
Pre-set maximums:
RX:		8192
RX Mini:	0
RX Jumbo:	0
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	0
RX Jumbo:	0
TX:		1024
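
As an aside, a larger RX ring is sometimes tried when PHY discard
counters climb; a sketch only, with 4096 as an illustrative value rather
than a recommendation:

  ethtool -G eth4 rx 4096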

> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-12 13:42                                                   ` Thomas Rosenstein
@ 2020-11-12 15:42                                                     ` Jesper Dangaard Brouer
  2020-11-13  6:31                                                       ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-12 15:42 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bufferbloat, Saeed Mahameed, Tariq Toukan, brouer

On Thu, 12 Nov 2020 14:42:59 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> > Notice "Adaptive" setting is on.  My long-shot theory(2) is that this
> > adaptive algorithm in the driver code can guess wrong (due to not
> > taking TSO into account) and cause issues for
> >
> > Try to turn this adaptive algorithm off:
> >
> >   ethtool -C eth4 adaptive-rx off adaptive-tx off
> >  
[...]
> >>
> >> rx-usecs: 32  
> >
> > When you run off "adaptive-rx" you will get 31250 interrupts/sec
> > (calc 1/(32/10^6) = 31250).
> >  
> >> rx-frames: 64  
[...]
> >> tx-usecs-irq: 0
> >> tx-frames-irq: 0
> >>  
> > [...]  
> 
> I have now updated the settings to:
> 
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: off  TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> 
> rx-usecs: 0

Please put a value in rx-usecs, like 20 or 10.
The value 0 is often used to signal the driver to do adaptive coalescing.
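
For example (20 is just the illustrative value from above):

  ethtool -C eth4 rx-usecs 20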

> rx-frames: 32
> rx-usecs-irq: 0
> rx-frames-irq: 0



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-12 15:42                                                     ` Jesper Dangaard Brouer
@ 2020-11-13  6:31                                                       ` Thomas Rosenstein
  2020-11-16 11:56                                                         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-13  6:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bufferbloat, Saeed Mahameed, Tariq Toukan



On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:

> On Thu, 12 Nov 2020 14:42:59 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>>> Notice "Adaptive" setting is on.  My long-shot theory(2) is that 
>>> this
>>> adaptive algorithm in the driver code can guess wrong (due to not
>>> taking TSO into account) and cause issues for
>>>
>>> Try to turn this adaptive algorithm off:
>>>
>>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>
> [...]
>>>>
>>>> rx-usecs: 32
>>>
>>> When you run off "adaptive-rx" you will get 31250 interrupts/sec
>>> (calc 1/(32/10^6) = 31250).
>>>
>>>> rx-frames: 64
> [...]
>>>> tx-usecs-irq: 0
>>>> tx-frames-irq: 0
>>>>
>>> [...]
>>
>> I have now updated the settings to:
>>
>> ethtool -c eth4
>> Coalesce parameters for eth4:
>> Adaptive RX: off  TX: off
>> stats-block-usecs: 0
>> sample-interval: 0
>> pkt-rate-low: 0
>> pkt-rate-high: 0
>>
>> rx-usecs: 0
>
> Please put a value in rx-usecs, like 20 or 10.
> The value 0 is often used to signal driver to do adaptive.

Ok, put it now to 10.

It goes a bit quicker (transfer up to 26 MB/s), but the discards and PCI 
stalls are still there.

Ping times are noticeably improved:

64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

>
>> rx-frames: 32
>> rx-usecs-irq: 0
>> rx-frames-irq: 0
>
>
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-13  6:31                                                       ` Thomas Rosenstein
@ 2020-11-16 11:56                                                         ` Jesper Dangaard Brouer
  2020-11-16 12:05                                                           ` Thomas Rosenstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-16 11:56 UTC (permalink / raw)
  To: Thomas Rosenstein; +Cc: Bufferbloat, Saeed Mahameed, Tariq Toukan, brouer

On Fri, 13 Nov 2020 07:31:26 +0100
"Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:

> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
> 
> > On Thu, 12 Nov 2020 14:42:59 +0100
> > "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
> >  
> >>> Notice "Adaptive" setting is on.  My long-shot theory(2) is that 
> >>> this
> >>> adaptive algorithm in the driver code can guess wrong (due to not
> >>> taking TSO into account) and cause issues for
> >>>
> >>> Try to turn this adaptive algorithm off:
> >>>
> >>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
> >>>  
> > [...]  
> >>>>
> >>>> rx-usecs: 32  
> >>>
> >>> When you run off "adaptive-rx" you will get 31250 interrupts/sec
> >>> (calc 1/(32/10^6) = 31250).
> >>>  
> >>>> rx-frames: 64  
> > [...]  
> >>>> tx-usecs-irq: 0
> >>>> tx-frames-irq: 0
> >>>>  
> >>> [...]  
> >>
> >> I have now updated the settings to:
> >>
> >> ethtool -c eth4
> >> Coalesce parameters for eth4:
> >> Adaptive RX: off  TX: off
> >> stats-block-usecs: 0
> >> sample-interval: 0
> >> pkt-rate-low: 0
> >> pkt-rate-high: 0
> >>
> >> rx-usecs: 0  
> >
> > Please put a value in rx-usecs, like 20 or 10.
> > The value 0 is often used to signal driver to do adaptive.  
> 
> Ok, put it now to 10.

Setting it to 10 is a little aggressive, as you ask the NIC to generate
up to 100,000 interrupts per sec.  (Watch with 'vmstat 1' to see it.)

 1/(10/10^6) = 100000 interrupts/sec
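
A quick sketch of what to watch: the 'in' column of vmstat is total
interrupts per second, and per-IRQ counts live in /proc/interrupts.

  vmstat 1 5
  watch -d -n1 'cat /proc/interrupts'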

> Goes a bit quicker (transfer up to 26 MB/s), but discards and pci stalls 
> are still there.

Why are you measuring in MBytes/sec? (26 MB/s equals 208 Mbit/s.)

If you still have ethtool PHY-discards, then you still have a problem.

> Ping times are noticable improved:

Okay, so this means these changes did have a positive effect.  So this
can be related to the OS not getting activated fast enough by NIC
interrupts.

 
> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

Can you try to test if disabling TSO, GRO and GSO makes a difference?

 ethtool -K eth4 gso off gro off tso off


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-16 11:56                                                         ` Jesper Dangaard Brouer
@ 2020-11-16 12:05                                                           ` Thomas Rosenstein
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-16 12:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bufferbloat, Saeed Mahameed, Tariq Toukan



On 16 Nov 2020, at 12:56, Jesper Dangaard Brouer wrote:

> On Fri, 13 Nov 2020 07:31:26 +0100
> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>
>> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
>>
>>> On Thu, 12 Nov 2020 14:42:59 +0100
>>> "Thomas Rosenstein" <thomas.rosenstein@creamfinance.com> wrote:
>>>
>>>>> Notice "Adaptive" setting is on.  My long-shot theory(2) is that
>>>>> this
>>>>> adaptive algorithm in the driver code can guess wrong (due to not
>>>>> taking TSO into account) and cause issues for
>>>>>
>>>>> Try to turn this adaptive algorithm off:
>>>>>
>>>>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>>>
>>> [...]
>>>>>>
>>>>>> rx-usecs: 32
>>>>>
>>>>> When you run off "adaptive-rx" you will get 31250 interrupts/sec
>>>>> (calc 1/(32/10^6) = 31250).
>>>>>
>>>>>> rx-frames: 64
>>> [...]
>>>>>> tx-usecs-irq: 0
>>>>>> tx-frames-irq: 0
>>>>>>
>>>>> [...]
>>>>
>>>> I have now updated the settings to:
>>>>
>>>> ethtool -c eth4
>>>> Coalesce parameters for eth4:
>>>> Adaptive RX: off  TX: off
>>>> stats-block-usecs: 0
>>>> sample-interval: 0
>>>> pkt-rate-low: 0
>>>> pkt-rate-high: 0
>>>>
>>>> rx-usecs: 0
>>>
>>> Please put a value in rx-usecs, like 20 or 10.
>>> The value 0 is often used to signal driver to do adaptive.
>>
>> Ok, put it now to 10.
>
> Setting it to 10 is a little aggressive, as you ask it to generate
> 100,000 interrupts per sec.  (Watch with 'vmstat 1' to see it.)
>
>  1/(10/10^6) = 100000 interrupts/sec
>
>> Goes a bit quicker (transfer up to 26 MB/s), but discards and pci 
>> stalls
>> are still there.
>
> Why are you measuring in (26) MBytes/sec ? (equal 208 Mbit/s)

Yep, 208 Mbit/s.

>
> If you still have ethtool PHY-discards, then you still have a problem.
>
>> Ping times are noticable improved:
>
> Okay so this means these changes did have a positive effect.  So, this
> can be related to OS is not getting activated fast-enough by NIC
> interrupts.
>
>
>> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
>> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
>> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
>> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
>> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
>> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
>> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
>> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
>> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
>> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
>> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
>> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms
>
> Can you try to test if disabling TSO, GRO and GSO makes a difference?
>
>  ethtool -K eth4 gso off gro off tso off
>

I had a call yesterday with Mellanox and we added the following boot 
options: intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll

This completely solved the problem, but now we are running a heater and 
energy hog, with nearly 2x the watts at the outlet.

I had no discards, excellent pings during the transfer (< 0.100 ms), no 
outliers, and good transfer rates > 50 MB/s.


So it seems to be related to the C-state management in newer kernel 
versions being too aggressive.
I would like to try to tune this a bit; maybe you can give some input on 
which knobs to turn?

I will read 
https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html#idle-states-representation
and related docs; I think there will be a few helpful hints in there.
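
A minimal sketch of less drastic, runtime alternatives to idle=poll,
assuming the standard cpuidle sysfs layout and the cpupower tool from
linux-tools (state index and latency threshold are only examples):

  # disable one specific deep state on a CPU (repeat per CPU / per state)
  echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state4/disable
  # or disable every idle state with an exit latency of 10 us or more
  cpupower idle-set -D 10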

>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-04 15:23 [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60 Thomas Rosenstein
  2020-11-04 16:10 ` Toke Høiland-Jørgensen
@ 2020-11-16 12:34 ` Jesper Dangaard Brouer
  2020-11-16 12:49   ` Thomas Rosenstein
  1 sibling, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-16 12:34 UTC (permalink / raw)
  To: Bufferbloat

On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:

[...] 
> I have multiple routers which connect to multiple upstream providers, I 
> have noticed a high latency shift in icmp (and generally all connection) 
> if I run b2 upload-file --threads 40 (and I can reproduce this)
> 
> What options do I have to analyze why this happens?
> 
> General Info:
> 
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 Ghz

I think I have spotted your problem... This CPU[1], the Xeon X5670, is
more than 10 years old!  It basically corresponds to the machines I used
for my presentation at LinuxCon 2009, see slides[2].  Only with large
frames and with massive scaling across all CPUs was I able to get close
to 10 Gbit/s through these machines.  And on top of that I had to buy
low-latency RAM modules to make it happen.

As you can see in my slides[2], memory bandwidth and PCIe speeds were at
the limit of making it possible at the hardware level.  I had to run
DDR3 memory at 1333 MHz and tune the QuickPath Interconnect (QPI) to
6.4 GT/s (default 4.8 GT/s).

This generation of motherboards had both PCIe gen-1 and gen-2 slots, and
only the PCIe gen-2 slots had barely enough bandwidth.  Maybe you
physically placed the NIC in a PCIe gen-1 slot?

On top of this, you also have a NUMA system (2x Xeon X5670), which can
result in A LOT of "funny" issues that are really hard to troubleshoot...
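
A small sketch of how the NUMA node and the negotiated PCIe link of the
NIC can be checked from sysfs (eth4 as in this thread; the link
attributes depend on a reasonably recent kernel):

  cat /sys/class/net/eth4/device/numa_node
  cat /sys/class/net/eth4/device/current_link_speed
  cat /sys/class/net/eth4/device/current_link_width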


[1] https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html

[2] https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
  2020-11-16 12:34 ` Jesper Dangaard Brouer
@ 2020-11-16 12:49   ` Thomas Rosenstein
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Rosenstein @ 2020-11-16 12:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Bufferbloat



On 16 Nov 2020, at 13:34, Jesper Dangaard Brouer wrote:

> On Wed, 04 Nov 2020 16:23:12 +0100
> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:
>
> [...]
>> I have multiple routers which connect to multiple upstream providers, 
>> I
>> have noticed a high latency shift in icmp (and generally all 
>> connection)
>> if I run b2 upload-file --threads 40 (and I can reproduce this)
>>
>> What options do I have to analyze why this happens?
>>
>> General Info:
>>
>> Routers are connected between each other with 10G Mellanox Connect-X
>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>> Latency generally is around 0.18 ms between all routers (4).
>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with 
>> iperf3.
>> 2 of the 4 routers are connected upstream with a 1G connection 
>> (separate
>> port, same network card)
>> All routers have the full internet routing tables, i.e. 80k entries 
>> for
>> IPv6 and 830k entries for IPv4
>> Conntrack is disabled (-j NOTRACK)
>> Kernel 5.4.60 (custom)
>> 2x Xeon X5670 @ 2.93 Ghz
>
> I think I have spotted your problem... This CPU[1] Xeon X5670 is more
> than 10 years old!  It basically corresponds to the machines I used 
> for
> my presentation at LinuxCon 2009 see slides[2].  Only with large 
> frames
> and with massive scaling across all CPUs was I able to get close to
> 10Gbit/s through these machines.  And on top I had to buy low-latency
> RAM memory-blocks to make it happen.
>
> As you can see on my slides[2], memory bandwidth and PCIe speeds was 
> at
> the limit for making it possible on the hardware level.  I had to run
> DDR3 memory at 1333MHz and tune the QuickPath Interconnect (QPI) to
> 6.4GT/s (default 4.8GT/s).
>
> This generation Motherboards had both PCIe gen-1 and gen-2 slots.  
> Only
> the PCIe gen-2 slots had barely enough bandwidth.  Maybe you 
> physically
> placed NIC in PCIe gen-1 slot?
>
> On top of this, you also have a NUMA system, 2x Xeon X5670, which can
> result is A LOT of "funny" issue, that is really hard to 
> troubleshoot...
>

Yes, I'm aware of the limits of what to expect, but as we agree, 60 TCP 
streams with not even 200 Mbit/s shouldn't overload the PCIe bus or the 
CPUs.

Also, don't forget, no issues with Kernel 3.10.

The PCI slot is a Gen2 x8, so more than enough bandwidth there, luckily ;)

But yes, they are quite old...

>
> [1] 
> https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html
>
> [2] 
> https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2020-11-16 12:49 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-04 15:23 [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60 Thomas Rosenstein
2020-11-04 16:10 ` Toke Høiland-Jørgensen
2020-11-04 16:24   ` Thomas Rosenstein
2020-11-05  0:10     ` Toke Høiland-Jørgensen
2020-11-05  8:48       ` Thomas Rosenstein
2020-11-05 11:21         ` Toke Høiland-Jørgensen
2020-11-05 12:22           ` Thomas Rosenstein
2020-11-05 12:38             ` Toke Høiland-Jørgensen
2020-11-05 12:41               ` Thomas Rosenstein
2020-11-05 12:47                 ` Toke Høiland-Jørgensen
2020-11-05 13:33             ` Jesper Dangaard Brouer
2020-11-06  8:48               ` Thomas Rosenstein
2020-11-06 10:53                 ` Jesper Dangaard Brouer
2020-11-06  9:18               ` Thomas Rosenstein
2020-11-06 11:18                 ` Jesper Dangaard Brouer
2020-11-06 11:37                   ` Thomas Rosenstein
2020-11-06 11:45                     ` Toke Høiland-Jørgensen
2020-11-06 12:01                       ` Thomas Rosenstein
2020-11-06 12:53                       ` Jesper Dangaard Brouer
2020-11-06 14:13                         ` Jesper Dangaard Brouer
2020-11-06 17:04                           ` Thomas Rosenstein
2020-11-06 20:19                             ` Jesper Dangaard Brouer
2020-11-07 12:37                               ` Thomas Rosenstein
2020-11-07 12:40                                 ` Jan Ceuleers
2020-11-07 12:43                                   ` Thomas Rosenstein
2020-11-07 13:00                                   ` Thomas Rosenstein
2020-11-09  8:24                                     ` Jesper Dangaard Brouer
2020-11-09 10:09                                       ` Thomas Rosenstein
2020-11-09 11:40                                         ` Jesper Dangaard Brouer
2020-11-09 11:51                                           ` Toke Høiland-Jørgensen
2020-11-09 12:25                                           ` Thomas Rosenstein
2020-11-09 14:33                                           ` Thomas Rosenstein
2020-11-12 10:05                                             ` Jesper Dangaard Brouer
2020-11-12 11:26                                               ` Thomas Rosenstein
2020-11-12 13:31                                                 ` Jesper Dangaard Brouer
2020-11-12 13:42                                                   ` Thomas Rosenstein
2020-11-12 15:42                                                     ` Jesper Dangaard Brouer
2020-11-13  6:31                                                       ` Thomas Rosenstein
2020-11-16 11:56                                                         ` Jesper Dangaard Brouer
2020-11-16 12:05                                                           ` Thomas Rosenstein
2020-11-09 16:39                                           ` Thomas Rosenstein
2020-11-07 13:33                                 ` Thomas Rosenstein
2020-11-07 16:46                                 ` Jesper Dangaard Brouer
2020-11-07 17:01                                   ` Thomas Rosenstein
2020-11-07 17:26                                     ` Sebastian Moeller
2020-11-16 12:34 ` Jesper Dangaard Brouer
2020-11-16 12:49   ` Thomas Rosenstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox