[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Thomas Rosenstein thomas.rosenstein at creamfinance.com
Mon Nov 9 09:33:46 EST 2020



On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 11:09:33 +0100
> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>
>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>
>>> On Sat, 07 Nov 2020 14:00:04 +0100
>>> Thomas Rosenstein via Bloat <bloat at lists.bufferbloat.net> wrote:
>>>
>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just
>>>> in case there's something hidden.
>>>
>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>> (10G Mellanox Connect-X cards via 10G SFP+ DAC)
>>>
>>>  stat:            1 (          1) <= outbound_pci_stalled_wr_events /sec
>>>  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>
>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before;
>>> looking in the kernel driver code, it is related to
>>> "rx_buffer_almost_full". The number per second is excessive (but it
>>> might be related to a driver bug, as it ends up reading "high" ->
>>> rx_buffer_almost_full_high in the extended counters).

I have now tested a new kernel 5.9.4 build, configured from the 3.10
config with make oldconfig, and I noticed an interesting effect.

For roughly the first 2 minutes the router behaves completely normally,
as with 3.10; after that the ping times go crazy.

I have recorded this with ethtool, and also the ping times.

Ethtool: (13 MB)
https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing

The transfer was initially running at around 50 - 70 MB/s; once the
ping times got worse it dropped to ~12 MB/s.
Around line 74324 of the ethtool recording the transfer speed drops to
12 MB/s.

It seems you are right about rx_buffer_passed_thres_phy: if you check
just those lines, they appear more often once the speed has dropped.
I'm not sure whether that's the cause or an effect of the underlying
problem!
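
For anyone who wants to check this against the recording, here is a
minimal Python sketch of the comparison. It assumes the log consists of
ethtool_stats.pl lines in the " stat: N (N) <= name /sec" form quoted
above; the function names and the file name in the usage note are
hypothetical, and the split point is the approximate line number
mentioned above.

```python
import re

# Matches ethtool_stats.pl lines such as:
#   stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
# The trailing "/sec" is not required, since mail wrapping can move it.
STAT_RE = re.compile(r"stat:\s+(-?\d+)\s+\(.*?\)\s+<=\s+(\S+)")


def split_counter(lines, counter, split_line):
    """Collect the values of `counter` before and from `split_line` on.

    Line numbers are 1-based, like in an editor.
    """
    before, after = [], []
    for lineno, line in enumerate(lines, start=1):
        m = STAT_RE.search(line)
        if m is None or m.group(2) != counter:
            continue
        target = before if lineno < split_line else after
        target.append(int(m.group(1)))
    return before, after


def summarize(values):
    """Sample count, mean and maximum of a list of counter values."""
    if not values:
        return {"samples": 0, "mean": 0.0, "max": 0}
    return {
        "samples": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }
```

Usage would be something like
split_counter(open("ethtool.log"), "rx_buffer_passed_thres_phy", 74324)
followed by summarize() on each half. This only shows whether the
counter fires more often after the drop; it says nothing about cause
versus effect.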

Pings:
https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing
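
To locate the point where the ping times go bad, a rough Python sketch
along these lines could be used. It assumes the recording is ordinary
ping output with "time=... ms" on each reply line; the baseline length,
the factor of 5 and the run of 3 replies are arbitrary illustrative
choices, and the function names are hypothetical.

```python
import re
import statistics

# Assumes reply lines like "64 bytes from ...: icmp_seq=1 ttl=64 time=0.5 ms"
RTT_RE = re.compile(r"time=([\d.]+)\s*ms")


def rtts_ms(lines):
    """Extract the round-trip times (in ms) from ping reply lines."""
    return [float(m.group(1)) for line in lines if (m := RTT_RE.search(line))]


def first_sustained_jump(rtts, baseline_n=10, factor=5.0, run=3):
    """Index of the first of `run` consecutive RTTs above factor * baseline.

    The baseline is the median of the first `baseline_n` samples.
    Returns None if there is too little data or no such run.
    """
    if len(rtts) < baseline_n + run:
        return None
    threshold = statistics.median(rtts[:baseline_n]) * factor
    for i in range(baseline_n, len(rtts) - run + 1):
        if all(r > threshold for r in rtts[i:i + run]):
            return i
    return None
```

With one ping per second, the returned index is roughly the number of
seconds into the test at which the latency jump starts, which can then
be compared with the point where the transfer speed drops.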

Pause frames were activated again after the restart.

(Here is a link for reference on the ethtool counters: 
https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

