[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Thomas Rosenstein thomas.rosenstein at creamfinance.com
Thu Nov 12 06:26:20 EST 2020



On 12 Nov 2020, at 11:05, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 15:33:46 +0100
> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>
>> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:
>>
>>> On Mon, 09 Nov 2020 11:09:33 +0100
>>> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>>>
>>>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>>>
>>>>> On Sat, 07 Nov 2020 14:00:04 +0100
>>>>> Thomas Rosenstein via Bloat <bloat at lists.bufferbloat.net> wrote:
>>>>>
>>>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz
>>>>>> just in case there's something hidden.
>>>>>
>>>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>>>> (10G Mellanox Connect-X cards via 10G SFP+ DAC)
>>>>>
>>>>>  stat:            1 (          1) <= outbound_pci_stalled_wr_events /sec
>>>>>  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>>>
>>>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before;
>>>>> looking in the kernel driver code, it is related to
>>>>> "rx_buffer_almost_full". The numbers per second are excessive (but
>>>>> it may be related to a driver bug, as it ends up reading "high" ->
>>>>> rx_buffer_almost_full_high in the extended counters).
>>
>> I have now tested with a new kernel 5.9.4, built from the 3.10 config
>> with make oldconfig, and I noticed an interesting effect.
>>
>> For the first ca. 2 minutes the router behaves completely normally, as
>> with 3.10; after that the ping times go crazy.
>>
>> I have recorded this with ethtool, and also the ping times.
>>
>> Ethtool: (13 MB)
>> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing
>>
>> The transfer was first doing around 50 - 70 MB/s; once the ping times
>> got worse, it dropped to ~12 MB/s.
>> At ca. line 74324 the transfer speed drops to 12 MB/s.
>
> Let's add them here so we can see and talk about them:
>
>  74324 Ethtool(eth4 ) stat:    144584002 (    144,584,002) <= rx_buffer_passed_thres_phy /sec
>  74325 Ethtool(eth4 ) stat:     17243186 (     17,243,186) <= rx_bytes /sec
>  74326 Ethtool(eth4 ) stat:     19032079 (     19,032,079) <= rx_bytes_phy /sec
>
> Significant difference between RX-bytes Phy-level and OS-level.
>
> 17243186*8/10^6 = 138 Mbit/s
>
>  74327 Ethtool(eth4 ) stat:          321 (            321) <= rx_cache_busy /sec
>  74328 Ethtool(eth4 ) stat:          321 (            321) <= rx_cache_full /sec
>  74329 Ethtool(eth4 ) stat:         7323 (          7,323) <= rx_cache_reuse /sec
>  74330 Ethtool(eth4 ) stat:        15188 (         15,188) <= rx_csum_complete /sec
>  74331 Ethtool(eth4 ) stat:            1 (              1) <= rx_csum_none /sec
>  74332 Ethtool(eth4 ) stat:           96 (             96) <= rx_csum_unnecessary /sec
>  74333 Ethtool(eth4 ) stat:         1242 (          1,242) <= rx_discards_phy /sec
>
> Packets are getting discarded (rx_discards_phy).
>
>  74334 Ethtool(eth4 ) stat:        15285 (         15,285) <= rx_packets /sec
>  74335 Ethtool(eth4 ) stat:        16655 (         16,655) <= rx_packets_phy /sec
>
> (16655 - 15285 = 1370)
> Again a difference between RX-packets at Phy-level and OS-level.
>
>  74336 Ethtool(eth4 ) stat:     19132497 (     19,132,497) <= rx_prio0_bytes /sec
>  74337 Ethtool(eth4 ) stat:         1264 (          1,264) <= rx_prio0_discards /sec
>  74338 Ethtool(eth4 ) stat:        15474 (         15,474) <= rx_prio0_packets /sec
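
To spell out the arithmetic hidden in those counters (a quick shell
sketch using the /sec values quoted above):

  echo "OS-level throughput:  $(( 17243186 * 8 / 1000000 )) Mbit/s"  # => 137, i.e. ~138 Mbit/s
  echo "phy-vs-OS packet gap: $(( 16655 - 15285 )) pkts/sec"         # => 1370
  echo "phy discard rate:     $(( 1242 * 100 / 16655 ))%"            # => 7, ~7.5% of rx_packets_phy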

I have also now updated OFED on the routers (new firmware for the
adapters) and checked with 3.10 and 5.4.60.
That changed absolutely nothing!

I still notice that the discards happen only under kernel 5.4.60; on
3.10 they don't.

Based on that I would blame the kernel rather than the hardware.
Do you agree?

I wouldn't exclude any settings as problematic for now; maybe some
defaults changed.
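
One way to watch the discards live while switching between the kernels
(a minimal sketch, assuming eth4 and the mlx5 counter names from the
pastes above):

  while sleep 1; do
      date +%T
      ethtool -S eth4 | grep -E 'rx_discards_phy|rx_packets_phy'
  done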


>
>
>
>> Seems you are right about rx_buffer_passed_thres_phy: if you check
>> just those lines, they appear more often once the speed dropped.
>> Not sure if that's the cause or an effect of the underlying problem!
>
> Thanks for the link to Mellanox doc:
>  https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters
>
> The counter: 'rx_buffer_passed_thres_phy'
>  Description:
>   'The number of events where the port receive buffer was over 85% full.
>    Supported from kernel 4.14'
>
> The counter: 'tx_pause_ctrl_phy'
>  Description:
>  'The number of link layer pause packets transmitted on a physical
>   port. If this counter is increasing, it implies that the NIC is
>   congested and cannot absorb the traffic coming from the network.'
>
>
> So, the NIC is clearly telling us that something is wrong. You are not
> near link speed (12 MB/s => 96 Mbit/s), and still the port receive
> buffer is over 85% full most of the time. There are indications that
> the NIC is congested and cannot absorb the traffic coming from the
> network.
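
For reference, the link-layer flow control that generates those pause
frames can be inspected and, as a test, switched off (a sketch; eth4 as
above, using the standard ethtool pause options):

  ethtool -a eth4                  # show current pause/flow-control settings
  ethtool -A eth4 rx off tx off    # test only: disable pause frames entirely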

I have opened a support case with Mellanox; maybe they have an idea
directly.

Their first response, based on outbound_pci_stalled_wr, was that the
PCI Express credits for the card run out.
Which means (if I interpret it correctly) that the buffer is no longer
emptied quickly enough.
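
If PCIe write credits are the bottleneck, one thing worth
sanity-checking is whether the card negotiated its full PCIe link
(a sketch; 15b3 is the Mellanox PCI vendor ID):

  lspci -d 15b3: -vv | grep -E 'LnkCap|LnkSta'
  # LnkSta should match LnkCap in speed and width; a downgraded link
  # would explain the buffers not being drained quickly enough.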

>
> Long-shot theory(1): Could this be caused by CPUs going into too-deep
> sleep states, and the NIC not sending the wake-up signal early enough,
> e.g. because it holds on to packets to generate a bulk at IRQ raise?
>
> Long-shot theory(2): The NIC driver's adaptive IRQ coalescing
> (adjustable via ethtool -C) is wrong. The reason it is wrong is TSO
> frames, since the packet-to-byte ratio isn't taken into account.

Here are the coalesce parameters (they are the same for 3.10 and 5.4.60):

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: on  TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 32
rx-frames: 64
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

>
> You can test theory(2) in two ways: via disabling TSO, and via
> adjusting ethtool --show-coalesce|--coalesce (-c|-C). The --coalesce
> change can also affect part of theory(1), the NIC wake-up/IRQ interval.
> For CPU sleep modes see the software package 'tuned'.
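
For reference, those two tests would look roughly like this (a sketch;
eth4 and the specific values are only illustrative):

  ethtool -K eth4 tso off    # theory(2): take TSO out of the picture
  ethtool -C eth4 adaptive-rx off rx-usecs 8 rx-frames 32
                             # pin IRQ coalescing instead of adaptive mode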

The CPU frequency governor is set to performance; I thought that would
stop any deep sleep states.
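
Note that the frequency governor and cpuidle C-states are separate
mechanisms, so deep idle states may still be entered. A minimal check,
using the standard sysfs paths:

  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage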

>
>
>> Pings:
>> https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing
>>
>> Pause frames were activated again after the restart.
>>
>> (Here a link for reference for the ethtool variables:
>> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)
>
> To make this easier to digest, here are some of the interesting counters.
>
>   74171 Ethtool(eth4    ) stat:           24 (             24) <= outbound_pci_stalled_wr /sec
>   74172 Ethtool(eth4    ) stat:            1 (              1) <= outbound_pci_stalled_wr_events /sec
>   [...]
>   74628 Ethtool(eth5    ) stat:           40 (             40) <= outbound_pci_stalled_wr /sec
>   74629 Ethtool(eth5    ) stat:            1 (              1) <= outbound_pci_stalled_wr_events /sec
>
>
>   74539 Ethtool(eth4    ) stat:          432 (            432) <= tx_global_pause /sec
>   74540 Ethtool(eth4    ) stat:       551264 (        551,264) <= tx_global_pause_duration /sec
>   74541 Ethtool(eth4    ) stat:          434 (            434) <= tx_mac_control_phy /sec
>   74542 Ethtool(eth4    ) stat:           31 (             31) <= tx_nop /sec
>   74543 Ethtool(eth4    ) stat:        24888 (         24,888) <= tx_packets /sec
>   74544 Ethtool(eth4    ) stat:        25187 (         25,187) <= tx_packets_phy /sec
>   74545 Ethtool(eth4    ) stat:          434 (            434) <= tx_pause_ctrl_phy /sec
>   74546 Ethtool(eth4    ) stat:     15190865 (     15,190,865) <= tx_prio0_bytes /sec
>   74547 Ethtool(eth4    ) stat:        24907 (         24,907) <= tx_prio0_packets /sec
>   74548 Ethtool(eth4    ) stat:       647795 (        647,795) <= tx_tso_bytes /sec
>   74549 Ethtool(eth4    ) stat:          196 (            196) <= tx_tso_packets /sec
>   74550 Ethtool(eth4    ) stat:     15005949 (     15,005,949) <= tx_vport_unicast_bytes /sec
>   74551 Ethtool(eth4    ) stat:        24780 (         24,780) <= tx_vport_unicast_packets /sec
>   74552 Ethtool(eth4    ) stat:           57 (             57) <= tx_xmit_more /sec
>   [...]
>   75030 Ethtool(eth5    ) stat:           39 (             39) <= tx_global_pause /sec
>   75031 Ethtool(eth5    ) stat:       242374 (        242,374) <= tx_global_pause_duration /sec
>   75032 Ethtool(eth5    ) stat:           39 (             39) <= tx_mac_control_phy /sec
>   75033 Ethtool(eth5    ) stat:          166 (            166) <= tx_nop /sec
>   75034 Ethtool(eth5    ) stat:        14837 (         14,837) <= tx_packets /sec
>   75035 Ethtool(eth5    ) stat:        14816 (         14,816) <= tx_packets_phy /sec
>   75036 Ethtool(eth5    ) stat:           39 (             39) <= tx_pause_ctrl_phy /sec
>   75037 Ethtool(eth5    ) stat:     16748584 (     16,748,584) <= tx_prio0_bytes /sec
>   75038 Ethtool(eth5    ) stat:        14708 (         14,708) <= tx_prio0_packets /sec
>   75039 Ethtool(eth5    ) stat:      2477442 (      2,477,442) <= tx_tso_bytes /sec
>   75040 Ethtool(eth5    ) stat:          733 (            733) <= tx_tso_packets /sec
>   75041 Ethtool(eth5    ) stat:     16765330 (     16,765,330) <= tx_vport_unicast_bytes /sec
>   75042 Ethtool(eth5    ) stat:        14780 (         14,780) <= tx_vport_unicast_packets /sec
>   75043 Ethtool(eth5    ) stat:            2 (              2) <= tx_xmit_more /sec
>
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Is it fine to always reply-all here, or should I just keep the list?


