[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
Thomas Rosenstein
thomas.rosenstein at creamfinance.com
Thu Nov 12 06:26:20 EST 2020
On 12 Nov 2020, at 11:05, Jesper Dangaard Brouer wrote:
> On Mon, 09 Nov 2020 15:33:46 +0100
> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>
>> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:
>>
>>> On Mon, 09 Nov 2020 11:09:33 +0100
>>> "Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:
>>>
>>>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>>>
>>>>> On Sat, 07 Nov 2020 14:00:04 +0100
>>>>> Thomas Rosenstein via Bloat <bloat at lists.bufferbloat.net> wrote:
>>>>>
>>>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just
>>>>>> in case there's something hidden.
>>>>>
>>>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>>>> (10G Mellanox Connect-X cards via 10G SFP+ DAC)
>>>>>
>>>>> stat:         1 (          1) <= outbound_pci_stalled_wr_events /sec
>>>>> stat: 339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>>>
>>>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before;
>>>>> looking in the kernel driver code, it is related to
>>>>> "rx_buffer_almost_full".
>>>>> The number per second is excessive (but it may be related to a driver
>>>>> bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
>>>>> extended counters).
>>
>> I have now tested with a new kernel 5.9.4 build, with the config taken
>> over from 3.10 via make oldconfig, and I noticed an interesting effect.
>>
>> For the first ca. 2 minutes the router behaves completely normally, just
>> as with 3.10; after that the ping times go crazy.
>>
>> I have recorded this with ethtool, and also the ping times.
>>
>> Ethtool: (13 MB)
>> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing
>>
>> The transfer was first doing around 50 - 70 MB/s; then, once the ping
>> times got worse, it dropped to ~12 MB/s.
>> At around line 74324 the transfer speed drops to 12 MB/s.
>
> Let's add them here so we can see and talk about them:
>
> 74324 Ethtool(eth4 ) stat: 144584002 (144,584,002) <= rx_buffer_passed_thres_phy /sec
> 74325 Ethtool(eth4 ) stat:  17243186 ( 17,243,186) <= rx_bytes /sec
> 74326 Ethtool(eth4 ) stat:  19032079 ( 19,032,079) <= rx_bytes_phy /sec
>
> Significant difference between RX-bytes Phy-level and OS-level.
>
> 17243186*8/10^6 = 138 Mbit/s
>
> 74327 Ethtool(eth4 ) stat:       321 (        321) <= rx_cache_busy /sec
> 74328 Ethtool(eth4 ) stat:       321 (        321) <= rx_cache_full /sec
> 74329 Ethtool(eth4 ) stat:      7323 (      7,323) <= rx_cache_reuse /sec
> 74330 Ethtool(eth4 ) stat:     15188 (     15,188) <= rx_csum_complete /sec
> 74331 Ethtool(eth4 ) stat:         1 (          1) <= rx_csum_none /sec
> 74332 Ethtool(eth4 ) stat:        96 (         96) <= rx_csum_unnecessary /sec
> 74333 Ethtool(eth4 ) stat:      1242 (      1,242) <= rx_discards_phy /sec
>
> Packets are getting discarded (rx_discards_phy)
>
> 74334 Ethtool(eth4 ) stat:     15285 (     15,285) <= rx_packets /sec
> 74335 Ethtool(eth4 ) stat:     16655 (     16,655) <= rx_packets_phy /sec
>
> (16655 - 15285 = 1370)
> Again a difference between RX-packets Phy-level and OS-level
>
> 74336 Ethtool(eth4 ) stat:  19132497 ( 19,132,497) <= rx_prio0_bytes /sec
> 74337 Ethtool(eth4 ) stat:      1264 (      1,264) <= rx_prio0_discards /sec
> 74338 Ethtool(eth4 ) stat:     15474 (     15,474) <= rx_prio0_packets /sec
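For the next run I could also watch just these counters directly, roughly
like this (a minimal sketch; unlike ethtool_stats.pl it prints the raw
cumulative counters rather than per-second rates, and the exact counter
names depend on the mlx5 driver version):

# sample the interesting eth4 counters once per second;
# the delta between two samples is what matters
while sleep 1; do
    date +%T
    ethtool -S eth4 | grep -E \
      'rx_discards_phy|rx_buffer_passed_thres_phy|tx_global_pause|outbound_pci_stalled'
done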
I have also now updated OFED on the routers (new firmware for the
adapters) and checked again with 3.10 and 5.4.60.
That changed absolutely nothing!
I still see that the discards only happen under kernel 5.4.60; on 3.10
they don't.
Based on that I would blame the kernel rather than the hardware.
Do you agree on this?
I wouldn't rule out any settings as problematic for now, maybe some
defaults changed.
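To compare the defaults between the two kernels I could dump the NIC and
net sysctl settings under each kernel and diff the results, roughly like
this (a sketch, assuming eth4/eth5 as above):

# run once under 3.10 and once under 5.4.60, then diff the output files
for dev in eth4 eth5; do
    for opt in -i -k -c -g -a -l; do  # driver, offloads, coalesce, rings, pause, channels
        echo "== ethtool $opt $dev =="
        ethtool $opt "$dev"
    done
done > nic-settings-$(uname -r).txt 2>&1
sysctl -a 2>/dev/null | grep '^net\.' > sysctl-net-$(uname -r).txt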
>
>
>
>> Seems you are right about rx_buffer_passed_thres_phy: if you check just
>> those lines, they appear more often once the speed dropped.
>> Not sure if that's the cause or an effect of the underlying problem!
>
> Thanks for the link to Mellanox doc:
> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters
>
> The counter: 'rx_buffer_passed_thres_phy'
> Description:
>  'The number of events where the port receive buffer was over 85% full.
>   Supported from kernel 4.14'
>
> The counter: 'tx_pause_ctrl_phy'
> Description:
> 'The number of link layer pause packets transmitted on a physical
> port. If this counter is increasing, it implies that the NIC is
> congested and cannot absorb the traffic coming from the network.'
>
>
> So, the NIC is clearly telling us that something is wrong. You are not
> near link speed (12 MB/s => 96 Mbit/s). And still the port receive buffer
> is over 85% full most of the time. There are indications that the NIC
> is congested and cannot absorb the traffic coming from the network.
I have opened a support case with Mellanox; maybe they have an idea
directly.
Their first response, based on outbound_pci_stalled_wr, was that the
PCI Express credits for the card run out.
Which means (if I interpret it correctly) that the buffer is no longer
emptied quickly enough.
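To make sure it is not simply a degraded PCIe link, I could also check the
negotiated width/speed of the cards, e.g. (a sketch; 15b3 is the Mellanox
PCI vendor ID):

lspci -d 15b3:                    # list the ConnectX devices / bus addresses
lspci -d 15b3: -vvv | grep -E 'LnkCap:|LnkSta:|MaxPayload|MaxReadReq'
# a LnkSta width or speed below LnkCap would point at the PCIe side
# rather than at the kernel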
>
> Long-shot theory(1): Could this be caused by CPUs going into too deep
> sleep states? And NIC not sending wake-up signal early enough, e.g.
> because it holds on to packets for generating a bulk at IRQ raise.
>
> Long-shot theory(2): The NIC driver's adaptive IRQ coalescing
> (adjusted via ethtool -C) is wrong. The reason it is wrong is TSO
> frames, because the packet-to-byte ratio isn't taken into account.
Here are the coalesce parameters (they are the same for 3.10 and 5.4.60):
ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: on TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 32
rx-frames: 64
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
>
> You can test theory(2) in two ways, via disabling TSO and via
> adjusting
> ethtool --show-coalesce|--coalesce (-c|-C). The --coalesce change can
> change part of theory(1), the NIC wake-up/IRQ interval. For CPU sleep
> modes see software package 'tuned'.
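For the TSO part of theory(2) I could try something like this on the next
test run (a sketch, untested here; the coalesce values are just a starting
point and everything can be switched back afterwards):

ethtool -K eth4 tso off gso off      # take TSO/GSO out of the picture
ethtool -C eth4 adaptive-rx off adaptive-tx off rx-usecs 8 rx-frames 32
ethtool -c eth4                      # confirm the new coalesce settings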
The CPU mode is set to performance; I thought that would stop any deep
sleep states.
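Though if "performance" only refers to the cpufreq governor, the cpuidle
C-states could still be active; a quick way to verify would be something
like this (a sketch, assuming the cpupower tool is installed):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # frequency only
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name     # available C-states
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable  # 1 = disabled
cpupower idle-info
# for a test: disable all idle states with an exit latency above 2 us,
# or use the tuned profile mentioned above (tuned-adm profile network-latency)
cpupower idle-set -D 2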
>
>
>> Pings:
>> https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing
>>
>> Pause frames were activated again after the restart.
>>
>> (Here a link for reference for the ethtool variables:
>> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)
>
> To make this easier to digest, here are some of the interesting counters.
>
> 74171 Ethtool(eth4 ) stat:        24 (         24) <= outbound_pci_stalled_wr /sec
> 74172 Ethtool(eth4 ) stat:         1 (          1) <= outbound_pci_stalled_wr_events /sec
> [...]
> 74628 Ethtool(eth5 ) stat:        40 (         40) <= outbound_pci_stalled_wr /sec
> 74629 Ethtool(eth5 ) stat:         1 (          1) <= outbound_pci_stalled_wr_events /sec
>
>
> 74539 Ethtool(eth4 ) stat:       432 (        432) <= tx_global_pause /sec
> 74540 Ethtool(eth4 ) stat:    551264 (    551,264) <= tx_global_pause_duration /sec
> 74541 Ethtool(eth4 ) stat:       434 (        434) <= tx_mac_control_phy /sec
> 74542 Ethtool(eth4 ) stat:        31 (         31) <= tx_nop /sec
> 74543 Ethtool(eth4 ) stat:     24888 (     24,888) <= tx_packets /sec
> 74544 Ethtool(eth4 ) stat:     25187 (     25,187) <= tx_packets_phy /sec
> 74545 Ethtool(eth4 ) stat:       434 (        434) <= tx_pause_ctrl_phy /sec
> 74546 Ethtool(eth4 ) stat:  15190865 ( 15,190,865) <= tx_prio0_bytes /sec
> 74547 Ethtool(eth4 ) stat:     24907 (     24,907) <= tx_prio0_packets /sec
> 74548 Ethtool(eth4 ) stat:    647795 (    647,795) <= tx_tso_bytes /sec
> 74549 Ethtool(eth4 ) stat:       196 (        196) <= tx_tso_packets /sec
> 74550 Ethtool(eth4 ) stat:  15005949 ( 15,005,949) <= tx_vport_unicast_bytes /sec
> 74551 Ethtool(eth4 ) stat:     24780 (     24,780) <= tx_vport_unicast_packets /sec
> 74552 Ethtool(eth4 ) stat:        57 (         57) <= tx_xmit_more /sec
> [...]
> 75030 Ethtool(eth5 ) stat:        39 (         39) <= tx_global_pause /sec
> 75031 Ethtool(eth5 ) stat:    242374 (    242,374) <= tx_global_pause_duration /sec
> 75032 Ethtool(eth5 ) stat:        39 (         39) <= tx_mac_control_phy /sec
> 75033 Ethtool(eth5 ) stat:       166 (        166) <= tx_nop /sec
> 75034 Ethtool(eth5 ) stat:     14837 (     14,837) <= tx_packets /sec
> 75035 Ethtool(eth5 ) stat:     14816 (     14,816) <= tx_packets_phy /sec
> 75036 Ethtool(eth5 ) stat:        39 (         39) <= tx_pause_ctrl_phy /sec
> 75037 Ethtool(eth5 ) stat:  16748584 ( 16,748,584) <= tx_prio0_bytes /sec
> 75038 Ethtool(eth5 ) stat:     14708 (     14,708) <= tx_prio0_packets /sec
> 75039 Ethtool(eth5 ) stat:   2477442 (  2,477,442) <= tx_tso_bytes /sec
> 75040 Ethtool(eth5 ) stat:       733 (        733) <= tx_tso_packets /sec
> 75041 Ethtool(eth5 ) stat:  16765330 ( 16,765,330) <= tx_vport_unicast_bytes /sec
> 75042 Ethtool(eth5 ) stat:     14780 (     14,780) <= tx_vport_unicast_packets /sec
> 75043 Ethtool(eth5 ) stat:         2 (          2) <= tx_xmit_more /sec
>
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
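Since tx_global_pause / tx_pause_ctrl_phy keep incrementing on both ports,
I could also do one run with Ethernet flow control disabled and see whether
the behaviour changes (a sketch; this just trades pausing the link for
drops at the NIC, so only for testing):

ethtool -a eth4                   # current pause settings
ethtool -A eth4 rx off tx off     # disable pause frames for the test
ethtool -A eth5 rx off tx off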
Is it fine to always reply all here, or should I just keep the list?