From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm1-x32e.google.com (mail-wm1-x32e.google.com [IPv6:2a00:1450:4864:20::32e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by lists.bufferbloat.net (Postfix) with ESMTPS id 5BD723B29D for ; Thu, 12 Nov 2020 06:26:26 -0500 (EST) Received: by mail-wm1-x32e.google.com with SMTP id s13so5156538wmh.4 for ; Thu, 12 Nov 2020 03:26:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=creamfinance.com; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=kKAftxO88AzcjgzYPRXNXzlqyZdU0c4PU7edEo7heoA=; b=kVBXWR+D+zItmmUYvD63K8P0ri0fTzH0RGEfMtdW2Km/lL32UVw7OpQlbBgLsQ0M1v W4H8CEuSHGP4ZeINrbkrroFmCDw+BcZD3nMFOj9FrLcK051rgsOa3YTDZ+GfvcT9jF17 M97QEaSbNp1p1Gvfg0bthDUPDl95VcRo0WwT0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=kKAftxO88AzcjgzYPRXNXzlqyZdU0c4PU7edEo7heoA=; b=AzjElcUit2kHsxnlB4lAhuntX5g2+5/37SLbexHMqi383x+VNbw1DgtpNgL4eBXb7m qOV2vFOG9jaNy6p+cGsDiAKQvO2Ks2iod4RFLkrBUZuqKp5pFGqPNbf+W4zlPZPYDyBM dNiSf1L0ELSSQ8+VuH+Lw/Zzzj8PMCKtUtDVEoJrpxIGI7UNcY/q+wU0pOCihG1Hh6E5 r2cVEeWS1mQaZypfiiVh6NAzAI6hvcs95VlzDNfImMvxL8OZU2sDh9PV+15lWA+HjwPs 03LwOKsuHARccRMJ27+eU8eT48bVB3zecUgQ7EDK0A1J15tl63RF1r2lzjV30iAvTcAl GhFQ== X-Gm-Message-State: AOAM532Lbmn7Wa7NI3m1crlki1eaiaYl8UryKboT2Avv8Lc8xUiTZ7jj JAPf67q16CjRLqJLqnHnrCIz X-Google-Smtp-Source: ABdhPJyc/lwOKhwLNvb/2KFvynygO3uDB1G9qWmIDcZlO0/xwYC+QiY8uU4aiu3SXElZ190EfxWv6Q== X-Received: by 2002:a1c:5545:: with SMTP id j66mr9475817wmb.31.1605180384510; Thu, 12 Nov 2020 03:26:24 -0800 (PST) Received: from [10.37.129.2] (ip-185.208.132.9.cf-it.at. [185.208.132.9]) by smtp.gmail.com with ESMTPSA id q5sm1823220wrf.41.2020.11.12.03.26.22 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 12 Nov 2020 03:26:23 -0800 (PST) From: "Thomas Rosenstein" To: "Jesper Dangaard Brouer" Cc: Bufferbloat , "Saeed Mahameed" , "Tariq Toukan" Date: Thu, 12 Nov 2020 12:26:20 +0100 X-Mailer: MailMate (1.13.2r5673) Message-ID: In-Reply-To: <20201112110508.59f402e7@carbon> References: <871rh8vf1p.fsf@toke.dk> <81ED2A33-D366-42FC-9344-985FEE8F11BA@creamfinance.com> <87sg9ot5f1.fsf@toke.dk> <20201105143317.78276bbc@carbon> <11812D44-BD46-4CA4-BA39-6080BD88F163@creamfinance.com> <20201106121840.7959ae4b@carbon> <87blgaso84.fsf@toke.dk> <20201106135358.09f6c281@carbon> <20201106151324.5f506574@carbon> <1E70B6D2-1212-43FA-989A-03B657EEE2F2@creamfinance.com> <20201106211940.4c30ccc9@carbon> <6963be0e-3eb5-5875-b53c-66033f50dc2d@gmail.com> <12D28386-7C00-4A31-91E4-37083C1674F9@creamfinance.com> <20201109092428.293104ea@carbon> <7723D882-4DAB-4A70-9D00-DF1976872AC2@creamfinance.com> <20201109124030.71216677@carbon> <27110D8E-77DF-4D10-A5EA-6430DBD55BC7@creamfinance.com> <20201112110508.59f402e7@carbon> MIME-Version: 1.0 Content-Type: text/plain; format=flowed; markup=markdown Content-Transfer-Encoding: quoted-printable Subject: Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60 X-BeenThere: bloat@lists.bufferbloat.net X-Mailman-Version: 2.1.20 Precedence: list List-Id: General list for discussing Bufferbloat List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 12 Nov 2020 11:26:26 -0000 On 12 Nov 2020, at 11:05, Jesper Dangaard Brouer wrote: > On Mon, 09 Nov 2020 15:33:46 +0100 > "Thomas Rosenstein" wrote: > >> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote: >> >>> On Mon, 09 Nov 2020 11:09:33 +0100 >>> "Thomas Rosenstein" wrote: >>> >>>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote: >>>> >>>>> On Sat, 07 Nov 2020 14:00:04 +0100 >>>>> Thomas Rosenstein via Bloat wrote: >>>>> >>>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz >>>>>> just >>>>>> in >>>>>> case there's something hidden. >>>>> >>>>> Yes, there is something hiding in the data from = >>>>> ethtool_stats.pl[1]: >>>>> (10G Mellanox Connect-X cards via 10G SPF+ DAC) >>>>> >>>>> stat: 1 ( 1) <=3D = >>>>> outbound_pci_stalled_wr_events /sec >>>>> stat: 339731557 (339,731,557) <=3D rx_buffer_passed_thres_phy = >>>>> /sec >>>>> >>>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before, >>>>> looking >>>>> in the kernel driver code it is related to = >>>>> "rx_buffer_almost_full". >>>>> The numbers per second is excessive (but it be related to a driver >>>>> bug >>>>> as it ends up reading "high" -> rx_buffer_almost_full_high in the >>>>> extended counters). >> >> I have now tested with a new kernel 5.9.4 build made from 3.10 with = >> make >> oldconfig and I noticed an interesting effect. >> >> The first ca. 2 minutes the router behaves completely normal as with >> 3.10, after that the ping times go crazy. >> >> I have recorded this with ethtool, and also the ping times. >> >> Ethtool: (13 MB) >> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view= ?usp=3Dsharing >> >> The transfer first was doing around 50 - 70 MB/s then once the ping >> times go worse it dropped to ~12 MB/s. >> ca. Line 74324 the transfer speed drops to 12 MB/s > > Lets add them here so we can see and talk about them: > > 74324 Ethtool(eth4 ) stat: 144584002 ( 144,584,002) <=3D = > rx_buffer_passed_thres_phy /sec > 74325 Ethtool(eth4 ) stat: 17243186 ( 17,243,186) <=3D rx_byte= s = > /sec > 74326 Ethtool(eth4 ) stat: 19032079 ( 19,032,079) <=3D = > rx_bytes_phy /sec > > Significant difference between RX-bytes Phy-level and OS-level. > > 17243186*8/10^6 =3D 138 Mbit/s > > 74327 Ethtool(eth4 ) stat: 321 ( 321) <=3D = > rx_cache_busy /sec > 74328 Ethtool(eth4 ) stat: 321 ( 321) <=3D = > rx_cache_full /sec > 74329 Ethtool(eth4 ) stat: 7323 ( 7,323) <=3D = > rx_cache_reuse /sec > 74330 Ethtool(eth4 ) stat: 15188 ( 15,188) <=3D = > rx_csum_complete /sec > 74331 Ethtool(eth4 ) stat: 1 ( 1) <=3D = > rx_csum_none /sec > 74332 Ethtool(eth4 ) stat: 96 ( 96) <=3D = > rx_csum_unnecessary /sec > 74333 Ethtool(eth4 ) stat: 1242 ( 1,242) <=3D = > rx_discards_phy /sec > > Packets are getting rx_discards > > 74334 Ethtool(eth4 ) stat: 15285 ( 15,285) <=3D = > rx_packets /sec > 74335 Ethtool(eth4 ) stat: 16655 ( 16,655) <=3D = > rx_packets_phy /sec > > (16655-15285 =3D 1370) > Again difference between RX-packets Phy-level and OS-level > > 74336 Ethtool(eth4 ) stat: 19132497 ( 19,132,497) <=3D = > rx_prio0_bytes /sec > 74337 Ethtool(eth4 ) stat: 1264 ( 1,264) <=3D = > rx_prio0_discards /sec > 74338 Ethtool(eth4 ) stat: 15474 ( 15,474) <=3D = > rx_prio0_packets /sec I have also updated now OFED on the routers and checked with 3.10 and = 5.4.60 (new firmware for the adapters) That did change absolutly nothing! I noticed still that only under kernel 5.4.60 the discard happen, on = 3.10 they don't. Based on that I would blame the kernel instead of hardware. Do you agree on this? I wouldn't exclude any settings to be problematic for now, maybe some = defaults changed. > > > >> Seems you are right about the rx_buffer_passed_thres_phy if you check >> just those lines they appear more often once the speed dropped. >> Not sure if that's the cause or an effect of the underlying problem! > > Thanks for the link to Mellanox doc: > https://community.mellanox.com/s/article/understanding-mlx5-ethtool-co= unters > > The counter: 'rx_buffer_passed_thres_phy' > Description: > 'The number of events where the port receive buffer was over 85% = > full. > Supported from kernel 4.14' > > The counter: 'tx_pause_ctrl_phy' > Description: > 'The number of link layer pause packets transmitted on a physical > port. If this counter is increasing, it implies that the NIC is > congested and cannot absorb the traffic coming from the network.' > > > So, the NIC is clearly telling us that something is wrong. You are not > near link speed (12MB/s =3D> 96Mbit/s). And still port receive buffer= > is over 85% full most of the time. There are indications that, the = > NIC > is congested and cannot absorb the traffic coming from the network. I have opened a support case with Mellanox maybe they have an idea = directly. Their first response based on the outbound_pci_stalled_wr was that the = PCIexpress credits for the card run out. Which means (if I interpret it correctly) that the buffer is not emptied = quickly enough anymore. > > Long-shot theory(1): Could this be caused by CPUs going into too deep > sleep states? And NIC not sending wake-up signal early enough, e.g. > because it holds on to packets for generating a bulk at IRQ raise. > > Long-shot theory(2): The NIC driver IRQ interrupt coalesce > adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is > because of TSO frames, due packet-to-byte ratio isn't taken into > account. here are the coalese parameters: (are the same for 3.10 and 5.4.60) ethtool -c eth4 Coalesce parameters for eth4: Adaptive RX: on TX: on stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 rx-usecs: 32 rx-frames: 64 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 8 tx-frames: 128 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0 > > You can test theory(2) in two ways, via disabling TSO and via = > adjusting > ethtool --show-coalesce|--coalesce (-c|-C). The --coalesce change can > change part of theory(1), the NIC wake-up/IRQ interval. For CPU sleep > modes see software package 'tuned'. CPU mode is set to performance, I thought that would stop any deep sleep = states. > > >> Pings: >> https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view= ?usp=3Dsharing >> >> Pause frames were activated again after the restart. >> >> (Here a link for rerefence for the ethtool variables: >> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-co= unters) > > To easier digest, here are some of the interesting counters. > > 74171 Ethtool(eth4 ) stat: 24 ( 24) <=3D = > outbound_pci_stalled_wr /sec > 74172 Ethtool(eth4 ) stat: 1 ( 1) <=3D = > outbound_pci_stalled_wr_events /sec > [...] > 74628 Ethtool(eth5 ) stat: 40 ( 40) <=3D = > outbound_pci_stalled_wr /sec > 74629 Ethtool(eth5 ) stat: 1 ( 1) <=3D = > outbound_pci_stalled_wr_events /sec > > > 74539 Ethtool(eth4 ) stat: 432 ( 432) <=3D = > tx_global_pause /sec > 74540 Ethtool(eth4 ) stat: 551264 ( 551,264) <=3D = > tx_global_pause_duration /sec > 74541 Ethtool(eth4 ) stat: 434 ( 434) <=3D = > tx_mac_control_phy /sec > 74542 Ethtool(eth4 ) stat: 31 ( 31) <=3D = > tx_nop /sec > 74543 Ethtool(eth4 ) stat: 24888 ( 24,888) <=3D = > tx_packets /sec > 74544 Ethtool(eth4 ) stat: 25187 ( 25,187) <=3D = > tx_packets_phy /sec > 74545 Ethtool(eth4 ) stat: 434 ( 434) <=3D = > tx_pause_ctrl_phy /sec > 74546 Ethtool(eth4 ) stat: 15190865 ( 15,190,865) <=3D = > tx_prio0_bytes /sec > 74547 Ethtool(eth4 ) stat: 24907 ( 24,907) <=3D = > tx_prio0_packets /sec > 74548 Ethtool(eth4 ) stat: 647795 ( 647,795) <=3D = > tx_tso_bytes /sec > 74549 Ethtool(eth4 ) stat: 196 ( 196) <=3D = > tx_tso_packets /sec > 74550 Ethtool(eth4 ) stat: 15005949 ( 15,005,949) <=3D = > tx_vport_unicast_bytes /sec > 74551 Ethtool(eth4 ) stat: 24780 ( 24,780) <=3D = > tx_vport_unicast_packets /sec > 74552 Ethtool(eth4 ) stat: 57 ( 57) <=3D = > tx_xmit_more /sec > [...] > 75030 Ethtool(eth5 ) stat: 39 ( 39) <=3D = > tx_global_pause /sec > 75031 Ethtool(eth5 ) stat: 242374 ( 242,374) <=3D = > tx_global_pause_duration /sec > 75032 Ethtool(eth5 ) stat: 39 ( 39) <=3D = > tx_mac_control_phy /sec > 75033 Ethtool(eth5 ) stat: 166 ( 166) <=3D = > tx_nop /sec > 75034 Ethtool(eth5 ) stat: 14837 ( 14,837) <=3D = > tx_packets /sec > 75035 Ethtool(eth5 ) stat: 14816 ( 14,816) <=3D = > tx_packets_phy /sec > 75036 Ethtool(eth5 ) stat: 39 ( 39) <=3D = > tx_pause_ctrl_phy /sec > 75037 Ethtool(eth5 ) stat: 16748584 ( 16,748,584) <=3D = > tx_prio0_bytes /sec > 75038 Ethtool(eth5 ) stat: 14708 ( 14,708) <=3D = > tx_prio0_packets /sec > 75039 Ethtool(eth5 ) stat: 2477442 ( 2,477,442) <=3D = > tx_tso_bytes /sec > 75040 Ethtool(eth5 ) stat: 733 ( 733) <=3D = > tx_tso_packets /sec > 75041 Ethtool(eth5 ) stat: 16765330 ( 16,765,330) <=3D = > tx_vport_unicast_bytes /sec > 75042 Ethtool(eth5 ) stat: 14780 ( 14,780) <=3D = > tx_vport_unicast_packets /sec > 75043 Ethtool(eth5 ) stat: 2 ( 2) <=3D = > tx_xmit_more /sec > > > -- = > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer Is it fine to always reply all here, or should I just keep the list?