[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Jesper Dangaard Brouer brouer at redhat.com
Mon Nov 9 06:40:30 EST 2020


On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein" <thomas.rosenstein at creamfinance.com> wrote:

> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
> 
> > On Sat, 07 Nov 2020 14:00:04 +0100
> > Thomas Rosenstein via Bloat <bloat at lists.bufferbloat.net> wrote:
> >  
> >> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in
> >> case there's something hidden.
> >
> > Yes, there is something hiding in the data from ethtool_stats.pl[1]:
> > (10G Mellanox ConnectX cards via 10G SFP+ DAC)
> >
> >  stat:            1 (          1) <= outbound_pci_stalled_wr_events /sec
> >  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
> >
> > I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking
> > in the kernel driver code, it is related to "rx_buffer_almost_full".
> > The number per second is excessive (but it may be related to a driver
> > bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
> > extended counters).

Note that this counter is a strong red flag that something is wrong.

> >  stat:     29583661 ( 29,583,661) <= rx_bytes /sec
> >  stat:     30343677 ( 30,343,677) <= rx_bytes_phy /sec
> >
> > You are receiving at 236 Mbit/s on a 10 Gbit/s link.  There is a
> > difference between what the OS sees (rx_bytes) and what the NIC
> > hardware sees (rx_bytes_phy) (diff approx 6Mbit/s).
> >
> >  stat:        19552 (     19,552) <= rx_packets /sec
> >  stat:        19950 (     19,950) <= rx_packets_phy /sec  
> 
> Could these packets be from VLAN interfaces that are not used in the OS?
> 
> >
> > The RX packet counters above also indicate that the HW is seeing more
> > packets than the OS is receiving.
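
As a sanity check on the receive figures quoted above, here is a minimal Python sketch of the arithmetic. The counter values are copied from the ethtool_stats.pl output in this mail; the helper name bytes_to_mbit is only for illustration.

```python
# Per-second counter values copied from the ethtool_stats.pl output above.
rx_bytes = 29_583_661        # bytes/sec as seen by the OS
rx_bytes_phy = 30_343_677    # bytes/sec as seen by the NIC hardware
rx_packets = 19_552          # packets/sec as seen by the OS
rx_packets_phy = 19_950      # packets/sec as seen by the NIC hardware


def bytes_to_mbit(bytes_per_sec):
    """Convert bytes/sec to Mbit/s (1 Mbit = 1,000,000 bits)."""
    return bytes_per_sec * 8 / 1_000_000


os_rate = bytes_to_mbit(rx_bytes)
byte_gap = bytes_to_mbit(rx_bytes_phy - rx_bytes)
packet_gap = rx_packets_phy - rx_packets

print(f"OS receive rate: {os_rate:.1f} Mbit/s")
print(f"phy minus OS:    {byte_gap:.1f} Mbit/s, {packet_gap} packets/sec")
```

This gives roughly 236.7 Mbit/s for the OS-level rate, with the hardware seeing about 6.1 Mbit/s and 398 packets/sec more than the OS.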
> >
> > The next counters are likely your problem:
> >
> >  stat:          718 (        718) <= tx_global_pause /sec
> >  stat:       954035 (    954,035) <= tx_global_pause_duration /sec
> >  stat:          714 (        714) <= tx_pause_ctrl_phy /sec  
> 
> As far as I can see that's only the TX, and we are only doing RX on this 
> interface - so maybe that's irrelevant?
> 
> >
> > It looks like you have enabled Ethernet Flow-Control, and something is
> > causing pause frames to be generated.  It seems strange that this
> > happens on a 10 Gbit/s link carrying only 236 Mbit/s.
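
To put the pause counters in proportion, here is a small Python sketch. It rests on an assumption that is not stated in this thread: that tx_global_pause_duration is reported in microseconds, which is how I read the mlx5 counter documentation. If the unit is different, the percentages below do not hold.

```python
# Per-second values copied from the ethtool_stats.pl output above.
tx_global_pause = 718                 # pause frames sent per second
tx_global_pause_duration = 954_035    # ASSUMPTION: unit is microseconds

USEC_PER_SEC = 1_000_000

# Share of each second covered by pause, under the microsecond assumption.
paused_fraction = tx_global_pause_duration / USEC_PER_SEC
# Average pause time attributed to each pause frame sent.
avg_pause_usec = tx_global_pause_duration / tx_global_pause

print(f"fraction of each second under pause: {paused_fraction:.1%}")
print(f"average duration per pause frame:    {avg_pause_usec:.0f} usec")
```

Under that assumption, pause would be asserted for about 95% of every second, which would be a lot for a link this lightly loaded.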
> >
> > The TX byte counters are also very strange:
> >
> >  stat:        26063 (     26,063) <= tx_bytes /sec
> >  stat:        71950 (     71,950) <= tx_bytes_phy /sec  
> 
> Also, it's TX, and we are only doing RX, as I said already somewhere, 
> it's async routing, so the TX data comes via another router back.

Okay, but as this is a router, you also need to transmit this
(asymmetric) traffic out another interface, right?

Could you also provide ethtool_stats for the TX interface?

Note that the tool[1] ethtool_stats.pl supports monitoring several
interfaces at the same time, e.g. run:

 ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX

And provide the output as a pastebin.
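
For readers who want to see where such "/sec" numbers come from, below is a rough Python sketch of the sampling idea: read the cumulative counters from `ethtool -S` twice and divide the difference by the interval. This is an illustration only, not a description of how the Perl script is implemented; the function names are made up, and eth4/ethTX are just the placeholder interface names from the command above.

```python
import re
import subprocess
import time


def parse_ethtool_stats(text):
    """Parse `ethtool -S <dev>` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     rx_packets: 12345"; the
        # "NIC statistics:" header has no number and is skipped.
        match = re.match(r"^\s*(.+?):\s*(-?\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def per_second_rates(before, after, seconds):
    """Return the per-second change of every counter that moved."""
    rates = {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, new_value)
        if delta:
            rates[name] = delta / seconds
    return rates


def sample(dev):
    """Read the current cumulative counters for one interface."""
    result = subprocess.run(
        ["ethtool", "-S", dev], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)


def monitor_once(devices, interval):
    """Sample all devices twice, `interval` seconds apart, and print rates."""
    first = {dev: sample(dev) for dev in devices}
    time.sleep(interval)
    for dev in devices:
        rates = per_second_rates(first[dev], sample(dev), interval)
        for name, rate in sorted(rates.items()):
            print(f"{dev}: {rate:14.0f} <= {name} /sec")
```

Calling monitor_once(["eth4", "ethTX"], 3) on the router would print one line per counter that changed during the 3 second window.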


> > [1] 
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >
> > Strange size distribution:
> >  stat:     19922 (     19,922) <= rx_1519_to_2047_bytes_phy /sec
> >  stat:        14 (         14) <= rx_65_to_127_bytes_phy /sec  
> 
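
The size histogram quoted above can be cross-checked against the byte and packet counters from earlier in this mail. A minimal Python sketch, using only values copied from the ethtool_stats.pl output:

```python
# Per-second values copied from the ethtool_stats.pl output in this mail.
rx_bytes_phy = 30_343_677
rx_packets_phy = 19_950
rx_1519_to_2047_bytes_phy = 19_922
rx_65_to_127_bytes_phy = 14

# Average frame size as seen by the NIC hardware.
avg_frame_bytes = rx_bytes_phy / rx_packets_phy
# Share of received frames that fall in the 1519..2047 byte bucket.
large_share = rx_1519_to_2047_bytes_phy / rx_packets_phy

print(f"average frame size at the phy: {avg_frame_bytes:.0f} bytes")
print(f"share in 1519..2047 bucket:    {large_share:.1%}")
```

The average comes out at about 1521 bytes per frame, which is consistent with nearly all frames (about 99.9%) landing in the 1519 to 2047 byte bucket.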

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


