On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:

On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein" thomas.rosenstein@creamfinance.com wrote:

Could you also provide ethtool_stats for the TX interface?

Notice that the tool[1] ethtool_stats.pl supports monitoring several
interfaces at the same time, e.g. run:

ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX

And provide output as pastebin.

I have now also repeated the same test with 3.10; here are the ethtool outputs:

https://drive.google.com/file/d/1c98MVV0JYl6Su6xZTpqwS7m-6OlbmAFp/view?usp=sharing

and the ping times:

https://drive.google.com/file/d/1xhbGJHb5jUbPsee4frbx-c-uqh-7orXY/view?usp=sharing

Sadly, the parameters we were looking at are not supported below 4.14.

But I immediately saw one thing that is very different:

ethtool --statistics eth4 | grep discards
rx_discards_phy: 0
tx_discards_phy: 0

If we check the ethtool output from 5.9.4, we have:

 rx_discards_phy: 151793
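
To make this comparison quicker to repeat on both kernels, here is a minimal sketch of a filter that prints only the non-zero *_discards_phy counters. The function name nonzero_discards is hypothetical (not part of ethtool or ethtool_stats.pl), and it assumes the usual "name: value" layout of the ethtool --statistics output shown above.

```shell
# Hypothetical helper: read `ethtool --statistics`-style output on stdin and
# print every *_discards_phy counter whose value is greater than zero.
# Assumes lines of the form "   counter_name: 12345".
nonzero_discards() {
    awk -F': *' '/_discards_phy/ {
        gsub(/^[ \t]+/, "", $1)          # strip leading indentation from the name
        if ($2 + 0 > 0) print $1 "=" $2  # keep only counters that moved
    }'
}

# Usage (eth4 as in this thread):
#   ethtool --statistics eth4 | nonzero_discards
```

On 3.10 this should print nothing for the numbers above; on 5.9.4 it should list rx_discards_phy.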

Also, the outbound_pci_stalled_wr_events become more frequent the lower the total bandwidth and the higher the ping gets.
Logically, something must be blocking the buffers: either they are not getting freed, or they are not rotated correctly, or processing is too slow.
I would exclude processing, simply based on the 0% CPU load, and also because it doesn't happen in 3.10.
It is also suspicious that the issue only appears after a certain time of activity (maybe total traffic?!).
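
To check whether the counters really start growing only after some time of activity, one option is to take two snapshots of the statistics and compare them. This is a rough sketch under the assumption that the snapshots are plain `ethtool --statistics` dumps saved to files; read_counter and counter_delta are hypothetical helper names, not existing tools.

```shell
# Hypothetical helper: print the value of counter NAME from a saved
# `ethtool --statistics` dump FILE (lines of the form "   name: value").
read_counter() {
    awk -F': *' -v name="$1" '
        { gsub(/^[ \t]+/, "", $1) }      # strip leading indentation
        $1 == name { print $2; exit }    # first exact match wins
    ' "$2"
}

# Hypothetical helper: counter_delta NAME BEFORE_FILE AFTER_FILE
# Prints how much the counter grew between the two snapshots.
# A counter missing from a snapshot is treated as 0.
counter_delta() {
    cd_before=$(read_counter "$1" "$2")
    cd_after=$(read_counter "$1" "$3")
    echo $(( ${cd_after:-0} - ${cd_before:-0} ))
}

# Usage (eth4 as in this thread; file names are arbitrary):
#   ethtool --statistics eth4 > before.txt; sleep 3
#   ethtool --statistics eth4 > after.txt
#   counter_delta rx_discards_phy before.txt after.txt
#   counter_delta outbound_pci_stalled_wr_events before.txt after.txt
```

Running this periodically next to the ping test should show whether the deltas stay at 0 at first and only become non-zero once the latency goes up.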