[Bloat] Excessive throttling with fq

Eric Dumazet eric.dumazet at gmail.com
Thu Jan 26 15:41:21 EST 2017


Can you post:

ethtool -i eth0
ethtool -k eth0

grep HZ /boot/config.... (what is the HZ value of your kernel)

I suspect a possible problem with TSO autodefer when/if HZ < 1000
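A minimal sketch of how to gather that, assuming a Debian-style config file
at /boot/config-$(uname -r); the exact path varies by distro:

  # kernel tick rate (CONFIG_HZ), relevant to the TSO autodefer suspicion above
  grep -E '^CONFIG_HZ' /boot/config-$(uname -r)

  # as a quick test, TSO/GSO can be switched off to see whether autodefer is implicated
  ethtool -K eth0 tso off gso off

(The ethtool -K line is only a diagnostic toggle, not a fix.)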

Thanks.

On Thu, 2017-01-26 at 21:19 +0100, Hans-Kristian Bakke wrote:
> There are two packet captures from fq with and without pacing here:
> 
> 
> https://owncloud.proikt.com/index.php/s/KuXIl8h8bSFH1fM
> 
> 
> 
> The server (with fq pacing/nopacing) is 10.0.5.10 and is running an
> Apache2 webserver on TCP port 443. The TCP client is an nginx reverse
> proxy at 10.0.5.13 on the same subnet, which in turn proxies the
> connection from the Windows 10 client.
> - I did try to connect directly to the server with the client (via a
> Linux gateway router), avoiding the nginx proxy and just using plain
> non-SSL HTTP. That did not change anything.
> - I also tried stopping the eth0 interface to force the traffic onto
> the eth1 interface in the LACP bond, which changed nothing.
> - I also pulled each of the cables on the switch to force the traffic
> to switch between interfaces in the LACP link between the client
> switch and the server switch.
> 
> 
> The CPU is a 5-6 year old Intel Xeon X3430 @ 4x2.40GHz on a
> SuperMicro platform. It is not very loaded and the results are always
> in the same ballpark with fq pacing on.
> 
> 
> 
> top - 21:12:38 up 12 days, 11:08,  4 users,  load average: 0.56, 0.68, 0.77
> Tasks: 1344 total,   1 running, 1343 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  1.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  :  0.0 us,  0.3 sy,  0.0 ni, 97.4 id,  2.0 wa,  0.0 hi,  0.3 si,  0.0 st
> %Cpu2  :  0.0 us,  2.0 sy,  0.0 ni, 96.4 id,  1.3 wa,  0.0 hi,  0.3 si,  0.0 st
> %Cpu3  :  0.7 us,  2.3 sy,  0.0 ni, 94.1 id,  3.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem : 16427572 total,   173712 free,  9739976 used,  6513884 buff/cache
> KiB Swap:  6369276 total,  6126736 free,   242540 used.  6224836 avail Mem
> 
> 
> This seems OK to me. It does have 24 drives in 3 ZFS pools at 144 TB
> raw storage in total, with several SAS HBAs that are pretty much
> always poking the system in some way or another.
> 
> 
> There are around 32K interrupts when running at 23 MB/s (as seen in
> Chrome downloads) with pacing on, and about 25K interrupts when
> running at 105 MB/s with fq nopacing. Is that normal?
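A rough way to see where those interrupts land and how much softirq work they
generate, assuming the bond members are eth0/eth1 as above:

  # per-CPU interrupt counts for the NIC queues, refreshed every second
  watch -n1 "grep -E 'eth0|eth1' /proc/interrupts"

  # network softirq activity per CPU
  watch -n1 "grep -E 'NET_RX|NET_TX' /proc/softirqs"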
> 
> 
> Hans-Kristian
> 
> 
> 
> On 26 January 2017 at 20:58, David Lang <david at lang.hm> wrote:
>         Is there any CPU bottleneck?
>         
>         Pacing causing this sort of problem makes me think that the
>         CPU either can't keep up or that something (HZ setting type of
>         thing) is delaying when the CPU can get used.
>         
>         It's not clear from the posts if the problem is with sending
>         data or receiving data.
>         
>         David Lang
>         
>         
>         On Thu, 26 Jan 2017, Eric Dumazet wrote:
>         
>                 Nothing jumps to mind.
>                 
>                 We use FQ on links varying from 1 Gbit to 100 Gbit,
>                 and we have no such issues.
>                 
>                 You could probably check on the server the various
>                 TCP stats reported by the ss command:
>                 
>                 ss -temoi dst <remoteip>
>                 
>                 The pacing rate is shown. You might have some issues,
>                 but it is hard to say.
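As a concrete example of that, with 10.0.5.13 (the nginx proxy above) as the
remote address, something like this should show pacing_rate next to cwnd and
rtt on a 4.8 kernel:

  ss -temoi dst 10.0.5.13

Comparing pacing_rate, cwnd, rtt and any retrans counts with pacing on versus
nopacing is probably the quickest way to see what fq thinks each flow should get.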
>                 
>                 
>                         After some more testing I see that if I
>                         disable fq pacing, the performance is restored
>                         to the expected levels:
>                         
>                         # for i in eth0 eth1; do tc qdisc replace dev $i root fq nopacing; done
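For reference, a sketch of the counterpart commands on the same two bond
members; pacing is the sch_fq default, and maxrate (the 900mbit value here is
just an illustrative choice) caps the per-flow pacing rate, which can help
narrow down whether pacing itself is what slows the transfer:

  # back to the default paced fq
  for i in eth0 eth1; do tc qdisc replace dev $i root fq; done

  # paced fq with an explicit per-flow rate cap, for comparison
  for i in eth0 eth1; do tc qdisc replace dev $i root fq maxrate 900mbit; done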
>                         
>                         
>                         Is this expected behaviour? There is some
>                         background traffic, but only in the sub-100
>                         mbit/s range on the switches and gateway
>                         between the server and client.
>                         
>                         
>                         The chain:
>                         Windows 10 client -> 1000 mbit/s -> switch ->
>                         2 x gigabit LACP -> switch -> 4 x gigabit LACP ->
>                         gw (fq_codel on all NICs) -> 4 x gigabit LACP
>                         (the same as in) -> switch -> 2 x LACP ->
>                         server (with misbehaving fq pacing)
>                         
>                         
>                         
>                         On 26 January 2017 at 19:38, Hans-Kristian
>                         Bakke <hkbakke at gmail.com> wrote:
>                                 I can add that this is without BBR,
>                                 just plain old kernel 4.8 cubic.
>                         
>                                 On 26 January 2017 at 19:36,
>                                 Hans-Kristian Bakke
>                                 <hkbakke at gmail.com> wrote:
>                                         Another day, another fq issue (or user error).
>
>                                         I try to do the seemingly simple task of downloading a
>                                         single large file over the local gigabit LAN from a
>                                         physical server running kernel 4.8 and sch_fq on Intel
>                                         server NICs.
>
>                                         For some reason it wouldn't go past around 25 MB/s.
>                                         After having replaced SSL with no SSL, replaced Apache
>                                         with nginx and verified that there is plenty of
>                                         bandwidth available between my client and the server, I
>                                         tried to change the qdisc from fq to pfifo_fast. It
>                                         instantly shot up to around the expected 85-90 MB/s.
>                                         The same happened with fq_codel in place of fq.
>
>                                         I then checked the statistics for fq, and the throttled
>                                         counter is increasing massively every second (eth0 and
>                                         eth1 are LACPed using Linux bonding, so both are seen
>                                         here):
>                         
>                         
>                                         qdisc fq 8007: root refcnt 2 limit 10000p flow_limit 100p
>                                           buckets 1024 orphan_mask 1023 quantum 3028
>                                           initial_quantum 15140 refill_delay 40.0ms
>                                          Sent 787131797 bytes 520082 pkt (dropped 15, overlimits 0 requeues 0)
>                                          backlog 98410b 65p requeues 0
>                                           15 flows (14 inactive, 1 throttled)
>                                           0 gc, 2 highprio, 259920 throttled, 15 flows_plimit
>                                         qdisc fq 8008: root refcnt 2 limit 10000p flow_limit 100p
>                                           buckets 1024 orphan_mask 1023 quantum 3028
>                                           initial_quantum 15140 refill_delay 40.0ms
>                                          Sent 2533167 bytes 6731 pkt (dropped 0, overlimits 0 requeues 0)
>                                          backlog 0b 0p requeues 0
>                                           24 flows (24 inactive, 0 throttled)
>                                           0 gc, 2 highprio, 397 throttled
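Worth noting: in sch_fq the throttled counter normally just counts pacing
deferrals (dequeue found the flow's next transmit time still in the future),
so it growing on its own is expected with pacing on; the dropped and
flows_plimit counters are the ones worth watching. A rough way to follow all
of them live on both bond members:

  watch -n1 'tc -s -d qdisc show dev eth0; tc -s -d qdisc show dev eth1'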
>                         
>                         
>                                         Do you have any suggestions?
>                         
>                         
>                                         Regards,
>                                         Hans-Kristian
>                         
>                         
>                         
>                         
> 
> 



