[Bloat] Excessive throttling with fq
Eric Dumazet
eric.dumazet at gmail.com
Thu Jan 26 15:41:21 EST 2017
Can you post:
ethtool -i eth0
ethtool -k eth0
grep HZ /boot/config.... (what is the HZ value of your kernel?)
I suspect a possible problem with TSO autodefer when/if HZ < 1000
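As a quick experiment, you could also try temporarily turning off TSO/GSO
on the NICs to see whether segmentation offload is part of the picture
(just a test, not a recommended permanent setting), e.g.:

ethtool -K eth0 tso off gso off
ethtool -K eth1 tso off gso off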
Thanks.
On Thu, 2017-01-26 at 21:19 +0100, Hans-Kristian Bakke wrote:
> There are two packet captures from fq with and without pacing here:
>
>
> https://owncloud.proikt.com/index.php/s/KuXIl8h8bSFH1fM
>
>
>
> The server (with fq pacing/nopacing) is 10.0.5.10 and is running an
> Apache2 webserver on TCP port 443. The TCP client is an nginx reverse
> proxy at 10.0.5.13 on the same subnet, which in turn proxies the
> connection from the Windows 10 client.
> - I did try connecting the client directly to the server (via a Linux
> gateway router), bypassing the nginx proxy and using plain non-SSL
> HTTP. That did not change anything.
> - I also tried stopping the eth0 interface to force the traffic onto
> the eth1 interface in the LACP bond, which changed nothing.
> - I also pulled each of the cables on the switch to force the traffic
> to switch between interfaces in the LACP link between the client
> switch and the server switch.
>
>
> The CPU is a 5-6 year old Intel Xeon X3430 @ 4x2.40GHz on a SuperMicro
> platform. It is not very loaded, and the results are always in the
> same ballpark with fq pacing on.
>
>
>
> top - 21:12:38 up 12 days, 11:08,  4 users,  load average: 0.56, 0.68, 0.77
> Tasks: 1344 total,   1 running, 1343 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  1.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  :  0.0 us,  0.3 sy,  0.0 ni, 97.4 id,  2.0 wa,  0.0 hi,  0.3 si,  0.0 st
> %Cpu2  :  0.0 us,  2.0 sy,  0.0 ni, 96.4 id,  1.3 wa,  0.0 hi,  0.3 si,  0.0 st
> %Cpu3  :  0.7 us,  2.3 sy,  0.0 ni, 94.1 id,  3.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem : 16427572 total,   173712 free,  9739976 used,  6513884 buff/cache
> KiB Swap:  6369276 total,  6126736 free,   242540 used.  6224836 avail Mem
>
>
> This seems OK to me. It does have 24 drives in 3 ZFS pools (144 TB raw
> storage in total) behind several SAS HBAs that are pretty much always
> poking the system in one way or another.
>
>
> There are around 32K interrupts when running at 23 MB/s (as seen in
> Chrome's download view) with pacing on, and about 25K interrupts when
> running at 105 MB/s with fq nopacing. Is that normal?
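>
> (For reference, a simple way to watch the interrupt rate over time is
> the 'in' column of vmstat 1, or diffing /proc/interrupts.)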
>
>
> Hans-Kristian
>
>
>
> On 26 January 2017 at 20:58, David Lang <david at lang.hm> wrote:
> Is there any CPU bottleneck?
>
> Pacing causing this sort of problem makes me think that the CPU either
> can't keep up, or that something (an HZ-setting type of thing) is
> delaying when the CPU can get used.
>
> It's not clear from the posts whether the problem is with sending data
> or receiving data.
>
> David Lang
>
>
> On Thu, 26 Jan 2017, Eric Dumazet wrote:
>
> Nothing jumps out at me.
>
> We use FQ on links varying from 1 Gbit to 100 Gbit, and we have no
> such issues.
>
> On the server, you could check the various TCP stats reported by the
> ss command:
>
> ss -temoi dst <remoteip>
>
> The pacing rate is shown there. You might have some issue, but it is
> hard to say.
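>
> (For example, to sample the pacing rate once per second during a
> transfer, something like this loop should work, 10.0.5.13 being the
> client here:
>
> while true; do ss -temoi dst 10.0.5.13 | grep pacing_rate; sleep 1; done
>
> The exact output layout varies a bit between iproute2 versions.)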
>
>
> On Thu, 2017-01-26 at 19:55 +0100, Hans-Kristian Bakke wrote:
> After some more testing I see that if I disable fq pacing the
> performance is restored to the expected levels:
>
> # for i in eth0 eth1; do tc qdisc replace dev $i root fq nopacing; done
>
> Is this expected behaviour? There is some background traffic, but only
> in the sub-100 mbit/s range on the switches and gateway between the
> server and client.
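>
> (Going back to the default behaviour should just be the same loop
> without the nopacing flag, since pacing is fq's default:
> for i in eth0 eth1; do tc qdisc replace dev $i root fq; done)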
>
>
> The chain:
> Windows 10 client -> 1000 mbit/s -> switch -> 2 x gigabit LACP ->
> switch -> 4 x gigabit LACP -> gw (fq_codel on all NICs) ->
> 4 x gigabit LACP (the same as in) -> switch -> 2 x LACP ->
> server (with misbehaving fq pacing)
>
>
>
> On 26 January 2017 at 19:38, Hans-Kristian Bakke <hkbakke at gmail.com> wrote:
> I can add that this is without BBR, just plain old kernel 4.8 cubic.
>
> On 26 January 2017 at 19:36, Hans-Kristian Bakke <hkbakke at gmail.com> wrote:
> Another day, another fq issue (or user error).
>
>
> I am trying to do the seemingly simple task of downloading a single
> large file over the local gigabit LAN from a physical server running
> kernel 4.8 and sch_fq on Intel server NICs.
>
> For some reason it wouldn't go past around 25 MB/s. After replacing
> SSL with plain HTTP, replacing Apache with nginx, and verifying that
> there is plenty of bandwidth available between my client and the
> server, I tried changing the qdisc from fq to pfifo_fast. Throughput
> instantly shot up to around the expected 85-90 MB/s. The same happened
> with fq_codel in place of fq.
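>
> (The qdisc swaps were done with something along the lines of:
> for i in eth0 eth1; do tc qdisc replace dev $i root pfifo_fast; done
> and the same again with fq_codel.)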
>
>
> I then checked the statistics for fq, and the throttled counter is
> increasing massively every second (eth0 and eth1 are LACPed using
> Linux bonding, so both are shown here):
>
>
> qdisc fq 8007: root refcnt 2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028 initial_quantum 15140 refill_delay 40.0ms
>  Sent 787131797 bytes 520082 pkt (dropped 15, overlimits 0 requeues 0)
>  backlog 98410b 65p requeues 0
>   15 flows (14 inactive, 1 throttled)
>   0 gc, 2 highprio, 259920 throttled, 15 flows_plimit
> qdisc fq 8008: root refcnt 2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028 initial_quantum 15140 refill_delay 40.0ms
>  Sent 2533167 bytes 6731 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>   24 flows (24 inactive, 0 throttled)
>   0 gc, 2 highprio, 397 throttled
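>
> (Those stats are from tc -s qdisc; as far as I understand, the
> 'throttled' counter is how many times fq has delayed a flow to enforce
> its pacing rate.)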
>
>
> Do you have any suggestions?
>
>
> Regards,
> Hans-Kristian
>
>
>
>
> _______________________________________________
> Bloat mailing list
> Bloat at lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>