[Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)

Eric Dumazet eric.dumazet at gmail.com
Mon Mar 24 13:41:27 EDT 2014


On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:

> 
> It has long been my hope that conventional distros would start
> selecting sch_fq and sch_fq_codel in safe scenarios.
> 
> 1) Can an appropriate clocksource be detected from userspace?
> 
> if [ have_good_clocksources ]
> then
> if [ i am a router ]
> then
> sysctl -w something=fq_codel # or is it an entry in proc?
> else
> sysctl -w something=sch_fq
> fi
> fi
> 

Sure, you can do all this from user space.
That's policy, and it does not belong in the kernel.

sysctl -w net.core.default_qdisc=fq

# force a load/delete to bring the default qdisc to all devices already up
# (physical devices only: virtual ones have no /sys/class/net/<dev>/device)
for ETH in $(ls /sys/class/net)
do
 [ -e "/sys/class/net/$ETH/device" ] || continue
 tc qdisc add dev "$ETH" root pfifo 2>/dev/null
 tc qdisc del dev "$ETH" root 2>/dev/null
done
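Dave's pseudocode can be filled in from user space along these lines. This is a minimal sketch under two assumptions that do not come from this thread: a clocksource counts as "good" if it is tsc (x86) or arch_sys_counter (ARM), and a machine counts as a router if net.ipv4.ip_forward is 1. The helper names pick_qdisc and apply_default_qdisc are hypothetical.

```shell
# Hypothetical helper: choose a default qdisc from a clocksource name and the
# ip_forward flag. ASSUMPTION: only tsc and arch_sys_counter count as "good".
pick_qdisc() {
  clocksource=$1 forwarding=$2
  case "$clocksource" in
    tsc|arch_sys_counter) ;;
    *) echo none; return ;;  # keep the distro default (pfifo_fast)
  esac
  if [ "$forwarding" = 1 ]; then
    echo fq_codel            # router: forwarded traffic, no local sockets
  else
    echo fq                  # host: locally generated traffic can be paced
  fi
}

# Hypothetical wrapper: read the live values and set the sysctl (needs root).
apply_default_qdisc() {
  cs=$(cat /sys/devices/system/clocksource/clocksource0/current_clocksource)
  fwd=$(cat /proc/sys/net/ipv4/ip_forward)
  q=$(pick_qdisc "$cs" "$fwd")
  [ "$q" = none ] || sysctl -w net.core.default_qdisc="$q"
}
```

Running apply_default_qdisc as root (for example from a boot script) only changes net.core.default_qdisc; the load/delete loop above is still what moves devices that are already up to the new default.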

> How early in boot would this have to be to take effect?

It doesn't matter, as long as you force a load/unload of the qdisc.
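To confirm the reload worked, one can look at the root qdisc that `tc qdisc show dev <dev>` reports. On a multiqueue NIC the root is typically mq, with the default qdisc attached as a child of each transmit queue, so both levels matter. A small parsing sketch; the helper name root_qdisc is hypothetical, and it relies only on the leading fields of each line (`qdisc KIND HANDLE: root|parent ...`), skipping ingress and clsact.

```shell
# Hypothetical helper: read `tc qdisc show dev X` output on stdin and print the
# root qdisc kind, or "root/child" when the root has children (e.g. mq/fq).
root_qdisc() {
  awk '$1 == "qdisc" && $2 == "ingress" { next }
       $1 == "qdisc" && $2 == "clsact"  { next }
       $1 == "qdisc" && $4 == "root"    { root = $2 }
       $1 == "qdisc" && $4 == "parent" && child == "" { child = $2 }
       END { if (child != "") print root "/" child; else print root }'
}
```

For example, `tc qdisc show dev eth0 | root_qdisc` (eth0 being a placeholder) should print fq for a single-queue device, or mq/fq for a multiqueue one, once the new default has taken effect.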

> 
> 2) In the case of a server machine providing vms, and meeting the
> above precondition(s),
> which qdisc would be more appropriate, sch_fq or sch_codel?

sch_fq 'works' only for locally generated traffic, as we look at
skb->sk->sk_pacing_rate to read the per-socket rate. There is no way a
hypervisor (or a router two hops away) can access the original socket
without hacks.

If your Linux VM needs TCP pacing, then it also needs the fq packet
scheduler in the VM.
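A simplified model of why this is so: for a packet that still carries its socket, sch_fq spaces the flow's packets by roughly len / sk_pacing_rate, with sk_pacing_rate in bytes per second. A forwarded packet (on a router, or on a hypervisor relaying a guest's traffic) reaches the qdisc without skb->sk, so there is no rate to read and nothing to pace with. The sketch ignores fq's quantum and initial-quantum credits and its maxrate cap; the function name pacing_gap_ns is hypothetical.

```shell
# Hypothetical model: nanoseconds sch_fq would wait after sending LEN bytes of a
# flow paced at RATE bytes/s (as read from skb->sk->sk_pacing_rate).
# An empty RATE stands for a packet with no local socket (forwarded traffic):
# there is no per-socket rate, so no pacing delay (gap 0).
pacing_gap_ns() {
  len=$1 rate=$2
  if [ -z "$rate" ]; then
    echo 0
  else
    echo $(( len * 1000000000 / rate ))
  fi
}
```

For example, `pacing_gap_ns 1514 125000000` (a full-size frame at 1 Gbit/s) gives 12112 ns, while `pacing_gap_ns 1514 ''` gives 0. Inside the VM the guest's own sockets are visible, which is why fq has to run there.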

> 
> 3) Containers?
> 
> 4) The machine in the vm going through the virtual ethernet interface?
> 
> (I don't understand to what extent tracking the exit of packets from tcp through
> the stack and vm happens - I imagine a TSO is preserved all the way through,
> and also imagine that tcp small queues doesn't survive transit through the vm,
> but I am known to have a fevered imagination.)

TCP Small Queues controls the host queues, not the queues on external
routers. Consider a hypervisor as a router.
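The budget TCP Small Queues enforces is per socket and counts only bytes that the socket still has queued below TCP on this host (qdisc and device TX ring, accounted via sk_wmem_alloc). Once a packet has left the host it no longer counts, which is why TSQ cannot see a queue building in a hypervisor or a router. A rough sketch of the limit, modeled on the form used in later Linux kernels, min(max(2 * truesize, pacing_rate >> 10), tcp_limit_output_bytes); treat the exact formula as an assumption for any given kernel version. The function name tsq_limit and the sample values are hypothetical.

```shell
# Hypothetical model of the per-socket TSQ budget, in bytes.
# ASSUMPTION: later-kernel form, not necessarily what 3.10 or 3.14 do:
#   limit = min(max(2 * truesize, pacing_rate >> 10), tcp_limit_output_bytes)
tsq_limit() {
  truesize=$1 pacing_rate=$2 sysctl_limit=$3
  limit=$(( 2 * truesize ))
  per_ms=$(( pacing_rate >> 10 ))        # roughly 1 ms of data at the pacing rate
  if [ "$per_ms" -gt "$limit" ]; then
    limit=$per_ms
  fi
  if [ "$limit" -gt "$sysctl_limit" ]; then
    limit=$sysctl_limit                  # net.ipv4.tcp_limit_output_bytes cap
  fi
  echo "$limit"
}
```

With an example cap of 262144 bytes, `tsq_limit 4352 125000000 262144` (1 Gbit/s) gives 122070, and at 10 Gbit/s (`tsq_limit 4352 1250000000 262144`) the cap wins and it gives 262144. Nothing in either number depends on queues beyond the host.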

> 
> 
> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
> > early, since the goal of TSO autosizing + FQ/pacing is to get ACK clocking
> > every ms. By design, it tends to get ACK trains, way before the cwnd
> > might reach BDP.
> 
> Fascinating! Push on one thing, break another. As best I recall hystart had a
> string of issues like this in its early deployment.
> 
> /me looks forward to one day escaping 3.10-land and observing this for himself
> 
> so some sort of bidirectional awareness of the underlying qdisc would be needed
> to retune hystart properly.
> 
> Is ms resolution the best possible at this point?

Nope. Hystart ACK train detection is very lazy, and the current algorithm
is kind of a hack. If you use a finer resolution, you run into problems
with ACK jitter on the reverse path. Really, looking only at the delay
between two ACKs is not generic enough: we need something else, or we
should just disable ACK TRAIN detection, as it is not that useful. Delay
detection is less noisy.
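A sketch of the two-ACK test being criticized, loosely modeled on tcp_cubic's hystart_update() in kernels of that era: an ACK arriving within hystart_ack_delta (2 ms by default then) of the previous one extends the train, and once the train spans more than about half the minimum RTT since the start of the round, slow start ends. With fq pacing, ACKs come back roughly every ms, so every gap passes the test and the "train" lasts the whole round, long before cwnd reaches the BDP. Times are whole milliseconds; the function name hystart_ack_train is hypothetical.

```shell
# Hypothetical model of CUBIC Hystart's ACK-train test (ms resolution).
# Usage: hystart_ack_train ACK_DELTA MIN_RTT ROUND_START T1 T2 ...
# Prints the ACK time at which slow start would be exited, or "none".
hystart_ack_train() {
  ack_delta=$1 min_rtt=$2 round_start=$3
  shift 3
  last_ack=$round_start                 # reset at the start of each round
  for now in "$@"; do
    # Only the gap between two consecutive ACKs is examined.
    if [ $((now - last_ack)) -le "$ack_delta" ]; then
      last_ack=$now
      # Train longer than half the minimum RTT: assume the pipe is full.
      if [ $((now - round_start)) -gt $((min_rtt / 2)) ]; then
        echo "$now"
        return 0
      fi
    fi
  done
  echo none
}
```

With a 20 ms minimum RTT, paced ACKs every 1 ms (`hystart_ack_train 2 20 0 1 2 3 4 5 6 7 8 9 10 11 12`) exit slow start at 11 ms, while a short burst followed by a gap (`hystart_ack_train 2 20 0 1 1 2 15 16`) never triggers. Disabling the train heuristic while keeping delay detection corresponds to the tcp_cubic module parameter hystart_detect (bit 1 = ACK train, bit 2 = delay), for example writing 2 to /sys/module/tcp_cubic/parameters/hystart_detect.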