[Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)

Dave Taht dave.taht at gmail.com
Mon Mar 24 19:10:54 EDT 2014


On Mon, Mar 24, 2014 at 10:41 AM, Eric Dumazet <eric.dumazet at gmail.com> wrote:
> On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>
>>
>> It has long been my hope that conventional distros would start
>> selecting sch_fq and sch_fq_codel in safe scenarios.
>>
>> 1) Can an appropriate clocksource be detected from userspace?
>>
>> if [ have_good_clocksources ]
>> then
>>     if [ i am a router ]
>>     then
>>         sysctl -w something=fq_codel # or is it an entry in proc?
>>     else
>>         sysctl -w something=sch_fq
>>     fi
>> fi
>>
>
> Sure, you can do all this from user space.
> That's policy, and it does not belong in the kernel.

I tend to agree, except that until recently it was extremely hard to gather
all the data needed to automagically come up with a more optimal
queuing policy on a per machine basis.
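
The clocksource part, at least, is readable from sysfs. A minimal sketch of the have_good_clocksources test from my pseudocode: the sysfs path is the standard one, but the function names, the CLOCKSOURCE_DIR override, and above all the list of "good" clocksources are my own assumptions, not anything the kernel blesses.

```shell
#!/bin/sh
# Where the kernel exposes the active clocksource. Overridable so the
# check can be pointed at a fake directory (hypothetical knob).
CLOCKSOURCE_DIR="${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}"

# Print the clocksource currently in use, or nothing if unreadable.
current_clocksource() {
    cat "$CLOCKSOURCE_DIR/current_clocksource" 2>/dev/null
}

# Succeed only for clocksources on an assumed allow-list.
# ASSUMPTION: this list is a guess at "good enough", adjust to taste.
have_good_clocksource() {
    case "$(current_clocksource)" in
        tsc|kvm-clock|arch_sys_counter) return 0 ;;
        *) return 1 ;;
    esac
}
```

Something like `have_good_clocksource && echo ok` would then gate the rest of the policy.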

How do you tell if a box is a vm or container or raw hardware??
Look for inadequate randomness?
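
One possible approach, sketched under assumptions: ask systemd-detect-virt where it is installed, and otherwise fall back to the "hypervisor" CPU flag that x86 guests usually expose. The function names and the CPUINFO override are mine, the fallback cannot see containers at all, and neither check is guaranteed to be reliable.

```shell
#!/bin/sh
# File to inspect for CPU flags. Overridable for experiments (hypothetical knob).
CPUINFO="${CPUINFO:-/proc/cpuinfo}"

# Fallback heuristic: true if the CPU flags line mentions "hypervisor".
# ASSUMPTION: x86 only, and a hypervisor may choose to hide the flag.
cpu_reports_hypervisor() {
    grep '^flags' "$CPUINFO" 2>/dev/null | grep -qw hypervisor
}

# Print one of: container, vm, bare-metal.
machine_kind() {
    if command -v systemd-detect-virt >/dev/null 2>&1; then
        if systemd-detect-virt --container --quiet; then
            echo container
        elif systemd-detect-virt --vm --quiet; then
            echo vm
        else
            echo bare-metal
        fi
    elif cpu_reports_hypervisor; then
        echo vm
    else
        echo bare-metal
    fi
}
```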

One of my concerns is that sch_fq is (?) currently inadequately
explored in the case of wifi - as best as I recall there were a
ton of drivers that cloned and disappeared the skb deep
in driver buffers, making fq_codel a mildly better choice...

And for a system with both wifi and lan, what then? There's only
the global default....
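
The global default does not stop a script from setting the root qdisc per device, though. A rough sketch, assuming a wireless interface can be recognized by a "wireless" or "phy80211" entry under /sys/class/net, and assuming fq_codel for wifi and fq for everything else is the policy wanted (that split is only my hunch from the above). The function names and SYSFS_NET override are hypothetical.

```shell
#!/bin/sh
# Root of the per-interface sysfs tree. Overridable for experiments.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# True if the interface looks like wifi (WEXT or cfg80211 entry present).
is_wireless() {
    [ -d "$SYSFS_NET/$1/wireless" ] || [ -e "$SYSFS_NET/$1/phy80211" ]
}

# Print the qdisc this (assumed) policy would pick for the interface.
pick_qdisc() {
    if is_wireless "$1"; then
        echo fq_codel
    else
        echo fq
    fi
}

# Install the chosen qdisc as the root qdisc. Needs root and tc.
apply_qdisc() {
    tc qdisc replace dev "$1" root "$(pick_qdisc "$1")"
}
```

Running `apply_qdisc wlan0` and `apply_qdisc eth0` on the same box would then give each interface its own choice, whatever net.core.default_qdisc says.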

> sysctl -w net.core.default_qdisc=fq
>
> # force a load/delete to bring default qdisc for all devices already up
> for ETH in `list of network devices (excluding virtual devices)`
> do
>  tc qdisc add dev $ETH root pfifo 2>/dev/null
>  tc qdisc del dev $ETH root 2>/dev/null
> done
>
>> How early in boot would this have to be to take effect?
>
> It doesn't matter, if you force a load/unload of the qdisc.

Well, I'd argue the if.preup stage is a decent place for it, or
early after sysctl is run but before the devices are initialized.
Otherwise, swapping the qdisc on a live device will lose whatever
data is sitting in the queue.
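
For the `list of network devices (excluding virtual devices)` placeholder in Eric's loop, one way to fill it in is sketched here. It assumes that an interface with a "device" link under /sys/class/net is backed by real (or paravirtual) hardware, while lo, bridges, veth and friends have none. The function names and SYSFS_NET override are mine.

```shell
#!/bin/sh
# Root of the per-interface sysfs tree. Overridable for experiments.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the names of interfaces that have a backing device, one per line.
physical_devices() {
    for path in "$SYSFS_NET"/*; do
        if [ -e "$path/device" ]; then
            basename "$path"
        fi
    done
}

# Add and delete a throwaway root qdisc so the device falls back to
# the current net.core.default_qdisc. Needs root and tc.
bounce_root_qdisc() {
    tc qdisc add dev "$1" root pfifo 2>/dev/null
    tc qdisc del dev "$1" root 2>/dev/null
}

# Usage (as root), after setting net.core.default_qdisc:
#   for dev in $(physical_devices); do bounce_root_qdisc "$dev"; done
```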

>
>>
>> 2) In the case of a server machine providing vms, and meeting the
>> above precondition(s),
>> what would be a more right qdisc, sch_fq or sch_codel?
>
> sch_fq 'works' only for locally generated traffic, as we look at
> skb->sk->sk_pacing_rate to read the per socket rate. No way a
> hypervisor (or a router 2 hops away) can access the original socket
> without hacks.
>
> If your linux vm needs TCP pacing, then it also needs the fq packet
> scheduler in the vm.

Got it.

Most of my own interest on the pacing side is seeing the impact on
slow start on slow egress links to the internet. I kind of hope it's
groovy... (gotta go compile a 3.14 kernel now. wet paint)

>
>>
>> 3) Containers?
>>
>> 4) The machine in the vm going through the virtual ethernet interface?
>>
>> (I don't understand to what extent tracking the exit of packets from tcp through
>> the stack and vm happens - I imagine a TSO is preserved all the way through,
>> and also imagine that tcp small queues doesn't survive transit through the vm,
>> but I am known to have a fevered imagination.)
>
> Small Queues controls the host queues.
>
> Not the queues on external routers. Consider a hypervisor as a router.

Got it. See above for my question on how to determine that reliably.

>
>>
>>
>> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
>> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
>> > every ms. By design, it tends to get ACK trains, way before the cwnd
>> > might reach BDP.
>>
>> Fascinating! Push on one thing, break another. As best I recall hystart had a
>> string of issues like this in its early deployment.
>>
>> /me looks forward to one day escaping 3.10-land and observing this for himself
>>
>> so some sort of bidirectional awareness of the underlying qdisc would be needed
>> to retune hystart properly.
>>
>> Is ms resolution the best possible at this point?
>
> Nope. Hystart ACK train detection is very lazy and current algo was kind
> of a hack. If you use better resolution, then you have problems because
> of ACK jitter in reverse path. Really, only looking at delay between 2
> ACKS is not generic enough, we need something else, or just disable ACK
> TRAIN detection, as it is not that useful. Delay detection is less
> noisy.

I enjoyed re-reading the hystart related papers this morning. The
first 5 hits for it on google were what I'd remembered...

ns2 does not support hystart. Apparently ns3 sort of supports it when
run through the linux driver interface but not in pure simulation...
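
On Eric's point about just disabling ACK train detection: on a real kernel, tcp_cubic exposes a hystart_detect module parameter, a bitmask where 1 is ACK train and 2 is delay. A sketch of clearing only the ACK train bit, with function names and the HYSTART_DETECT override being my own inventions, and no claim that this is the fix Eric has in mind:

```shell
#!/bin/sh
# tcp_cubic's detection bitmask: 1 = ACK train, 2 = delay, 3 = both.
# Overridable so the logic can be tried against a scratch file.
HYSTART_DETECT="${HYSTART_DETECT:-/sys/module/tcp_cubic/parameters/hystart_detect}"

# Print the given bitmask with the ACK train bit cleared.
without_ack_train() {
    echo $(( $1 & ~1 ))
}

# Read the current mask and write it back minus ACK train detection.
# Needs root when pointed at the real parameter.
disable_ack_train_detection() {
    cur=$(cat "$HYSTART_DETECT") || return 1
    without_ack_train "$cur" > "$HYSTART_DETECT"
}
```

With the default of 3 this leaves 2, i.e. delay detection only, which is the less noisy of the two per the above.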





-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html


