* [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)
@ 2014-03-24 17:09 Dave Taht
2014-03-24 17:41 ` Eric Dumazet
0 siblings, 1 reply; 5+ messages in thread
From: Dave Taht @ 2014-03-24 17:09 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Steinar H. Gunderson, bloat
As this thread has forked considerably from "AQM STILL not making it
into l2 equipment",
forking it...
On Sun, Mar 23, 2014 at 12:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-03-21 at 22:13 +0000, Dave Taht wrote:
>
>> Are you ready to make sch_fq the default in 3.15?
>
>
> sch_fq depends on ktime_get(), so it is a no-go if your clocksource is
> hpet. pfifo_fast doesn't have such issues.
It has long been my hope that conventional distros would start
selecting sch_fq and sch_fq_codel in safe scenarios.
1) Can an appropriate clocksource be detected from userspace?
if [ have_good_clocksources ]
then
if [ i am a router ]
then
sysctl -w something=fq_codel # or is it an entry in proc?
else
sysctl -w something=sch_fq
fi
fi
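A minimal sketch of how userspace might answer (1): the kernel exposes the current clocksource under /sys/devices/system/clocksource/. Which names count as "good" here is my assumption, not an authoritative list:

```shell
#!/bin/sh
# Hypothetical check: is the current clocksource cheap enough to read
# that sch_fq's frequent ktime_get() calls won't hurt?
classify_clocksource() {
    # The 'good' set below is an assumption, not an authoritative list.
    case "$1" in
        tsc|kvm-clock|arch_sys_counter) echo good ;;
        *) echo poor ;;   # hpet, acpi_pm, jiffies, ...
    esac
}

CS=/sys/devices/system/clocksource/clocksource0/current_clocksource
cur=$(cat "$CS" 2>/dev/null || echo unknown)
echo "clocksource $cur is $(classify_clocksource "$cur")"
```

The sysfs path is standard on Linux; the policy decision (fq vs fq_codel vs pfifo_fast) would then hang off the classification.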
How early in boot would this have to be to take effect?
2) In the case of a server machine providing vms, and meeting the
above precondition(s),
which qdisc would be more appropriate, sch_fq or sch_fq_codel?
3) Containers?
4) The machine in the vm going through the virtual ethernet interface?
(I don't understand to what extent the exit of packets from tcp through
the stack and vm is tracked - I imagine TSO is preserved all the way through,
and also imagine that tcp small queues doesn't survive transit through the vm,
but I am known to have a fevered imagination.)
> Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
> early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
> every ms. By design, it tends to get ACK trains, way before the cwnd
> might reach BDP.
Fascinating! Push on one thing, break another. As best I recall, hystart had a
string of issues like this in its early deployment.
/me looks forward to one day escaping 3.10-land and observing this for himself
So some sort of bidirectional awareness of the underlying qdisc would be needed
to retune hystart properly.
Is ms resolution the best possible at this point?
--
Dave Täht
Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
* Re: [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)
2014-03-24 17:09 [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes) Dave Taht
@ 2014-03-24 17:41 ` Eric Dumazet
2014-03-24 23:10 ` Dave Taht
0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2014-03-24 17:41 UTC (permalink / raw)
To: Dave Taht; +Cc: Steinar H. Gunderson, bloat
On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>
> It has long been my hope that conventional distros would start
> selecting sch_fq and sch_fq_codel up in safe scenarios.
>
> 1) Can an appropriate clocksource be detected from userspace?
>
> if [ have_good_clocksources ]
> then
> if [ i am a router ]
> then
> sysctl -w something=fq_codel # or is it an entry in proc?
> else
> sysctl -w something=sch_fq
> fi
> fi
>
Sure you can do all this from user space.
That's policy, and it should not belong in the kernel.
sysctl -w net.core.default_qdisc=fq
# force a load/delete to bring default qdisc for all devices already up
for ETH in `list of network devices (excluding virtual devices)`
do
tc qdisc add dev $ETH root pfifo 2>/dev/null
tc qdisc del dev $ETH root 2>/dev/null
done
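One hedged way to fill in the device-list placeholder above: assume that physical NICs (and only they) expose a 'device' symlink under /sys/class/net. That holds for PCI and USB NICs, but it is a heuristic, not a guarantee:

```shell
#!/bin/sh
sysctl -w net.core.default_qdisc=fq

# Physical interfaces usually have a 'device' symlink in sysfs;
# veth/bridge/tun devices usually don't.
for path in /sys/class/net/*/device; do
    [ -e "$path" ] || continue
    ETH=$(basename "$(dirname "$path")")
    # force a load/delete so the new default takes effect
    tc qdisc add dev "$ETH" root pfifo 2>/dev/null
    tc qdisc del dev "$ETH" root 2>/dev/null
done
```

Needs root, and net.core.default_qdisc only exists on kernels that have the knob (3.12+).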
> How early in boot would this have to be to take effect?
It doesn't matter, provided you force a load/unload of the qdisc.
>
> 2) In the case of a server machine providing vms, and meeting the
> above precondition(s),
> what would be a more right qdisc, sch_fq or sch_codel?
sch_fq 'works' only for locally generated traffic, as we look at
skb->sk->sk_pacing_rate to read the per-socket rate. There is no way a
hypervisor (or a router 2 hops away) can access the original socket
without hacks.
If your linux vm needs TCP pacing, then it also needs the fq packet
scheduler in the vm.
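Concretely, that suggests a split like the following (interface names here are placeholders, not anything prescribed in the thread):

```shell
# Inside the guest, where the qdisc can see skb->sk->sk_pacing_rate:
sysctl -w net.core.default_qdisc=fq
tc qdisc replace dev eth0 root fq

# On the hypervisor, which only sees foreign packets with no socket
# attached, skip pacing and just keep per-flow queuing delay low:
tc qdisc replace dev br0 root fq_codel
```

Both commands need root; `tc qdisc replace` works whether or not a root qdisc is already installed.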
>
> 3) Containers?
>
> 4) The machine in the vm going through the virtual ethernet interface?
>
> (I don't understand to what extent tracking the exit of packets from tcp through
> the stack and vm happens - I imagine a TSO is preserved all the way through,
> and also imagine that tcp small queues doesn't survive transit through the vm,
> but I am known to have a fevered imagination.
TCP Small Queues controls the host queues, not the queues on external
routers. Consider a hypervisor as a router.
>
>
> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
> > every ms. By design, it tends to get ACK trains, way before the cwnd
> > might reach BDP.
>
> Fascinating! Push on one thing, break another. As best I recall hystart had a
> string of issues like this in it's early deployment.
>
> /me looks forward to one day escaping 3.10-land and observing this for himself
>
> so some sort of bidirectional awareness of the underlying qdisc would be needed
> to retune hystart properly.
>
> Is ms resolution the best possible at this point?
Nope. Hystart ACK-train detection is very lazy, and the current algorithm
was kind of a hack. If you use better resolution, then you have problems
because of ACK jitter in the reverse path. Really, looking only at the
delay between 2 ACKs is not generic enough; we need something else, or
should just disable ACK-train detection, as it is not that useful. Delay
detection is less noisy.
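For what it's worth, the two detection modes can already be toggled at runtime: tcp_cubic exposes a `hystart_detect` bitmask as a module parameter. The paths below assume CUBIC's parameters are visible under /sys/module/tcp_cubic (true when CUBIC is available, as it is by default on these kernels):

```shell
# hystart_detect is a bitmask: 1 = ACK-train, 2 = delay. Default 3 = both.
cat /sys/module/tcp_cubic/parameters/hystart_detect

# Keep only the less noisy delay detection, dropping ACK-train detection:
echo 2 > /sys/module/tcp_cubic/parameters/hystart_detect

# Or turn hystart off entirely:
echo 0 > /sys/module/tcp_cubic/parameters/hystart
```

Writes need root and affect only new connections' slow-start behavior.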
* Re: [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)
2014-03-24 17:41 ` Eric Dumazet
@ 2014-03-24 23:10 ` Dave Taht
2014-03-25 0:18 ` Eric Dumazet
2014-03-25 0:38 ` Dave Taht
0 siblings, 2 replies; 5+ messages in thread
From: Dave Taht @ 2014-03-24 23:10 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Steinar H. Gunderson, bloat
On Mon, Mar 24, 2014 at 10:41 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>
>>
>> It has long been my hope that conventional distros would start
>> selecting sch_fq and sch_fq_codel up in safe scenarios.
>>
>> 1) Can an appropriate clocksource be detected from userspace?
>>
>> if [ have_good_clocksources ]
>> then
>> if [ i am a router ]
>> then
>> sysctl -w something=fq_codel # or is it an entry in proc?
>> else
>> sysctl -w something=sch_fq
>> fi
>> fi
>>
>
> Sure you can do all this from user space.
> That's policy, and it should not belong in the kernel.
I tend to agree, except that until recently it was extremely hard to gather
all the data needed to automagically come up with a more optimal
queuing policy on a per-machine basis.
How do you tell if a box is a vm, a container, or raw hardware?
Look for inadequate randomness?
One of my concerns is that sch_fq is (?) currently inadequately
explored in the case of wifi - as best as I recall there were a
ton of drivers that cloned and disappeared the skb deep
in driver buffers, making fq_codel a mildly better choice...
And for a system with both wifi and lan, what then? There's only
the global default...
> sysctl -w net.core.default_qdisc=fq
>
> # force a load/delete to bring default qdisc for all devices already up
> for ETH in `list of network devices (excluding virtual devices)`
> do
> tc qdisc add dev $ETH root pfifo 2>/dev/null
> tc qdisc del dev $ETH root 2>/dev/null
> done
>
>> How early in boot would this have to be to take effect?
>
> It doesn't matter, if you force a load/unload of the qdisc.
Well, I'd argue for the if.preup stage as a decent place, or early,
after sysctl is run but before the devices are initialized; otherwise
some data will be lost.
>
>>
>> 2) In the case of a server machine providing vms, and meeting the
>> above precondition(s),
>> what would be a more right qdisc, sch_fq or sch_codel?
>
> sch_fq 'works' only for locally generated traffic, as we look at
> skb->sk->sk_pacing_rate to read the per socket rate. No way an
> hypervisor (or a router 2 hops away) can access to original socket
> without hacks.
>
> If your linux vm needs TCP pacing, then it also need fq packet scheduler
> in the vm.
Got it.
Most of my own interest on the pacing side is seeing the impact on
slow start on slow egress links to the internet. I kind of hope it's
groovy... (gotta go compile a 3.14 kernel now. wet paint)
>
>>
>> 3) Containers?
>>
>> 4) The machine in the vm going through the virtual ethernet interface?
>>
>> (I don't understand to what extent tracking the exit of packets from tcp through
>> the stack and vm happens - I imagine a TSO is preserved all the way through,
>> and also imagine that tcp small queues doesn't survive transit through the vm,
>> but I am known to have a fevered imagination.
>
> Small Queues controls the host queues.
>
> Not the queues on external routers. Consider an hypervisor as a router.
Got it. See above for question on how to determine that reliably?
>
>>
>>
>> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
>> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
>> > every ms. By design, it tends to get ACK trains, way before the cwnd
>> > might reach BDP.
>>
>> Fascinating! Push on one thing, break another. As best I recall hystart had a
>> string of issues like this in it's early deployment.
>>
>> /me looks forward to one day escaping 3.10-land and observing this for himself
>>
>> so some sort of bidirectional awareness of the underlying qdisc would be needed
>> to retune hystart properly.
>>
>> Is ms resolution the best possible at this point?
>
> Nope. Hystart ACK train detection is very lazy and current algo was kind
> of a hack. If you use better resolution, then you have problems because
> of ACK jitter in reverse path. Really, only looking at delay between 2
> ACKS is not generic enough, we need something else, or just disable ACK
> TRAIN detection, as it is not that useful. Delay detection is less
> noisy.
I enjoyed re-reading the hystart-related papers this morning. The
first 5 hits for it on google were what I'd remembered...
ns2 does not support hystart. Apparently ns3 sort of supports it when
run through the Linux driver interface, but not in pure simulation...
--
Dave Täht
Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
* Re: [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)
2014-03-24 23:10 ` Dave Taht
@ 2014-03-25 0:18 ` Eric Dumazet
2014-03-25 0:38 ` Dave Taht
1 sibling, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2014-03-25 0:18 UTC (permalink / raw)
To: Dave Taht; +Cc: Steinar H. Gunderson, bloat
On Mon, 2014-03-24 at 16:10 -0700, Dave Taht wrote:
> One of my concerns is that sch_fq is (?) currently inadaquately
> explored in the case of wifi - as best as I recall there were a
> ton of drivers than cloned and disappeared the skb deep
> in driver buffers, making fq_codel a mildly better choice...
That's because fq_codel does not add delays (other than RR scheduling
among flows). Remember, if there is no queue in fq_codel itself, there is
nothing we can control using the codel law.
In the fq case, packets are held in the qdisc for the needed time and
delivered 'at the right time' to whatever lower device. Part of the
'original TCP bursts' is amortized by sch_fq before reaching the device.
TCP pacing has nothing to do with the additional delays imposed in various
wifi drivers. Consider a packet in a wifi queue exactly as if it had
already been sent on the network.
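The 'right time' above is just length/rate per packet. A toy model of the arithmetic (my own sketch, not sch_fq's actual code, which also handles quantum, bursts, and max-rate caps):

```shell
# Toy model of fq pacing: a packet of LEN bytes is spread over
# LEN / sk_pacing_rate seconds, i.e. a gap of LEN * 1e9 / rate nanoseconds.
pacing_delay_ns() {  # $1 = skb length in bytes, $2 = pacing rate in bytes/sec
    echo $(( $1 * 1000000000 / $2 ))
}

# 1500-byte packets at 10 Mbit/s (1,250,000 bytes/sec):
pacing_delay_ns 1500 1250000    # -> 1200000 ns = 1.2 ms between packets
```

So at typical last-mile rates the inter-packet gaps are on the order of a millisecond, which is exactly the ACK spacing Eric describes tripping hystart's ACK-train detector.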
* Re: [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)
2014-03-24 23:10 ` Dave Taht
2014-03-25 0:18 ` Eric Dumazet
@ 2014-03-25 0:38 ` Dave Taht
1 sibling, 0 replies; 5+ messages in thread
From: Dave Taht @ 2014-03-25 0:38 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Steinar H. Gunderson, bloat
On Mon, Mar 24, 2014 at 4:10 PM, Dave Taht <dave.taht@gmail.com> wrote:
> On Mon, Mar 24, 2014 at 10:41 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>>
>>>
>>> It has long been my hope that conventional distros would start
>>> selecting sch_fq and sch_fq_codel up in safe scenarios.
>>>
>>> 1) Can an appropriate clocksource be detected from userspace?
>>>
>>> if [ have_good_clocksources ]
>>> then
>>> if [ i am a router ]
>>> then
>>> sysctl -w something=fq_codel # or is it an entry in proc?
>>> else
>>> sysctl -w something=sch_fq
>>> fi
>>> fi
>>>
>>
>> Sure you can do all this from user space.
>> That's policy, and it should not belong in the kernel.
>
> I tend to agree, except that until recently it was extremely hard to gather
> all the data needed to automagically come up with a more optimal
> queuing policy on a per machine basis.
>
> How do you tell if a box is a vm or container or raw hardware??
> Look for inadaquate randomness?
Furthermore, how can you tell if a box is hosting vms and should
use fq_codel underneath rather than sch_fq?
>
> One of my concerns is that sch_fq is (?) currently inadaquately
> explored in the case of wifi - as best as I recall there were a
> ton of drivers than cloned and disappeared the skb deep
> in driver buffers, making fq_codel a mildly better choice...
>
> and for a system with both wifi and lan, what then? There's only
> the global default....
>
>> sysctl -w net.core.default_qdisc=fq
>>
>> # force a load/delete to bring default qdisc for all devices already up
>> for ETH in `list of network devices (excluding virtual devices)`
>> do
>> tc qdisc add dev $ETH root pfifo 2>/dev/null
>> tc qdisc del dev $ETH root 2>/dev/null
>> done
>>
>>> How early in boot would this have to be to take effect?
>>
>> It doesn't matter, if you force a load/unload of the qdisc.
>
> well, I'd argue for it in the if.preup stage as being a decent
> place, or early after sysctl is run but before the devices are
> initialized. Some data will be lost by this otherwise.
>
>>
>>>
>>> 2) In the case of a server machine providing vms, and meeting the
>>> above precondition(s),
>>> what would be a more right qdisc, sch_fq or sch_codel?
>>
>> sch_fq 'works' only for locally generated traffic, as we look at
>> skb->sk->sk_pacing_rate to read the per socket rate. No way an
>> hypervisor (or a router 2 hops away) can access to original socket
>> without hacks.
>>
>> If your linux vm needs TCP pacing, then it also need fq packet scheduler
>> in the vm.
>
> Got it.
>
> Most of my own interest on the pacing side is seeing the impact on
> slow start on slow egress links to the internet. I kind of hope it's
> groovy... (gotta go compile a 3.14 kernel now. wet paint)
>
>>
>>>
>>> 3) Containers?
>>>
>>> 4) The machine in the vm going through the virtual ethernet interface?
>>>
>>> (I don't understand to what extent tracking the exit of packets from tcp through
>>> the stack and vm happens - I imagine a TSO is preserved all the way through,
>>> and also imagine that tcp small queues doesn't survive transit through the vm,
>>> but I am known to have a fevered imagination.
>>
>> Small Queues controls the host queues.
>>
>> Not the queues on external routers. Consider an hypervisor as a router.
>
> Got it. See above for question on how to determine that reliably?
>
>>
>>>
>>>
>>> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
>>> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
>>> > every ms. By design, it tends to get ACK trains, way before the cwnd
>>> > might reach BDP.
>>>
>>> Fascinating! Push on one thing, break another. As best I recall hystart had a
>>> string of issues like this in it's early deployment.
>>>
>>> /me looks forward to one day escaping 3.10-land and observing this for himself
>>>
>>> so some sort of bidirectional awareness of the underlying qdisc would be needed
>>> to retune hystart properly.
>>>
>>> Is ms resolution the best possible at this point?
>>
>> Nope. Hystart ACK train detection is very lazy and current algo was kind
>> of a hack. If you use better resolution, then you have problems because
>> of ACK jitter in reverse path. Really, only looking at delay between 2
>> ACKS is not generic enough, we need something else, or just disable ACK
>> TRAIN detection, as it is not that useful. Delay detection is less
>> noisy.
>
One of my all-time favorite commits to the kernel:
Bugfix of the day: poring through what must have been an enormous
data set, looking for the cause of a problem that spiked 24 days after
boot and then went away 24 days later. Merely figuring that
periodicity out must have been a hacker high.
Total size of the problem? A single bit.
commit cd6b423afd3c08b27e1fed52db828ade0addbc6b
Author: Eric Dumazet
Date: Mon Aug 5 20:05:12 2013 -0700
tcp: cubic: fix bug in bictcp_acked()
While investigating about strange increase of retransmit rates
on hosts ~24 days after boot, Van found hystart was disabled
if ca->epoch_start was 0, as following condition is true
when tcp_time_stamp high order bit is set.
(s32)(tcp_time_stamp - ca->epoch_start) < HZ
Quoting Van :
At initialization & after every loss ca->epoch_start is set to zero so
I believe that the above line will turn off hystart as soon as the 2^31
bit is set in tcp_time_stamp & hystart will stay off for 24 days.
I think we've observed that cubic's restart is too aggressive without
hystart so this might account for the higher drop rate we observe.
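The arithmetic of that bug is easy to reproduce. The sketch below just mimics the signed-32-bit comparison from the commit; the helper function is mine, and the variable names only mirror the kernel's:

```shell
#!/bin/sh
# Mimic (s32)(tcp_time_stamp - ca->epoch_start) < HZ when epoch_start == 0
# and tcp_time_stamp has its high-order bit set (~24.8 days at HZ=1000).
s32_diff() {
    d=$(( ($1 - $2) & 4294967295 ))                     # wrap to 32 bits
    [ "$d" -ge 2147483648 ] && d=$(( d - 4294967296 ))  # sign-extend
    echo "$d"
}

HZ=1000
tcp_time_stamp=2147483649   # 0x80000001: bit 31 just set
epoch_start=0
if [ "$(s32_diff "$tcp_time_stamp" "$epoch_start")" -lt "$HZ" ]; then
    echo "hystart wrongly disabled for the next ~24 days"
fi
```

With bit 31 set, the signed difference goes hugely negative, the `< HZ` test stays true, and hystart is silently suppressed until the timestamp wraps again.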
> I enjoyed re-reading the hystart related papers this morning. The
> first 5 hits for it on google were
> what I'd remembered...
>
> ns2 does not support hystart. Apparently ns3 sort of supports it when
> run through the linux driver interface but not in pure simulation...
--
Dave Täht
Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
end of thread, other threads:[~2014-03-25 0:38 UTC | newest]
Thread overview: 5+ messages
2014-03-24 17:09 [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes) Dave Taht
2014-03-24 17:41 ` Eric Dumazet
2014-03-24 23:10 ` Dave Taht
2014-03-25 0:18 ` Eric Dumazet
2014-03-25 0:38 ` Dave Taht