From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <CAA93jw5GLNEaiVyLo4cyQ0cucdsg_QmRB5JVmin8t3kJdAB4ow@mail.gmail.com>
References: <CAA93jw41HM19HjYM3Ny7NLm9XtFpscc+1kFPhcG89Kx1KOrJ6A@mail.gmail.com>
	<1395682887.12610.62.camel@edumazet-glaptop2.roam.corp.google.com>
	<CAA93jw5GLNEaiVyLo4cyQ0cucdsg_QmRB5JVmin8t3kJdAB4ow@mail.gmail.com>
Date: Mon, 24 Mar 2014 17:38:22 -0700
Message-ID: <CAA93jw5Oe3e0pZ7HCDZ-1yYsQpZ5jTHatut-Xgbqyu0_cpUFKA@mail.gmail.com>
From: Dave Taht <dave.taht@gmail.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: "Steinar H. Gunderson" <sesse@samfundet.no>,
	bloat <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)

On Mon, Mar 24, 2014 at 4:10 PM, Dave Taht <dave.taht@gmail.com> wrote:
> On Mon, Mar 24, 2014 at 10:41 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>>
>>>
>>> It has long been my hope that conventional distros would start
>>> selecting sch_fq and sch_fq_codel up in safe scenarios.
>>>
>>> 1) Can an appropriate clocksource be detected from userspace?
>>>
>>> if [ have_good_clocksources ]
>>> then
>>> if [ i am a router ]
>>> then
>>> sysctl -w something=fq_codel # or is it an entry in proc?
>>> else
>>> sysctl -w something=sch_fq
>>> fi
>>> fi
>>>
>>
>> Sure you can do all this from user space.
>> That's policy, and this should not belong in the kernel.
>
> I tend to agree, except that until recently it was extremely hard to gather
> all the data needed to automagically come up with a more optimal
> queuing policy on a per machine basis.
>
> How do you tell if a box is a vm or container or raw hardware??
> Look for inadequate randomness?

Furthermore, how can you tell if a box is hosting vms and should
use fq_codel underneath rather than sch_fq?
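
A rough userspace sketch of the clocksource/router test from earlier in
the thread, in case it helps the discussion. The sysfs and proc paths are
the standard ones; everything else is my assumption: the list of "good"
clocksources, treating ip_forward=1 as "router", falling back to
pfifo_fast, and the helper names have_good_clocksource, is_router and
pick_qdisc. It does not answer the vm-host question above.

```shell
#!/bin/sh
# Sketch only: suggest a default qdisc from what userspace can see.
# Paths can be overridden, which also makes the logic testable.
CS_FILE=${CS_FILE:-/sys/devices/system/clocksource/clocksource0/current_clocksource}
FWD_FILE=${FWD_FILE:-/proc/sys/net/ipv4/ip_forward}

have_good_clocksource() {
    # Assumption: these three count as "good enough"; adjust to taste.
    case "$(cat "$CS_FILE" 2>/dev/null)" in
        tsc|hpet|kvm-clock) return 0 ;;
        *) return 1 ;;
    esac
}

is_router() {
    # Assumption: IPv4 forwarding enabled means "I am a router".
    [ "$(cat "$FWD_FILE" 2>/dev/null)" = "1" ]
}

pick_qdisc() {
    if ! have_good_clocksource; then
        echo pfifo_fast      # assumed fallback: leave the old default alone
    elif is_router; then
        echo fq_codel
    else
        echo fq
    fi
}

# Print a suggestion rather than changing anything.
echo "suggested: sysctl -w net.core.default_qdisc=$(pick_qdisc)"
```

This only prints a suggestion. A bridging hypervisor would not
necessarily have ip_forward set, so the router test is the weakest part.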

>
> One of my concerns is that sch_fq is (?) currently inadequately
> explored in the case of wifi - as best as I recall there were a
> ton of drivers that cloned and disappeared the skb deep
> in driver buffers, making fq_codel a mildly better choice...
>
> and for a system with both wifi and lan, what then? There's only
> the global default....
>
>> sysctl -w net.core.default_qdisc=fq
>>
>> # force a load/delete to bring default qdisc for all devices already up
>> # (physical devices only: virtual ones have no "device" link in sysfs)
>> for DEV in /sys/class/net/*
>> do
>>  [ -e "$DEV/device" ] || continue
>>  ETH=${DEV##*/}
>>  tc qdisc add dev $ETH root pfifo 2>/dev/null
>>  tc qdisc del dev $ETH root 2>/dev/null
>> done
>>
>>> How early in boot would this have to be to take effect?
>>
>> It doesn't matter, if you force a load/unload of the qdisc.
>
> well, I'd argue for it in the if-pre-up stage as being a decent
> place, or early after sysctl is run but before the devices are
> initialized. Some data will be lost by this otherwise.
>
>>
>>>
>>> 2) In the case of a server machine providing vms, and meeting the
>>> above precondition(s),
>>> what would be a more right qdisc, sch_fq or sch_codel?
>>
>> sch_fq 'works' only for locally generated traffic, as we look at
>> skb->sk->sk_pacing_rate to read the per socket rate. No way a
>> hypervisor (or a router 2 hops away) can access the original socket
>> without hacks.
>>
>> If your linux vm needs TCP pacing, then it also needs the fq packet scheduler
>> in the vm.
>
> Got it.
>
> Most of my own interest on the pacing side is seeing the impact on
> slow start on slow egress links to the internet. I kind of hope it's
> groovy... (gotta go compile a 3.14 kernel now. wet paint)
>
>>
>>>
>>> 3) Containers?
>>>
>>> 4) The machine in the vm going through the virtual ethernet interface?
>>>
>>> (I don't understand to what extent tracking the exit of packets from tcp through
>>> the stack and vm happens - I imagine a TSO is preserved all the way through,
>>> and also imagine that tcp small queues doesn't survive transit through the vm,
>>> but I am known to have a fevered imagination.)
>>
>> Small Queues controls the host queues.
>>
>> Not the queues on external routers. Consider a hypervisor as a router.
>
> Got it. See above for question on how to determine that reliably?
>
>>
>>>
>>>
>>> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
>>> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
>>> > every ms. By design, it tends to get ACK trains, way before the cwnd
>>> > might reach BDP.
>>>
>>> Fascinating! Push on one thing, break another. As best I recall hystart had a
>>> string of issues like this in its early deployment.
>>>
>>> /me looks forward to one day escaping 3.10-land and observing this for himself
>>>
>>> so some sort of bidirectional awareness of the underlying qdisc would be needed
>>> to retune hystart properly.
>>>
>>> Is ms resolution the best possible at this point?
>>
>> Nope. Hystart ACK train detection is very lazy and current algo was kind
>> of a hack. If you use better resolution, then you have problems because
>> of ACK jitter in reverse path. Really, only looking at delay between 2
>> ACKS is not generic enough, we need something else, or just disable ACK
>> TRAIN detection, as it is not that useful. Delay detection is less
>> noisy.
>

One of my all time favorite commits to the kernel:

Bugfix of the day: Poring through what must have been an enormous
data set, looking for a cause for a problem that spiked 24 days after
boot and then went away 24 days later. Merely figuring that
periodicity out must have been a hacker high.

Total size of the problem? A single bit.

commit cd6b423afd3c08b27e1fed52db828ade0addbc6b
Author: Eric Dumazet
Date:   Mon Aug 5 20:05:12 2013 -0700

    tcp: cubic: fix bug in bictcp_acked()

    While investigating about strange increase of retransmit rates
    on hosts ~24 days after boot, Van found hystart was disabled
    if ca->epoch_start was 0, as following condition is true
    when tcp_time_stamp high order bit is set.

    (s32)(tcp_time_stamp - ca->epoch_start) < HZ

    Quoting Van :

     At initialization & after every loss ca->epoch_start is set to zero so
     I believe that the above line will turn off hystart as soon as the 2^31
     bit is set in tcp_time_stamp & hystart will stay off for 24 days.
     I think we've observed that cubic's restart is too aggressive without
     hystart so this might account for the higher drop rate we observe.
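
To make the arithmetic concrete, here is a small shell simulation of that
signed 32-bit comparison. It is a sketch, not the kernel code: HZ=1000 is
assumed, and s32_diff and samples_discarded are made-up helper names.

```shell
#!/bin/sh
# Simulate (s32)(tcp_time_stamp - epoch_start) < HZ using 64-bit shell math.
HZ=1000   # assumption: 1000 ticks per second

s32_diff() {
    # $1 - $2 modulo 2^32, then reinterpreted as a signed 32-bit value.
    d=$(( ($1 - $2) & 0xFFFFFFFF ))
    if [ "$(( d >= 2147483648 ))" -eq 1 ]; then
        d=$(( d - 4294967296 ))
    fi
    echo "$d"
}

samples_discarded() {
    # $1 = tcp_time_stamp, $2 = epoch_start; prints yes or no.
    if [ "$(s32_diff "$1" "$2")" -lt "$HZ" ]; then
        echo yes
    else
        echo no
    fi
}

# epoch_start is 0 after init or a loss; compare a low and a high timestamp.
echo "ts=5000 epoch=0: discarded=$(samples_discarded 5000 0)"
echo "ts=2^31 epoch=0: discarded=$(samples_discarded 2147483648 0)"
echo "whole days until bit 31 sets: $(( 2147483648 / HZ / 86400 ))"
```

With epoch_start at 0, every timestamp with bit 31 set gives a negative
difference, so the test stays true until the counter wraps again. At
HZ=1000, 2^31 ticks is a bit under 25 days, which matches the periodicity.
As I recall, the fix was to apply the check only when ca->epoch_start is
nonzero.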

> I enjoyed re-reading the hystart related papers this morning. The
> first 5 hits for it on google were
> what I'd remembered...
>
> ns2 does not support hystart. Apparently ns3 sort of supports it when
> run through the linux driver interface but not in pure simulation...

-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html