From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dave.taht@gmail.com>
Received: from mail-we0-x233.google.com (mail-we0-x233.google.com
	[IPv6:2a00:1450:400c:c03::233])
	(using TLSv1 with cipher RC4-SHA (128/128 bits))
	(Client CN "smtp.gmail.com",
	Issuer "Google Internet Authority G2" (verified OK))
	by huchra.bufferbloat.net (Postfix) with ESMTPS id A765E21F1EC
	for <bloat@lists.bufferbloat.net>; Mon, 24 Mar 2014 16:10:56 -0700 (PDT)
Received: by mail-we0-f179.google.com with SMTP id x48so3801805wes.38
	for <bloat@lists.bufferbloat.net>; Mon, 24 Mar 2014 16:10:54 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	bh=ElSYJu1CYGUuaMpQaoidXuGq98MZ1bUKhbh6NBuM3Gs=;
	b=tqLSqZ7EsXLIXwfneP0xfT/LhDDdnloDcjj1FqPrvPKiey6o9vLYqwZrDsy7ynwTw2
	h7/P6SKW4yAH2I3w3EVeiDl9teshZX54LfCivoTZnQTEQvi0UH9wH33kmT9SfrEhdpBx
	qiUAr1klDGiYnXGRHKVi5qxDqZrTj+PM5T6T0VL+UQGTIrfkG0jR28SvSD86m5dVmInn
	naDV5yA1pcNJV4s6IFNQBabKA031zbC938THiNvk52Gc055aUzljceLlzDb7URlws/ns
	khd6Q/8Wr3mreDk4eOSWSUmCdTzNzFPix4rj2bMdx2+fRgnloUlWVpDlr4FJ4fyzUlH0
	HcQg==
MIME-Version: 1.0
X-Received: by 10.180.164.106 with SMTP id yp10mr19324810wib.48.1395702654622; 
	Mon, 24 Mar 2014 16:10:54 -0700 (PDT)
Received: by 10.216.8.1 with HTTP; Mon, 24 Mar 2014 16:10:54 -0700 (PDT)
In-Reply-To: <1395682887.12610.62.camel@edumazet-glaptop2.roam.corp.google.com>
References: <CAA93jw41HM19HjYM3Ny7NLm9XtFpscc+1kFPhcG89Kx1KOrJ6A@mail.gmail.com>
	<1395682887.12610.62.camel@edumazet-glaptop2.roam.corp.google.com>
Date: Mon, 24 Mar 2014 16:10:54 -0700
Message-ID: <CAA93jw5GLNEaiVyLo4cyQ0cucdsg_QmRB5JVmin8t3kJdAB4ow@mail.gmail.com>
From: Dave Taht <dave.taht@gmail.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: "Steinar H. Gunderson" <sesse@samfundet.no>,
	bloat <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)
X-BeenThere: bloat@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: General list for discussing Bufferbloat <bloat.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/bloat>
List-Post: <mailto:bloat@lists.bufferbloat.net>
List-Help: <mailto:bloat-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Mon, 24 Mar 2014 23:10:57 -0000

On Mon, Mar 24, 2014 at 10:41 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>
>>
>> It has long been my hope that conventional distros would start
>> selecting sch_fq and sch_fq_codel up in safe scenarios.
>>
>> 1) Can an appropriate clocksource be detected from userspace?
>>
>> if [ have_good_clocksources ]
>> then
>> if [ i am a router ]
>> then
>> sysctl -w something=fq_codel # or is it an entry in proc?
>> else
>> sysctl -w something=sch_fq
>> fi
>> fi
>>
>
> Sure you can do all this from user space.
> That's policy, and this should not belong to the kernel.

I tend to agree, except that until recently it was extremely hard to gather
all the data needed to automagically come up with a more optimal
queuing policy on a per machine basis.

How do you tell if a box is a VM, a container, or raw hardware?
Look for inadequate randomness?
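
A rough userspace sketch of the two checks in question (clocksource and
virtualization), offered as heuristics rather than a reliable answer. The
list of "good" clocksources is an assumption made for illustration, the
function names are hypothetical, and the cpuinfo "hypervisor" flag is only
meaningful on x86. SYSFS_ROOT and CPUINFO are overridable only so the
functions can be exercised against fake files.

```shell
#!/bin/sh
# Heuristic sketch: report the active clocksource and guess at virtualization.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
CPUINFO="${CPUINFO:-/proc/cpuinfo}"

# Print the clocksource the kernel is currently using (empty if unreadable).
current_clocksource() {
    cat "$SYSFS_ROOT/devices/system/clocksource/clocksource0/current_clocksource" 2>/dev/null
}

# ASSUMPTION: tsc, hpet and kvm-clock are treated as "good"; tune to taste.
have_good_clocksource() {
    case "$(current_clocksource)" in
        tsc|hpet|kvm-clock) return 0 ;;
        *) return 1 ;;
    esac
}

# True if the CPU flags advertise a hypervisor (x86 guests usually do).
cpu_has_hypervisor_flag() {
    grep -qw hypervisor "$CPUINFO" 2>/dev/null
}

# Prefer systemd-detect-virt when installed (it prints "none" on bare metal
# and covers containers too); otherwise fall back to the cpuinfo flag.
virt_kind() {
    if command -v systemd-detect-virt >/dev/null 2>&1; then
        systemd-detect-virt 2>/dev/null || true
    elif cpu_has_hypervisor_flag; then
        echo vm
    else
        echo unknown-or-bare-metal
    fi
}

echo "clocksource=$(current_clocksource) virt=$(virt_kind)"
```

Neither check distinguishes a router from an end host; that part of the
policy would still need to come from somewhere else.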

One of my concerns is that sch_fq is (?) currently inadequately
explored in the case of wifi - as best as I recall there were a
ton of drivers that cloned and disappeared the skb deep
in driver buffers, making fq_codel a mildly better choice...

and for a system with both wifi and lan, what then? There's only
the global default....
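
One way around the single global default is to set the root qdisc per
device. The sketch below only prints the tc commands it would run. It
assumes a wireless interface can be recognized by a "wireless" or
"phy80211" entry under /sys/class/net/<dev>, and it takes the tentative
fq_codel-for-wifi, fq-for-wired split from the paragraph above as its
policy. The function names and NET_ROOT are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: choose a root qdisc per interface instead of one global default.
NET_ROOT="${NET_ROOT:-/sys/class/net}"

# Heuristic: wireless interfaces expose a wireless/ dir or a phy80211 link.
is_wireless() {
    [ -d "$NET_ROOT/$1/wireless" ] || [ -e "$NET_ROOT/$1/phy80211" ]
}

# Policy assumed from the discussion: fq_codel on wifi, fq elsewhere.
qdisc_for_dev() {
    if is_wireless "$1"; then echo fq_codel; else echo fq; fi
}

# Print (not execute) one "tc qdisc replace" per interface, skipping loopback.
print_qdisc_plan() {
    for path in "$NET_ROOT"/*; do
        [ -d "$path" ] || continue
        dev=${path##*/}
        [ "$dev" = lo ] && continue
        echo "tc qdisc replace dev $dev root $(qdisc_for_dev "$dev")"
    done
}

print_qdisc_plan
```

Piping the output to sh as root would apply the plan. Virtual devices
(bridges, veth, tunnels) are not filtered out here and probably should be.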

> sysctl -w net.core.default_qdisc=fq
>
> # force a load/delete to bring default qdisc for all devices already up
> for ETH in `list of network devices (excluding virtual devices)`
> do
>  tc qdisc add dev $ETH root pfifo 2>/dev/null
>  tc qdisc del dev $ETH root 2>/dev/null
> done
>
>> How early in boot would this have to be to take effect?
>
> It doesn't matter, if you force a load/unload of the qdisc.

well, I'd argue for it in the if.preup stage as being a decent
place, or early after sysctl is run but before the devices are
initialized. Some data will be lost by this otherwise.
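
For reference, a runnable take on the quoted loop, with the "list of
network devices" placeholder filled in. It assumes that an interface
backed by real (or paravirtual) hardware has a "device" entry under
/sys/class/net/<dev>, which loopback, bridges and veth pairs lack. It
defaults to a dry run that only prints the commands. DRY_RUN, NET_ROOT
and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: set the default qdisc, then add/delete a root qdisc on each
# non-virtual device so interfaces that are already up pick up the default.
NET_ROOT="${NET_ROOT:-/sys/class/net}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 (as root) to actually run the commands

# Heuristic: only hardware-backed interfaces have a "device" entry in sysfs.
physical_devs() {
    for path in "$NET_ROOT"/*; do
        [ -e "$path/device" ] || continue
        echo "${path##*/}"
    done
}

# Print the command in dry-run mode; otherwise run it and ignore failures,
# mirroring the 2>/dev/null in the quoted loop.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@" 2>/dev/null || true
    fi
}

kick_default_qdisc() {
    run sysctl -w net.core.default_qdisc=fq
    for dev in $(physical_devs); do
        run tc qdisc add dev "$dev" root pfifo
        run tc qdisc del dev "$dev" root
    done
}

kick_default_qdisc
```

Packets queued on a device at the moment its root qdisc is swapped may be
dropped, which is the argument for doing this before the interface carries
traffic.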

>
>>
>> 2) In the case of a server machine providing vms, and meeting the
>> above precondition(s),
>> what would be a more right qdisc, sch_fq or sch_codel?
>
> sch_fq 'works' only for locally generated traffic, as we look at
> skb->sk->sk_pacing_rate to read the per socket rate. No way an
> hypervisor (or a router 2 hops away) can access to original socket
> without hacks.
>
> If your linux vm needs TCP pacing, then it also need fq packet scheduler
> in the vm.

Got it.

Most of my own interest on the pacing side is seeing the impact on
slow start on slow egress links to the internet. I kind of hope it's
groovy... (gotta go compile a 3.14 kernel now. wet paint)

>
>>
>> 3) Containers?
>>
>> 4) The machine in the vm going through the virtual ethernet interface?
>>
>> (I don't understand to what extent tracking the exit of packets from tcp through
>> the stack and vm happens - I imagine a TSO is preserved all the way through,
>> and also imagine that tcp small queues doesn't survive transit through the vm,
>> but I am known to have a fevered imagination.
>
> Small Queues controls the host queues.
>
> Not the queues on external routers. Consider an hypervisor as a router.

Got it. See my question above on how to determine that reliably.

>
>>
>>
>> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
>> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
>> > every ms. By design, it tends to get ACK trains, way before the cwnd
>> > might reach BDP.
>>
>> Fascinating! Push on one thing, break another. As best I recall hystart had a
>> string of issues like this in its early deployment.
>>
>> /me looks forward to one day escaping 3.10-land and observing this for himself
>>
>> so some sort of bidirectional awareness of the underlying qdisc would be needed
>> to retune hystart properly.
>>
>> Is ms resolution the best possible at this point?
>
> Nope. Hystart ACK train detection is very lazy and current algo was kind
> of a hack. If you use better resolution, then you have problems because
> of ACK jitter in reverse path. Really, only looking at delay between 2
> ACKS is not generic enough, we need something else, or just disable ACK
> TRAIN detection, as it is not that useful. Delay detection is less
> noisy.

I enjoyed re-reading the hystart related papers this morning. The
first 5 hits for it on Google were what I'd remembered...

ns2 does not support hystart. Apparently ns3 sort of supports it when
run through the linux driver interface but not in pure simulation...


>
>
>


-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html