From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <eric.dumazet@gmail.com>
Received: from mail-pd0-x230.google.com (mail-pd0-x230.google.com
	[IPv6:2607:f8b0:400e:c02::230])
	(using TLSv1 with cipher RC4-SHA (128/128 bits))
	(Client CN "smtp.gmail.com",
	Issuer "Google Internet Authority G2" (verified OK))
	by huchra.bufferbloat.net (Postfix) with ESMTPS id 4179721F1AC
	for <bloat@lists.bufferbloat.net>; Mon, 24 Mar 2014 10:41:29 -0700 (PDT)
Received: by mail-pd0-f176.google.com with SMTP id r10so5637241pdi.35
	for <bloat@lists.bufferbloat.net>; Mon, 24 Mar 2014 10:41:28 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=message-id:subject:from:to:cc:date:in-reply-to:references
	:content-type:content-transfer-encoding:mime-version;
	bh=3HWc+XdxxEk3qbafZzJVyrw5GV5GoTrcKMSPypGl1m0=;
	b=yBzJFI7DGLfgp+HbtWm7vwYra4q/ZvXonBt8zdpUFS3ZQxJidKHByqou17VR+Qi3Yi
	EdDImMekLpOFYhJ3GUTm9Tdfthmdd6/vySrYcaQyg8AGF/+eUCIAiupQHR5l72JcE0bm
	l5HyfFJFCuajVM0GsHmI6n3LxsCpNE7VHxqOWJhhi6H8s783RFuHk28QtVjYc1oATD+K
	7G2P+seVQzKHhXJJhvKHwLdFeN1clOJkTrw1oLXWhQoe778wnOOWWqoMv4aJ2/UOQAlv
	wggKk35mn5f6sMBDr1ekB/rbE635MOSPao9WnJds4AsFHMRvW0NzqHGnNKHci7oJ4y9n
	jlaA==
X-Received: by 10.68.237.99 with SMTP id vb3mr73163279pbc.76.1395682888858;
	Mon, 24 Mar 2014 10:41:28 -0700 (PDT)
Received: from ?IPv6:2620:0:1000:3e02:24c4:1b92:9759:b60a?
	([2620:0:1000:3e02:24c4:1b92:9759:b60a])
	by mx.google.com with ESMTPSA id
	qq5sm34386319pbb.24.2014.03.24.10.41.27 for <multiple recipients>
	(version=SSLv3 cipher=RC4-SHA bits=128/128);
	Mon, 24 Mar 2014 10:41:28 -0700 (PDT)
Message-ID: <1395682887.12610.62.camel@edumazet-glaptop2.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Dave Taht <dave.taht@gmail.com>
Date: Mon, 24 Mar 2014 10:41:27 -0700
In-Reply-To: <CAA93jw41HM19HjYM3Ny7NLm9XtFpscc+1kFPhcG89Kx1KOrJ6A@mail.gmail.com>
References: <CAA93jw41HM19HjYM3Ny7NLm9XtFpscc+1kFPhcG89Kx1KOrJ6A@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.2.3-0ubuntu6 
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0
Cc: "Steinar H.
	Gunderson" <sesse@samfundet.no>, bloat <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] Replacing pfifo_fast? (and using sch_fq + hystart fixes)
X-BeenThere: bloat@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: General list for discussing Bufferbloat <bloat.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/bloat>
List-Post: <mailto:bloat@lists.bufferbloat.net>
List-Help: <mailto:bloat-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Mon, 24 Mar 2014 17:41:29 -0000

On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:

> 
> It has long been my hope that conventional distros would start
> selecting sch_fq and sch_fq_codel up in safe scenarios.
> 
> 1) Can an appropriate clocksource be detected from userspace?
> 
> if [ have_good_clocksources ]
> then
> if [ i am a router ]
> then
> sysctl -w something=fq_codel # or is it an entry in proc?
> else
> sysctl -w something=sch_fq
> fi
> fi
> 

Sure, you can do all of this from user space.
That's policy, and it does not belong in the kernel.

sysctl -w net.core.default_qdisc=fq

# force a load/delete to bring default qdisc for all devices already up
for ETH in `list of network devices (excluding virtual devices)`
do
 tc qdisc add dev $ETH root pfifo 2>/dev/null
 tc qdisc del dev $ETH root 2>/dev/null
done
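
For reference, one way to fill in the "list of network devices" placeholder
is sketched below. It rests on an assumption that is not stated above:
virtual interfaces (lo, bridges, veth, tun) have no "device" entry under
/sys/class/net, while hardware NICs do. The function names are made up for
illustration, and some paravirtual NICs may also show a "device" entry, so
check the list before trusting it.

```shell
#!/bin/sh
# List interfaces under a sysfs net directory that are backed by a device.
# Assumption: virtual interfaces have no "device" entry in sysfs.
# The optional argument overrides the sysfs directory (handy for testing).
list_physical_netdevs() {
    sysnet="${1:-/sys/class/net}"
    for path in "$sysnet"/*; do
        [ -e "$path/device" ] || continue
        basename "$path"
    done
}

# Add and delete a root qdisc on each listed device so that it falls back
# to the current net.core.default_qdisc. Needs root; errors are ignored,
# as in the loop above.
reset_root_qdisc() {
    for dev in $(list_physical_netdevs "$@"); do
        tc qdisc add dev "$dev" root pfifo 2>/dev/null
        tc qdisc del dev "$dev" root 2>/dev/null
    done
}
```

Typical use would be "sysctl -w net.core.default_qdisc=fq" followed by
"reset_root_qdisc", both as root.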

> How early in boot would this have to be to take effect?

It doesn't matter if you force a load/unload of the qdisc.

> 
> 2) In the case of a server machine providing vms, and meeting the
> above precondition(s),
> what would be a more right qdisc, sch_fq or sch_codel?

sch_fq 'works' only for locally generated traffic, as we look at
skb->sk->sk_pacing_rate to read the per-socket rate. There is no way a
hypervisor (or a router 2 hops away) can access the original socket
without hacks.

If your Linux VM needs TCP pacing, then it also needs the fq packet
scheduler in the VM.
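
A quick way to check this from inside the guest is to look at what
"tc qdisc show" reports. The helper below is only a sketch: its name is
made up, and it assumes the usual output shape where each line starts with
"qdisc <kind> <handle>:". The trailing space in the pattern keeps fq_codel
from matching.

```shell
#!/bin/sh
# Succeed if the "tc qdisc show" output on stdin contains an fq qdisc.
# "qdisc fq " (with the trailing space) does not match "qdisc fq_codel".
has_fq_qdisc() {
    grep -q '^qdisc fq '
}
```

For example, inside the VM: "tc qdisc show dev eth0 | has_fq_qdisc", where
eth0 stands in for whatever the guest interface is called.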

> 
> 3) Containers?
> 
> 4) The machine in the vm going through the virtual ethernet interface?
> 
> (I don't understand to what extent tracking the exit of packets from tcp through
> the stack and vm happens - I imagine a TSO is preserved all the way through,
> and also imagine that tcp small queues doesn't survive transit through the vm,
> but I am known to have a fevered imagination.

TCP Small Queues controls the host queues, not the queues on external
routers. Consider a hypervisor as a router.

> 
> 
> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
> > every ms. By design, it tends to get ACK trains, way before the cwnd
> > might reach BDP.
> 
> Fascinating! Push on one thing, break another. As best I recall hystart had a
> string of issues like this in it's early deployment.
> 
> /me looks forward to one day escaping 3.10-land and observing this for himself
> 
> so some sort of bidirectional awareness of the underlying qdisc would be needed
> to retune hystart properly.
> 
> Is ms resolution the best possible at this point?

Nope. Hystart ACK train detection is very lazy, and the current algo was
kind of a hack. If you use better resolution, then you have problems
because of ACK jitter in the reverse path. Really, only looking at the
delay between 2 ACKs is not generic enough. We need something else, or we
should just disable ACK TRAIN detection, as it is not that useful. Delay
detection is less noisy.
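
To try running without ACK train detection, the tcp_cubic module has a
hystart_detect parameter, a bitmask where 1 is packet-train and 2 is delay
(both by default). The sketch below clears only the train bit. The function
names are made up for illustration, and the parameter path is the usual
sysfs location, assuming tcp_cubic is built in or loaded.

```shell
#!/bin/sh
# Clear the packet-train bit (1) of a hystart_detect bitmask,
# leaving the delay bit (2) as it was.
delay_only_mask() {
    echo $(( $1 & ~1 ))
}

# Rewrite the tcp_cubic parameter with the train bit cleared. Needs root.
# The optional argument overrides the parameter path (handy for testing).
disable_ack_train() {
    param="${1:-/sys/module/tcp_cubic/parameters/hystart_detect}"
    cur=$(cat "$param") || return 1
    delay_only_mask "$cur" > "$param"
}
```

With the default value of 3 this leaves 2, so only delay detection remains
active.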