From: Eric Dumazet
To: Dave Taht
Cc: "Steinar H. Gunderson", bloat
Date: Mon, 24 Mar 2014 10:41:27 -0700
Message-ID: <1395682887.12610.62.camel@edumazet-glaptop2.roam.corp.google.com>
Subject: Re: [Bloat] Replacing pfifo_fast?
 (and using sch_fq + hystart fixes)
List-Id: General list for discussing Bufferbloat

On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>
> It has long been my hope that conventional distros would start
> selecting sch_fq and sch_fq_codel up in safe scenarios.
>
> 1) Can an appropriate clocksource be detected from userspace?
>
> if [ have_good_clocksources ]
> then
>     if [ i am a router ]
>     then
>         sysctl -w something=fq_codel # or is it an entry in proc?
>     else
>         sysctl -w something=sch_fq
>     fi
> fi
>

Sure, you can do all this from user space.
That's policy, and it does not belong in the kernel.

sysctl -w net.core.default_qdisc=fq
# force a load/delete to bring default qdisc for all devices already up
for ETH in `list of network devices (excluding virtual devices)`
do
    tc qdisc add dev $ETH root pfifo 2>/dev/null
    tc qdisc del dev $ETH root 2>/dev/null
done

> How early in boot would this have to be to take effect?

It doesn't matter, if you force a load/unload of the qdisc.

>
> 2) In the case of a server machine providing vms, and meeting the
> above precondition(s),
> what would be a more right qdisc, sch_fq or sch_codel?

sch_fq 'works' only for locally generated traffic, as we look at
skb->sk->sk_pacing_rate to read the per-socket rate.

There is no way a hypervisor (or a router 2 hops away) can access the
original socket without hacks.

If your linux vm needs TCP pacing, then it also needs the fq packet
scheduler in the vm.

>
> 3) Containers?
>
> 4) The machine in the vm going through the virtual ethernet interface?
>
> (I don't understand to what extent tracking the exit of packets from tcp through
> the stack and vm happens - I imagine a TSO is preserved all the way through,
> and also imagine that tcp small queues doesn't survive transit through the vm,
> but I am known to have a fevered imagination.)

TCP Small Queues controls the host queues, not the queues on external
routers. Consider a hypervisor as a router.

>
>
> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
> > every ms. By design, it tends to get ACK trains, way before the cwnd
> > might reach BDP.
>
> Fascinating! Push on one thing, break another. As best I recall hystart had a
> string of issues like this in its early deployment.
>
> /me looks forward to one day escaping 3.10-land and observing this for himself
>
> so some sort of bidirectional awareness of the underlying qdisc would be needed
> to retune hystart properly.
>
> Is ms resolution the best possible at this point?

Nope. Hystart ACK train detection is very lazy, and the current algo was
kind of a hack.

If you use better resolution, then you have problems because of ACK
jitter in the reverse path.

Really, looking only at the delay between 2 ACKs is not generic enough;
we need something else, or we should just disable ACK TRAIN detection,
as it is not that useful.

Delay detection is less noisy.
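
For reference, the device list in the qdisc reload loop earlier in this message is left as a placeholder. Below is a minimal sketch of one way it might be filled in. It assumes that an interface with a "device" link under /sys/class/net is backed by hardware, which is a common heuristic rather than a guarantee (virtio NICs inside a vm, for instance, also have one). The function names and the SYSFS_NET override are hypothetical, introduced here only for illustration and testing.

```shell
#!/bin/sh
# Assumption: an interface with a "device" entry in sysfs is backed by
# hardware; lo, veth, bridges and similar virtual devices have none.
# SYSFS_NET is a hypothetical override so the function can be tested.
list_physical_devices() {
    net_dir="${SYSFS_NET:-/sys/class/net}"
    for path in "$net_dir"/*; do
        [ -e "$path/device" ] && basename "$path"
    done
    return 0
}

# Add and then delete a throwaway root qdisc on each device so that the
# device falls back to the configured net.core.default_qdisc.
# Needs root privileges; errors are ignored, as in the original loop.
reload_default_qdisc() {
    for dev in $(list_physical_devices); do
        tc qdisc add dev "$dev" root pfifo 2>/dev/null
        tc qdisc del dev "$dev" root 2>/dev/null
    done
}
```

To use it, run `sysctl -w net.core.default_qdisc=fq` as root and then call `reload_default_qdisc`. Checking the output of `list_physical_devices` first is a cheap way to see which interfaces would be touched.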