From: Dave Taht
To: Eric Dumazet
Cc: "Steinar H. Gunderson", bloat
Date: Mon, 24 Mar 2014 17:38:22 -0700
References: <1395682887.12610.62.camel@edumazet-glaptop2.roam.corp.google.com>
Subject: Re: [Bloat] Replacing pfifo_fast?
 (and using sch_fq + hystart fixes)
List-Id: General list for discussing Bufferbloat

On Mon, Mar 24, 2014 at 4:10 PM, Dave Taht wrote:
> On Mon, Mar 24, 2014 at 10:41 AM, Eric Dumazet wrote:
>> On Mon, 2014-03-24 at 10:09 -0700, Dave Taht wrote:
>>
>>>
>>> It has long been my hope that conventional distros would start
>>> selecting sch_fq and sch_fq_codel in safe scenarios.
>>>
>>> 1) Can an appropriate clocksource be detected from userspace?
>>>
>>> if [ have_good_clocksources ]
>>> then
>>>     if [ i am a router ]
>>>     then
>>>         sysctl -w something=fq_codel # or is it an entry in proc?
>>>     else
>>>         sysctl -w something=sch_fq
>>>     fi
>>> fi
>>>
>>
>> Sure you can do all this from user space.
>> That's policy, and this should not belong to kernel.
>
> I tend to agree, except that until recently it was extremely hard to gather
> all the data needed to automagically come up with a more optimal
> queuing policy on a per machine basis.
>
> How do you tell if a box is a vm or container or raw hardware??
> Look for inadequate randomness?

Furthermore, how can you tell if a box is hosting vms and should use
fq_codel underneath rather than sch_fq?

>
> One of my concerns is that sch_fq is (?) currently inadequately
> explored in the case of wifi - as best as I recall there were a
> ton of drivers that cloned and disappeared the skb deep
> in driver buffers, making fq_codel a mildly better choice...
>
> and for a system with both wifi and lan, what then? There's only
> the global default....
>
>> sysctl -w net.core.default_qdisc=fq
>>
>> # force a load/delete to bring default qdisc for all devices already up
>> for ETH in `list of network devices (excluding virtual devices)`
>> do
>>     tc qdisc add dev $ETH root pfifo 2>/dev/null
>>     tc qdisc del dev $ETH root 2>/dev/null
>> done
>>
>>> How early in boot would this have to be to take effect?
>>
>> It doesn't matter, if you force a load/unload of the qdisc.
>
> well, I'd argue for it in the if.preup stage as being a decent
> place, or early after sysctl is run but before the devices are
> initialized. Some data will be lost by this otherwise.
>
>>
>>>
>>> 2) In the case of a server machine providing vms, and meeting the
>>> above precondition(s),
>>> what would be a more right qdisc, sch_fq or sch_codel?
>>
>> sch_fq 'works' only for locally generated traffic, as we look at
>> skb->sk->sk_pacing_rate to read the per socket rate. No way a
>> hypervisor (or a router 2 hops away) can access the original socket
>> without hacks.
>>
>> If your linux vm needs TCP pacing, then it also needs the fq packet
>> scheduler in the vm.
>
> Got it.
>
> Most of my own interest on the pacing side is seeing the impact on
> slow start on slow egress links to the internet. I kind of hope it's
> groovy... (gotta go compile a 3.14 kernel now. wet paint)
>
>>
>>>
>>> 3) Containers?
>>>
>>> 4) The machine in the vm going through the virtual ethernet interface?
>>>
>>> (I don't understand to what extent tracking the exit of packets from tcp
>>> through the stack and vm happens - I imagine a TSO is preserved all the
>>> way through, and also imagine that tcp small queues doesn't survive
>>> transit through the vm, but I am known to have a fevered imagination.)
>>
>> Small Queues controls the host queues.
>>
>> Not the queues on external routers. Consider a hypervisor as a router.
>
> Got it. See above for question on how to determine that reliably?
>
>>
>>>
>>>
>>> > Another issue is TCP CUBIC Hystart 'ACK TRAIN' detection that triggers
>>> > early, since goal of TSO autosizing + FQ/pacing is to get ACK clocking
>>> > every ms. By design, it tends to get ACK trains, way before the cwnd
>>> > might reach BDP.
>>>
>>> Fascinating! Push on one thing, break another. As best I recall hystart
>>> had a string of issues like this in its early deployment.
>>>
>>> /me looks forward to one day escaping 3.10-land and observing this for
>>> himself
>>>
>>> so some sort of bidirectional awareness of the underlying qdisc would be
>>> needed to retune hystart properly.
>>>
>>> Is ms resolution the best possible at this point?
>>
>> Nope. Hystart ACK train detection is very lazy and current algo was kind
>> of a hack. If you use better resolution, then you have problems because
>> of ACK jitter in reverse path. Really, only looking at delay between 2
>> ACKS is not generic enough, we need something else, or just disable ACK
>> TRAIN detection, as it is not that useful. Delay detection is less
>> noisy.
>

One of my all time favorite commits to the kernel:

Bugfix of the day: Poring through what must have been an enormous data
set, looking for a cause for a problem that spiked 24 days after boot and
then went away 24 days later. Merely figuring that periodicity out must
have been a hacker high. Total size of the problem? A single bit.

commit cd6b423afd3c08b27e1fed52db828ade0addbc6b
Author: Eric Dumazet
Date:   Mon Aug 5 20:05:12 2013 -0700

    tcp: cubic: fix bug in bictcp_acked()

    While investigating about strange increase of retransmit rates
    on hosts ~24 days after boot, Van found hystart was disabled
    if ca->epoch_start was 0, as following condition is true
    when tcp_time_stamp high order bit is set.

    (s32)(tcp_time_stamp - ca->epoch_start) < HZ

    Quoting Van :

     At initialization & after every loss ca->epoch_start is set to zero so
     I believe that the above line will turn off hystart as soon as the 2^31
     bit is set in tcp_time_stamp & hystart will stay off for 24 days.
     I think we've observed that cubic's restart is too aggressive without
     hystart so this might account for the higher drop rate we observe.

> I enjoyed re-reading the hystart related papers this morning. The
> first 5 hits for it on google were
> what I'd remembered...
>
> ns2 does not support hystart. Apparently ns3 sort of supports it when
> run through the linux driver interface but not in pure simulation...

--
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html