From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-x231.google.com (mail-ob0-x231.google.com [IPv6:2607:f8b0:4003:c01::231]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by huchra.bufferbloat.net (Postfix) with ESMTPS id 02F7A21F2BD for ; Mon, 30 Mar 2015 12:28:46 -0700 (PDT) Received: by obbps3 with SMTP id ps3so22736823obb.3 for ; Mon, 30 Mar 2015 12:28:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=i2qW7AT6emAmtOvlr1MCSOvdVeTNFktdiiuAsTCiCdc=; b=DWCUCBSarLJu9bNnJerzpSTR7fC2Yt/LjBh4f7ACo3yUHnCEuTIaH0HXt6mNrZKLJJ YB8bC4z2xxkkLJJ4tRbV56ldynX91P7CA14NoDvql+OIA/MHluq+n/f0fubU4eMnGAaR PZCi4+74gIcEa90hhvVwF3Sgd6nW3KPBkG2XL2bnEF36qOsiuJ2WuxjRTaTbUuj0pIcs WYK0QDiMfGhAlXivJAP6jQxjH2xgR/LnUODPNoDrMn5asFsTTEOMa1S+n03MoIzAvhER BMFJ/pscQyOecNIM+OPJQtthDijv4XQ5+dVdlLcqb4N9ENRkdQZLH8v7R7AGEFnLdO3f iqfw== MIME-Version: 1.0 X-Received: by 10.182.230.132 with SMTP id sy4mr28731438obc.29.1427743725648; Mon, 30 Mar 2015 12:28:45 -0700 (PDT) Received: by 10.202.51.66 with HTTP; Mon, 30 Mar 2015 12:28:45 -0700 (PDT) In-Reply-To: References: <7081A75C-899A-4DB7-8D77-935A37B362D8@gmail.com> <5509957B.30600@pollere.com> <491C7497-BE3E-452B-A797-C7FC1102E7ED@gmail.com> <750FA673-E20D-4D48-9386-097D32CD31FB@gmail.com> <51665FD7-629A-4B8C-B258-2E9AF8E3B5D0@gmx.de> Date: Mon, 30 Mar 2015 12:28:45 -0700 Message-ID: From: Dave Taht To: Jonathan Morton Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: "codel@lists.bufferbloat.net" Subject: Re: [Codel] The next slice of cake X-BeenThere: codel@lists.bufferbloat.net X-Mailman-Version: 2.1.13 Precedence: list List-Id: CoDel AQM discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 30 Mar 2015 19:29:15 -0000 On Mon, Mar 30, 2015 at 10:28 AM, Jonathan Morton w= rote: > I made a lot of progress over the weekend, and got cake3 running on a sac= rificial box today. No crashes yet! Preliminary patches always appreciated. I plan to do a build thursday for a test cycle this weekend of various things. > The set-associative hashing is in, currently hard-coded to 8-way, which s= hould be more than adequate for typical home workloads (especially if BitTo= rrent is marked as background). I=E2=80=99m now adding statistics outputs = so that we can all see how well it works under real loads, distinguishing b= etween direct hits (the fast path, where the hash points directly to a corr= ectly allocated queue), indirect hits (where a search through the set locat= es an allocated queue), misses (where an empty queue was allocated) and col= lisions (where all the queues were already occupied by other flows). Note = that =E2=80=9Callocation=E2=80=9D in this case simply means updating the ta= g on the queue to match the packet, not allocating memory. These statistic= s should help with tuning the number of flow queues in each class. > > The new Diffserv logic is also in and apparently working well. I=E2=80= =99m able to observe both the priority-balanced and bandwidth-balanced beha= viours under appropriate circumstances. It is, however, fairly sensitive t= o the baseline quantum value from which the outer DRR weights are derived. = Setting it to the MTU lets the high-priority classes burst too much, tempo= rarily starving the lower classes. It seems to work a lot better using 256= as a baseline, but I hope the overhead of going around the DRR loop severa= l times per packet isn=E2=80=99t too high. It might be sane to vary the ba= seline quantum based on the bandwidth setting, but we can experiment with t= hat later. > >> We keep track of how many bytes of system memory are consumed by a packe= t in 'skb->truesize'. > > The buffer-occupancy queue limit is in, using the field quoted above. Pr= eviously, it would just count packets. As with previous versions of cake (= and fq_codel), it checks the limit immediately after enqueuing each packet,= and drops one packet from the head of the longest queue (by byte count) if= it=E2=80=99s violated. did this include the overload protection fallback to dropping rather than ecn-ing packets? > Calculating the appropriate size to use here was an interesting problem. = I start by estimating the BDP (from the shaper bandwidth and the Codel int= erval), quadrupling that and enforcing a lower limit of 64KB. If it=E2=80= =99s set to =E2=80=9Cunlimited bandwidth=E2=80=9D, I set the size to an arb= itrary 1MB. Then I look at the packet limit, multiply that by the MTU and = clamp the Hmm.. resulting size down to that - providing a straightforward mechanism to set a limit on memory usage. > > This all comes with a *fifth* almost-copy of codel.h, which incorporates = the more efficient dequeue callback from codel4.h, but puts back the separa= te treatment of interval and target. Cake auto-tunes both of them to relat= ed values, but the relationship isn=E2=80=99t as simple as scaling one from= the other, and I=E2=80=99m not comfortable with the mental contortions inv= olved in the new reasoning. k. will test. > I also added a lookup table for the first few inverse square roots, becau= se I found (using a spreadsheet) that these are very inaccurate if you only= do one Newton iteration on them - something like 1.0, 0.500, 0.563, 0.488 = instead of 1.0, 0.707, 0.577, 0.500. Nebulous problems with =E2=80=9Cinsuf= ficiently tight control=E2=80=9D might possibly be traced to this. With th= e first fifteen entries calculated up-front using four iterations and store= d, accuracy is vastly improved and, in practice, almost all square-root cal= culations will now reduce to a simple table lookup. The memory cost is tri= vial, and Newton is good enough for exceptionally large drop counts. We explored the sqrt cache idea in the very first implementations of codel which you can revisit in the earliest, exciting days of the existence of the codel mailing list. I am not allergic to trying it again. What we had found then was the count variable was increasing without bound (with the original published ns2 version), for which we substituted another modification that works well at low bandwidths but less well at higher ones. My sole mathematical contribution to the invsqrt approximation in linux codel was the discovery that newton's method had much larger errors than 2% when run in reverse (eg, when decreasing count by more than 1, as various forms of the algorithm have done while trying to come up with a better estimate for the the resumption portion of the algorithm) When using alternate forms of recalculating count, I have usually run newton's twice to smooth out that error somewhat, but I am open to other alternatives. One thing I would like us to be looking at much harder this time, instead of the latency of the fq'd measurement flows, is the actual size of cwnd and other tcp behaviors from the packet captures, against modern tcps. Us perpetually publishing just the fq results in rrul is misleading some people I think. It would be nice to be sampling these across the course of a run also from netperf-wrapper but I don't know how to do that without web10g. > Along with the hashing statistics, I=E2=80=99m also adding three measures= of latency to replace the one from the first version. This no longer reli= es on data provided by Codel (so there=E2=80=99s no space penalty per flow = for collecting it, and the inner loop in the stats dump is also eliminated)= , but performs the necessary calculation just after dequeuing each packet, = storing the aggregated values per class instead of per flow. Each measure = is an EWMA (using optimised weights which reduce to shifts and adds), but t= wo of them are biased - one towards high latencies (so it=E2=80=99ll mostly= track the fattest flows), and one towards low latencies (so it=E2=80=99ll = mostly track sparse flows). Nifty. Note that I tried hard in codel4 (I think it was) to reduce the size of the codel stats to a single 32byte cache-line on 32 bit arches - down from 48 or so bytes - and succeeded with one word to spare. I am not allergic to keeping lots of statistics but we are also aiming for speed on lame processors here.... > > Still lots of little things to do=E2=80=A6 > > - Jonathan Morton --=20 Dave T=C3=A4ht Let's make wifi fast, less jittery and reliable again! https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb