From: Benjamin Cronce
Date: Mon, 6 Mar 2017 07:30:23 -0600
To: Dave Taht
Cc: Jonathan Morton, "cake@lists.bufferbloat.net", Eric Luehrsen
Subject: Re: [Cake] [LEDE-DEV] Cake SQM killing my DIR-860L - was: [17.01] Kernel: bump to 4.4.51

On Fri, Mar 3, 2017 at 12:21 AM, Dave Taht <dave.taht@gmail.com> wrote:
> As this is devolving into a cake-specific discussion, removing the
> lede mailing list.
>
> On Thu, Mar 2, 2017 at 9:49 PM, Jonathan Morton <chromatix99@gmail.com> wrote:
> >
> >> On 3 Mar, 2017, at 07:00, Eric Luehrsen <ericluehrsen@hotmail.com> wrote:
> >>
> >> That's not what I was going for. Agreed, it would not be good to depend
> >> on an inferior hash. You mentioned divide as a "cost", so I was
> >> proposing a thought around a "benefit" estimate. If hash collisions are
> >> not as important (or are they?), then what is "benefit / cost"?
> >
> > The computational cost of one divide is not the only consideration I
> > have in mind.
> >
> > Cake's set-associative hash is fundamentally predicated on the number
> > of hash buckets *not* being prime, as it requires further decomposing
> > the hash into a major and minor part when a collision is detected. The
> > minor part is then iterated to try to locate a matching or free bucket.
> >
> > This is considerably easier to do and reason about when everything is
> > a power of two. Then modulus is a masking operation and divide is a
> > shift, either of which can be done in one cycle flat.
> >
> > AFAIK, however, the main CPU cost of the hash function in Cake is not
> > the hash itself, but the packet dissection required to obtain the data
> > it operates on. This is something a profile would shed more light on.
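To make sure I follow the major/minor idea, here is roughly how I picture
it - a toy sketch with made-up constants and names, not Cake's actual code:

/*
 * Toy sketch of a set-associative flow-hash lookup with a power-of-two
 * bucket count.  Hash value 0 is reserved here to mean "free".
 */
#include <stdint.h>

#define NUM_BUCKETS 1024                     /* power of two */
#define SET_WAYS    8                        /* buckets probed per set */
#define NUM_SETS    (NUM_BUCKETS / SET_WAYS) /* also a power of two */

static uint32_t bucket_tag[NUM_BUCKETS];     /* stored flow hash, 0 = free */

/* Return a bucket index for 'hash', or -1 if the whole set is taken. */
static int set_assoc_lookup(uint32_t hash)
{
	uint32_t set   = hash & (NUM_SETS - 1);              /* major part: mask */
	uint32_t start = (hash / NUM_SETS) & (SET_WAYS - 1); /* minor part: shift+mask */
	uint32_t base  = set * SET_WAYS;

	for (uint32_t i = 0; i < SET_WAYS; i++) {
		uint32_t idx = base + ((start + i) & (SET_WAYS - 1));

		if (bucket_tag[idx] == hash)      /* same flow already here */
			return idx;
		if (bucket_tag[idx] == 0) {       /* free bucket, claim it */
			bucket_tag[idx] = hash;
			return idx;
		}
	}
	return -1;                                /* genuine collision, set full */
}

With both constants powers of two, the divide and the modulus compile down
to a shift and a mask, which I take to be the one-cycle property mentioned
above.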
> Tried. MIPS wasn't a good target.
>
> The jhash3 setup cost is bad, but I agree flow dissection can be deeply
> expensive, as can the other 42+ functions a packet needs to traverse to
> get from ingress to egress.
>
> But staying on hashing:
>
> One thing that landed in 4.10 or 4.11 was fq_codel relying on an
> skb->hash if one already existed (injected already by TCP, by hardware,
> or by the tunneling tool). We only need to compute a partial hash on the
> smaller subset of keys in that case (if we can rely on the skb->hash,
> which we cannot do in the NAT case).
>
> Another thing I did, long ago, was read the (60s-era!) literature about
> set-associative CPU cache architectures... and...
>
> In all of these cases I really, really wanted to just punt all this
> extra work to hardware on ingress - computing 3 hashes can easily be
> done in parallel there and appended to the packet as it completes.
>
> I have been working quite a bit more with the ARM architecture of late,
> and the "perf" profiler over there is vastly better than the MIPS one
> we've had.
>
> (and aarch64 is *nice*. So is NEON)
>
> - but I hadn't got around to dinking with cake there until yesterday.
>
> One thing I'm noticing is that even the gigE-capable ARMs have weak or
> non-existent L2 caches, and generally struggle to get past 700 Mbit/s
> bidirectionally on the network.
>
> Some quick tests of pfifo vs cake on the "lime-2" (armv7 dual core) are
> here:
>
> http://www.taht.net/~d/lime-2/
>
> The rrul tests were not particularly pleasing. [1]
>
> ...
>
> A second thing on my mind is to be able to take advantage of A) more
> cores
>
> ... and B) hardware that increasingly has 4 or more lanes in it.
>
> 1) Presently fq_codel's (and cake's) behavior there when set as the
> default qdisc is sub-optimal - if you have 64 hardware queues, you end
> up with 64 instances, each with 1024 queues. While this might be awesome
> from an FQ perspective, I really don't think the AQM will be as good. Or
> maybe it might be - what happens with 64000 queues at 100 Mbit?
>
> 2) It's currently impossible to shape network traffic across cores.
> I'd like to imagine that with a single atomic exchange or sloppily
> shared values, shaping would be feasible.

When you need to worry about multithreading, many times perfect is very
much the enemy of good. Depending on how quickly you need the network to
react, you could do something along the lines of a "shared pool" of
bandwidth. Each core gets a split of the bandwidth, any unused bandwidth
can be added to the pool, and cores that want more bandwidth can take
bandwidth from the pool.

You could treat it like task stealing, except each core can generate
tokens that represent a quantum of bandwidth that is only valid for some
interval. If a core suddenly needs bandwidth, it can attempt to "take
back" from its publicly shared pool. If other cores have already
borrowed, it can attempt to borrow from another core. If it can't find
any spare bandwidth, it just waits for some interval related to how long
a quantum is valid, and assumes it's safe.

Or something... I don't know, it's 7 am and I just woke up.
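To make that half-awake idea slightly more concrete, here is a very rough
user-space sketch using C11 atomics. All names and numbers are invented
for illustration, and it only shows the shared-pool half, not the
per-core stealing; a real implementation would also expire stale
donations every interval:

/*
 * Shared-pool bandwidth budget across cores, sketched with C11 atomics.
 * Each core keeps a private per-interval byte budget and donates what it
 * does not expect to use; any core may borrow from the pool with a
 * single atomic operation.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES      4
#define INTERVAL_BYTES (125000000 / 100)          /* ~1 Gbit/s in 10 ms slices */
#define CORE_SHARE     (INTERVAL_BYTES / NUM_CORES)

static _Atomic int64_t shared_pool;               /* donated, unclaimed bytes */
static int64_t         local_budget[NUM_CORES];   /* touched only by its core */

/* Start of each interval: keep what we expect to use, donate the rest. */
static void refill(int core, int64_t expected_bytes)
{
	int64_t spare = CORE_SHARE - expected_bytes;

	local_budget[core] = CORE_SHARE;
	if (spare > 0) {
		atomic_fetch_add(&shared_pool, spare);
		local_budget[core] -= spare;
	}
}

/* May this core transmit 'len' bytes right now? */
static bool may_send(int core, int64_t len)
{
	if (local_budget[core] >= len) {
		local_budget[core] -= len;
		return true;
	}
	/* One atomic op to borrow from the pool; sloppy but cheap. */
	if (atomic_fetch_sub(&shared_pool, len) >= len)
		return true;
	atomic_fetch_add(&shared_pool, len);      /* lost the race, undo */
	return false;                             /* wait for the next interval */
}

A losing core briefly drives the pool negative before undoing its borrow,
which can make a neighbour's borrow fail spuriously - that seems
acceptable in the "sloppily shared values" spirit, since the worst case
is waiting one interval.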
> (also softirq is a single thread, I believe)
>
> 3) mq and mqprio are commonly deployed on the high end for this.
>
> So I've thought about doing up another version - call it - I dunno -
> smq - "smart multi-queue" - and seeing how far we could get.
>
> >  - Jonathan Morton
>
> [1] If you are on this list and are not using flent, tough. I'm not
> going through the trouble of generating graphs myself anymore.
>
> --
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> http://blog.cerowrt.org