From: Aaron Wood
Date: Sat, 28 Mar 2020 15:46:49 -0700
To: Dave Taht
Cc: Toke Høiland-Jørgensen, bloat
Subject: Re: [Bloat] Still seeing bloat with a DOCSIS 3.1 modem

On Wed, Mar 25, 2020 at 12:18 PM Dave Taht <dave.taht@gmail.com> wrote:
> On Wed, Mar 25, 2020 at 8:58 AM Aaron Wood <woody77@gmail.com> wrote:
> >
> > One other thought I've had with this is that the apu2 is multi-core,
> > and the i210 is multi-queue.
> >
> > Cake/htb aren't, iirc, set up to run on multiple cores (as the rate
> > limiters then don't talk to each other).  But with the correct tuple
> > hashing in the i210, I _should_ be able to split things and do two
> > cores at 500Mbps each (with lots of compute left over).
> >
> > Obviously, that puts a limit on single-connection rates, but as the
> > number of connections climbs, they should more or less even out (I
> > remember Dave Taht showing the oddities that happen with, say, 4
> > streams and 2 cores, where it's common to end up with 3 streams on
> > the same core).  But assuming that the hashing function results in
> > even sharing of streams, it should be fairly balanced (after plotting
> > some binomial distributions with higher "n" values).  Still not
> > perfect, especially since streams aren't likely to all be elephants.

> We live with imperfect per-core TCP flow behavior already.

Do you think this idea would make it worse, or better?  (I couldn't tell
from your comment how, exactly, you meant that.)

OTOH, any gains I'd get over 500Mbps would just be gravy: my current
router can't do more than that downstream on a single core (what I get
now is ~400Mbps, and it isn't pretty).  So even if the sharing is
uneven, or all the fat streams land on one core (unlikely starting at
4+ streams), I think I'd still see overall gains (in my situation;
others might not).

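(That binomial hand-waving was, concretely, just something like this
quick sketch; the flow counts and the 75% threshold are made up for
illustration:)

    #!/usr/bin/env python3
    # Rough check of how evenly an RSS-style hash spreads n flows over
    # 2 cores, assuming each flow independently lands on either core
    # with probability 1/2.  Illustrative only.
    from math import comb

    def split_probs(n_flows):
        # P(exactly k of the n flows hash to core 0 of 2): Binomial(n, 1/2)
        return [comb(n_flows, k) / 2**n_flows for k in range(n_flows + 1)]

    for n in (4, 8, 16, 32):
        p = split_probs(n)
        all_one_core = p[0] + p[-1]
        lopsided = sum(p[k] for k in range(n + 1) if max(k, n - k) >= 0.75 * n)
        print(f"n={n:2d}  P(all on one core)={all_one_core:.4f}  "
              f"P(75%+ on one core)={lopsided:.4f}")

The all-on-one-core case dies off fast, but 75/25-ish splits are still
not rare until you get into a couple dozen flows, which is the "still
not perfect" part.
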
> What I wanted to happen was the "list" ingress improvement to become
> more generally available (I can't find the LWN link at the moment).
> It has.  I thought that then we could express a syntax of
> "tc qdisc add dev eth0 ingress cake-mq bandwidth whatever",
> and it would rock.
>
> I figured getting rid of the cost of the existing ifb and tc mirred,
> and having a fast path preserving each hardware queue, then using RCU
> to do a sloppy allocate/atomic lock for shaped bandwidth and merge
> every ms or so, might then be low-cost enough.  Certainly folding
> everything into a single queue has a cost!

Sharing the tracked state by each cake-mq "thread" and updating it
every so often?

Or doing the rate limiting on one core, and the fq'ing on another?  (I
don't think this is what you meant?)

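i.e., something like this toy userspace sketch of the "local budget,
reconcile against a shared pool every ms or so" reading?  (The names,
the numbers, and the dequeue()/transmit() hooks are all made up, and
this is not how cake/mq/htb actually implement anything.)

    import threading, time

    RATE = 1_000_000_000 // 8   # pretend 1 Gbit/s shaper, in bytes/sec
    MERGE_EVERY = 0.001         # reconcile with the shared pool every ~1 ms

    class SharedBudget:
        """One shared byte pool that all per-queue shapers draw from."""
        def __init__(self, rate, nqueues):
            self.lock = threading.Lock()
            self.rate, self.nqueues = rate, nqueues
            self.last = time.monotonic()
            self.pool = 0.0

        def grant(self):
            # Short critical section: accrue bytes since the last call,
            # cap the burst, and hand the caller a sloppy equal share.
            with self.lock:
                now = time.monotonic()
                self.pool = min(self.pool + (now - self.last) * self.rate,
                                self.rate * MERGE_EVERY * 4)
                self.last = now
                share = self.pool / self.nqueues
                self.pool -= share
                return share

    def queue_worker(qid, budget, dequeue, transmit):
        """Per-hardware-queue loop: spend a local budget, top it up ~1 ms."""
        local = 0.0
        while True:
            local += budget.grant()
            deadline = time.monotonic() + MERGE_EVERY
            while time.monotonic() < deadline:
                if local <= 0:
                    time.sleep(0.0001)   # out of bytes until the next top-up
                    continue
                pkt = dequeue(qid)       # this queue's own backlog only
                if pkt is None:
                    time.sleep(0.0001)
                    continue
                local -= len(pkt)        # deficit-style; may dip negative
                transmit(qid, pkt)

The appeal (as I understand it) is that the lock is only touched once
per queue per millisecond or so, not per packet, so each hardware queue
keeps its own fast path.
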
> I was (before money ran out) prototyping adding a shared shaper to mq
> at one point (no rcu, just ...).  There have been so many other things
> tossed around (bpf?).
>
> As for load balancing better, google "RSS++", if you must.

A few years ago (before my current job ate all my brain cycles), I was
toying around with taking the ideas from OpenFlow/OpenVSwitch and RSS
and using them for parallelizing tasks like this:

- have N worker threads (say N = real cores, or real cores - 1, or
  somesuch), each fed by RSS / RPS / multiqueue etc.
- have a single controller thread (like the OpenFlow "controller")

Each worker publishes state/observations to the controller, as well as
forwarding "decisions to make", while the controller publishes each
worker's operating parameters back to it individually.

The workers then just move packets as fast as they can, using their
simple rules, with no shared state between the workers and no need to
access global tables like connection tracking (e.g. NAT tables mapping
NAT'd tuples to LAN address tuples).

The controller deals with the decisions and the balancing of params
(such as dynamic configuration of the policer to keep things "fair").

I never got much farther than sketches on paper, and laying out how I'd
do it in a heavily multi-threaded userspace app (workers would use
select() to receive the control messages in-band, instead of needing to
do shared memory access).

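Roughly this kind of skeleton (hypothetical, never actually built; the
socketpair control channel, the JSON messages, and the rx_queue_poll()
hook are stand-ins for whatever the real thing would use):

    import json, select, socket, threading, time

    def worker(qid, ctrl, rx_queue_poll):
        """Owns all of its own state; control messages arrive in-band."""
        params = {"rate_share": 1.0}   # operating params, set by controller
        stats = {"packets": 0}
        last_report = time.monotonic()
        while True:
            ready, _, _ = select.select([ctrl], [], [], 0.001)
            if ctrl in ready:
                msg = json.loads(ctrl.recv(4096).decode())
                if msg.get("type") == "params":
                    params.update(msg["params"])   # e.g. new policer setting
            for pkt in rx_queue_poll(qid):         # this worker's queue only
                stats["packets"] += 1
                # ... forward pkt using this worker's simple local rules ...
            if time.monotonic() - last_report > 1.0:
                ctrl.send(json.dumps({"type": "stats", "qid": qid,
                                      **stats}).encode())
                last_report = time.monotonic()

    def controller(worker_socks):
        """Collects observations, publishes each worker's parameters."""
        while True:
            ready, _, _ = select.select(worker_socks, [], [], 0.1)
            for s in ready:
                report = json.loads(s.recv(4096).decode())
                # ... use report for global decisions: rebalance shares,
                # install NAT rules, retune the policer, etc. ...
            for s in worker_socks:
                s.send(json.dumps({"type": "params",
                                   "params": {"rate_share":
                                              1.0 / len(worker_socks)}}).encode())

    def main(nworkers=4, rx_queue_poll=lambda qid: []):
        ctrl_ends = []
        for qid in range(nworkers):
            a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
            ctrl_ends.append(a)
            threading.Thread(target=worker, args=(qid, b, rx_queue_poll),
                             daemon=True).start()
        controller(ctrl_ends)

    if __name__ == "__main__":
        main()

The only shared thing is the control channel; the per-packet path never
touches a lock or a global table.
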

I was also hoping that it would generalize to the hardware packet
accelerators, but I think to really take advantage of that, they would
need to be able to implement.

And, I never seem to have the time to try to stand up a rough framework
for this, to try it out...