From: Dave Taht
Date: Sun, 12 Jun 2016 21:34:06 -0700
To: Benjamin Cronce
Cc: Jesper Louis Andersen, bloat
Subject: Re: [Bloat] Questions About Switch Buffering

On Sun, Jun 12, 2016 at 6:07 PM, Benjamin Cronce wrote:
> I'm not arguing that an AQM isn't useful, just that you can't have your
> cake and eat it too. I wouldn't spend much time going down this path
> without first talking to someone with a strong background in ASICs for
> network switches and asking if it's even feasible. Everything (very
> little) I know about ASICs and buffers says that buffering is a very
> hard problem that eats up a lot of transistors, and more transistors
> mean slower ASICs. It's almost always a trade-off between performance
> and buffer size. CoDel and especially Cake sound like they would not be
> ASIC friendly.

For much of two years I shopped around trying to get a basic proof of
concept done in an FPGA, going so far as to help fund onenetswitch's
softswitch, and trying to line up a university or other lab to tackle
the logic (given my predilection for wanting things to be open source).
Several shops were very enthusiastic for a while... then went dark.

Codel would be very ASIC friendly - all you need to do is prepend a
timestamp to the packet on input and transport it across the internal
switch fabric. Getting the timestamp on ingress is essentially free, and
making the drop or mark decision at egress is also cheap. Looping, as
codel can do on overload, not so much - but that isn't really necessary
either.
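To make "cheap" concrete, here's roughly what that per-packet egress
decision looks like in C - a stripped-down sketch of the codel control
law (it leaves out the count-decay refinements the real Linux
implementation has), with invented names and a nanosecond clock assumed:

#include <stdint.h>
#include <stdbool.h>
#include <math.h>

/* Hypothetical per-queue codel state; the names are illustrative, not
 * from any real ASIC toolflow. All times are in nanoseconds. */
struct codel_state {
    uint64_t first_above_time; /* when sojourn time first exceeded target */
    uint64_t drop_next;        /* next scheduled drop while dropping */
    uint32_t count;            /* drops since entering the dropping state */
    bool     dropping;
};

#define CODEL_TARGET   5000000ULL   /* 5 ms */
#define CODEL_INTERVAL 100000000ULL /* 100 ms */

/* Control law: space drops interval/sqrt(count) apart. In gates the
 * sqrt becomes a small lookup table, not an FPU. */
static uint64_t control_law(uint64_t t, uint32_t count)
{
    return t + (uint64_t)(CODEL_INTERVAL / sqrt((double)count));
}

/* Called once per packet at egress. enqueue_ts is the timestamp the
 * ingress stage prepended to the packet; now is the egress clock.
 * Returns true if the packet should be dropped (or ECN-marked). */
bool codel_should_drop(struct codel_state *q, uint64_t enqueue_ts,
                       uint64_t now)
{
    uint64_t sojourn = now - enqueue_ts;

    if (sojourn < CODEL_TARGET) {
        /* Queue is draining fast enough; stand down. */
        q->first_above_time = 0;
        q->dropping = false;
        return false;
    }
    if (q->first_above_time == 0) {
        /* First packet over target: arm the interval timer. */
        q->first_above_time = now + CODEL_INTERVAL;
        return false;
    }
    if (!q->dropping) {
        if (now < q->first_above_time)
            return false;  /* not over target for a full interval yet */
        q->dropping = true;
        q->count = 1;
        q->drop_next = control_law(now, q->count);
        return true;
    }
    if (now >= q->drop_next) {
        q->count++;
        q->drop_next = control_law(q->drop_next, q->count);
        return true;
    }
    return false;
}

Everything there is a compare and an add per packet; the only awkward
piece is the inverse square root, and that reduces to a small lookup
table in silicon.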
As for the "fq" portion - a proof of concept for DRR already exists in
the netfpga.org project's verilog:

https://github.com/NetFPGA/netfpga/wiki/DRRNetFPGA

The catch is that the header inspection required to create the flow hash
only completes near the end of the packet, so it gets bypassed by
cut-through switching (when the output port is empty), and you need at
least one packet already queued up at the destination port to be able to
slide the next one into the right virtual queue.
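The DRR side itself is small enough to sketch in C as well - a toy model
of the scheduler in the NetFPGA work above (and of the scheduling half
of fq_codel), assuming the flow hash arrives from the ingress parse. All
the names and sizes here are invented for illustration:

#include <stdint.h>
#include <stddef.h>

#define NQUEUES 1024  /* virtual queues; power of two so masking works */
#define QUANTUM 1514  /* bytes of credit per round; about one full frame */

struct pkt {
    struct pkt *next;
    uint32_t    len;              /* bytes on the wire */
};

struct vqueue {
    struct pkt    *head, *tail;   /* per-flow FIFO */
    struct vqueue *rr_next;       /* link in the list of backlogged queues */
    int32_t        deficit;       /* DRR deficit counter, in bytes */
    int            backlogged;
};

static struct vqueue  queues[NQUEUES];
static struct vqueue *rr_head, *rr_tail;  /* FIFO of backlogged queues */

static void rr_push(struct vqueue *q)
{
    q->rr_next = NULL;
    if (rr_tail) rr_tail->rr_next = q; else rr_head = q;
    rr_tail = q;
}

static void rr_pop(void)
{
    rr_head = rr_head->rr_next;
    if (!rr_head) rr_tail = NULL;
}

/* Ingress: pick the virtual queue from a hash over the flow's 5-tuple.
 * In hardware that hash is exactly the parse that fights with
 * cut-through above; here it just arrives as an argument. */
void enqueue(uint32_t flow_hash, struct pkt *p)
{
    struct vqueue *q = &queues[flow_hash & (NQUEUES - 1)];

    p->next = NULL;
    if (q->tail) q->tail->next = p; else q->head = p;
    q->tail = p;

    if (!q->backlogged) {         /* flow just became active */
        q->backlogged = 1;
        q->deficit = QUANTUM;     /* one quantum of credit to start */
        rr_push(q);
    }
}

/* Egress: classic DRR. The queue at the head of the list sends while it
 * has credit; when the credit runs out it rotates to the back and earns
 * another quantum. */
struct pkt *dequeue(void)
{
    struct vqueue *q;

    while ((q = rr_head) != NULL) {  /* head queue always holds a packet */
        struct pkt *p = q->head;

        if (q->deficit >= (int32_t)p->len) {
            q->deficit -= p->len;
            q->head = p->next;
            if (!q->head) {          /* flow drained: leave the list */
                q->tail = NULL;
                rr_pop();
                q->backlogged = 0;
            }
            return p;                /* hand to the output port */
        }
        /* Out of credit: rotate to the back with one more quantum. */
        rr_pop();
        q->deficit += QUANTUM;
        rr_push(q);
    }
    return NULL;                     /* nothing backlogged */
}

With QUANTUM at one MTU the rotation happens at most once per queue per
round, so the work stays O(1) per packet - the genuinely hard part in
silicon is the per-flow state and the hash, not the arithmetic.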
The stumbling blocks were:

A) Most shops were only interested in producing the next 100Gbps or
faster chip, rather than trying to address 10GigE and lower. They were
also intimidated by the big players in the market - and even the big
players are intimidated; broadcom just exited the wifi biz (selling off
that business to cypress), as one example.

B) A revolutionary smart "gbit" switch chip - built on technology that
could just forward packets dumbly at 10gbit - was a lot of NRE for chips
that sell for well under a few dollars. Everybody already has a gbit
switch chip and ethernet chip that's long since paid for...

C) The building blocks for packet processing in hardware are hard to
license together except for a very few players, and what's out there in
open source is basically limited to opencores' and netfpga's work.

I have no doubt that eventually someone will produce implementations of
the codel, fq_codel, and/or cake algorithms in gates, tho I lean more
towards something qfq-derived than drr-derived. The best I could
estimate was that a DRR version would need at least a 3-packet
"pipeline" to hash and route, tho I thought some interesting things
could be done by also having multiple CAMs to route the different flows
and handle cache misses better.

In the interim, I kind of expect stuff derived from QCA's or cavium's
specialized co-processors to gain some of these ideas. There are also
hugely parallel network processors around the corner based on arm
architectures in cavium's and amd's roadmaps. Cisco has some insanely
parallel processors in their designs.

More than fixing the native DC switch market (where everybody generally
overprovisions anyway) or improving consumer switches, I'd felt that ISP
edge devices (dslams/cable) were where someone would innovate with a
hardware assist to meet their business models
( http://jvimal.github.io/senic/ ), and maybe we'd see some assistance
for inbound rate limiting also arrive in consumer hardware with
offloads - a more limited version of senic, perhaps.

But it takes a long time to develop hardware, chip designers are scarce,
and without clear market demand... I dunno, who knows? Perhaps what the
OCP folk are doing will start feeding back into actual chip designs one
day. It was a nice diversion for me to play with the chisel language and
the risc-v and mill cpus, at least.

Anybody up for repurposing some machine learning chips? These look like
fun:

https://www.engadget.com/2016/04/28/movidius-fathom-neural-compute-stick/

Oh, yeah, mellanox's latest programmable ethernet devices look promising
also:

http://www.mellanox.com/page/programmable_network_adapters

The ironic thing is that the biggest problems at 10GigE+ are on input,
not output. In fact, on much hardware, even at lower rates, we tend to
drop on input more than enough to balance out the potential bufferbloat
problems there. Computing the timestamp and hash up front and having
parallel memory channels is sort of happening on multiple newer chips on
the rx path...

> On Sun, Jun 12, 2016 at 5:01 PM, Jesper Louis Andersen wrote:
>>
>> This *is* commonly a problem. Look up "TCP incast".
>>
>> The scenario is exactly as you describe. A distributed database makes
>> queries over the same switch to K other nodes in order to verify the
>> integrity of the answer. Data is served from memory and thus access
>> times are roughly the same on all the K nodes. If the data response
>> is sizable, then the switch output port is overwhelmed with traffic,
>> and it drops packets. TCP's congestion algorithm gets into play.
>>
>> It is almost like resonance in engineering. At the wrong "frequency",
>> the bridge/switch will resonate and make everything go haywire.
>>
>> On Sun, Jun 12, 2016 at 11:24 PM, Steinar H. Gunderson wrote:
>>>
>>> On Sun, Jun 12, 2016 at 01:25:17PM -0500, Benjamin Cronce wrote:
>>> > Internal networks rarely have bandwidth issues and congestion only
>>> > happens when you don't have enough bandwidth.
>>>
>>> I don't think this is true. You might not have an aggregate
>>> bandwidth issue, but given the burstiness of TCP and the typical
>>> switch buffer depth (64 frames is a typical number), it's very, very
>>> easy to lose packets in your switch even on a relatively quiet
>>> network with no downconversion. (Witness the rise of DCTCP, made
>>> especially for internal traffic on this kind of network.)
>>>
>>> /* Steinar */
>>> --
>>> Homepage: https://www.sesse.net/
>>
>> --
>> J.

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org