From: Dave Taht
To: Hal Murray
Cc: codel@lists.bufferbloat.net, bloat-devel, cerowrt-devel@lists.bufferbloat.net
Date: Thu, 20 Dec 2012 04:13:18 -0500
Subject: Re: [Codel] hardware hacking on fq_codel in FPGA form at 10GigE
In-Reply-To: <20121220081737.1A681800037@ip-64-139-1-69.sjc.megapath.net>
References: <20121220081737.1A681800037@ip-64-139-1-69.sjc.megapath.net>
List-Id: CoDel AQM discussions

On Thu,
Dec 20, 2012 at 3:17 AM, Hal Murray wrote:
>
> If I was going to do something like that, I'd build a small/simple CPU and do
> the work in microcode.

There are two ppc 440 cpus already onboard the 10GigE device, I think.
It's a REALLY NICE fpga.

http://netfpga.org/10G_specs.html

http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf

If we really wanted to get a jump on the high end:

http://www.hitechglobal.com/boards/100gig.htm

>
>> implementing {n,e,s}fq_codel onboard looks very feasible
>
> How many lines of assembler code would it take?

I could do a dump of the current code into any given assembly
language. It's not a lot, but there are a lot of out of band
functions.

> How many registers do you need?  Do you need any memory other than queues?
> Maybe counters?

The total overhead for fq_codel is presently 1024*64 bytes for 1024
flows, and 4-8k of pointer overhead (32 or 64 bit). I would argue for
such a device to hash to 64k flows, or heck, higher. And the per-flow
overhead can be reduced a lot in a dedicated device.

What of that needs to be on-board the fpga and what can sit off-board
is a fairly good question. The sfq/codel queue management stuff sits
nicely in parallel with getting the packets, so that's an obvious
second bus/cache arch...

>> The only thing that is seriously serial about fq_codel is shooting the
>> biggest flow when the queue limit is exceeded, and that could be made
>> embarrassingly parallel with enough gates. There are no doubt other tricky
>> issues.
>
> Would it be better to do the fq work in the main CPU and let the FPGA grab

Well, there are a few things that would benefit from moving directly
into hardware - the 5 tuple hash, for example.

> packets from some shared data structure in memory?

The problem that I would like to beat is that TSO/GSO seem to be
necessary on the host processor to reduce the interrupt count to
sanity at 10GigE.
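
To put rough numbers on the flow-table sizes and the 5 tuple hash mentioned above, here is a minimal Python sketch. It is not the kernel code: `table_bytes` and `flow_bucket` are hypothetical names, CRC32 is only a stand-in for whatever hash the qdisc or the hardware would really use, the sketch handles IPv4 only, and "one pointer per flow" is an assumption chosen because it reproduces the 4-8k figure for 1024 flows.

```python
import struct
import zlib
from ipaddress import ip_address

PER_FLOW_BYTES = 64  # per-flow state size quoted in the message


def table_bytes(flows, pointer_bytes):
    """Per-flow state plus one pointer per flow (the pointer count is an assumption)."""
    return flows * PER_FLOW_BYTES + flows * pointer_bytes


def flow_bucket(src, dst, sport, dport, proto, flows=1024):
    """Map an IPv4 5-tuple to a flow bucket. CRC32 is a stand-in hash."""
    key = struct.pack(
        "!4s4sHHB",
        ip_address(src).packed,  # 4 bytes for an IPv4 address
        ip_address(dst).packed,
        sport,
        dport,
        proto,
    )
    return zlib.crc32(key) % flows


print(table_bytes(1024, 4))       # 1024 flows, 32-bit pointers
print(table_bytes(64 * 1024, 8))  # 64k flows, 64-bit pointers
print(flow_bucket("192.0.2.1", "198.51.100.7", 40000, 80, 6))
```

Under those assumptions, 1024 flows with 32-bit pointers come to 69632 bytes (68 KiB), and 64k flows with 64-bit pointers come to 4718592 bytes (4.5 MiB) before any per-flow reduction a dedicated device might make.
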
A goal here would be to allow for TSO generation (and GRO receive) to
hand off to the board, but for the board to interleave and aqm packets
from there to the wire. Rather than a tx descriptor ring you'd have a tx
descriptor list and a tx completion ring, so that you could send streams
out of order.

> Can you work out a
> memory structure that doesn't need locks?

The enqueue and dequeue algorithms are entirely decoupled, with the
exception of the error handling phase (out of queue space).

One thought would be to track packet count on enqueue (this is more
"sfq"-like than fq_codel-like), which still has a tiny lock... :grumble:

>
>
> --
> These are my opinions.  I hate spam.
>
>
>

-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
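
A minimal single-threaded sketch of the packet-count idea from the reply above, assuming per-flow counters that are bumped on enqueue and a linear scan for the biggest flow when the limit is exceeded. `FlowQueues` and its methods are hypothetical names, and dropping from the head of the fattest flow is a choice made for this illustration, not a claim about what the hardware would do.

```python
from collections import deque


class FlowQueues:
    """Toy per-flow queues sharing one packet limit (illustrative only)."""

    def __init__(self, flows=1024, limit=10240):
        self.queues = [deque() for _ in range(flows)]
        self.counts = [0] * flows  # per-flow packet counts
        self.total = 0             # shared state: the part that would need a lock
        self.limit = limit

    def enqueue(self, bucket, packet):
        """Queue a packet; return a dropped packet if the limit was exceeded."""
        self.queues[bucket].append(packet)
        self.counts[bucket] += 1
        self.total += 1
        if self.total > self.limit:
            return self.drop_from_fattest()
        return None

    def drop_from_fattest(self):
        # The serial step: scan every flow for the largest packet count.
        fattest = max(range(len(self.counts)), key=self.counts.__getitem__)
        dropped = self.queues[fattest].popleft()
        self.counts[fattest] -= 1
        self.total -= 1
        return dropped

    def dequeue(self, bucket):
        """Take the next packet from one flow, or None if it is empty."""
        if not self.queues[bucket]:
            return None
        self.counts[bucket] -= 1
        self.total -= 1
        return self.queues[bucket].popleft()
```

Enqueue and dequeue touch different ends of each flow's queue; the shared `total` and the scan in `drop_from_fattest` are where the two paths meet, which is the spot the "tiny lock" remark refers to.
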