hardware hacking on fq_codel in FPGA form at 10GigE

Thu Dec 20 04:13:18 EST 2012

On Thu, Dec 20, 2012 at 3:17 AM, Hal Murray <hmurray at megapathdsl.net> wrote:
>
> If I was going to do something like that, I'd build a small/simple CPU and do
> the work in microcode.

There are two ppc 440 cpus already onboard the 10GigE device, I think.
It's a REALLY NICE fpga.

http://netfpga.org/10G_specs.html

http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf

If we really wanted to get a jump on the high end:

http://www.hitechglobal.com/boards/100gig.htm

>
>> implementing {n,e,s}fq_codel onboard looks very feasible
>
> How many lines of assembler code would it take?

I could do a dump of the current code into any given assembly
language. It's not a lot, but there are a lot of out of band
functions.

> How many registers do you need?  Do you need any memory other than queues?
> Maybe counters?

The total overhead for fq_codel is presently 1024*64 bytes for 1024
flows, and 4-8k of pointer overhead (32 or 64 bit). I would argue for
such a device to hash to 64k flows, or heck, higher. And the per-flow
overhead can be reduced a lot in a dedicated device.
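
For scale, here is the arithmetic behind those numbers as a rough sketch. It assumes the 64 bytes of per-flow state quoted above and exactly one 4- or 8-byte pointer per flow (my reading of the 4-8k figure); the function name is made up for illustration.

```python
# Back-of-envelope memory footprint for the per-flow state described above.
# Assumptions: 64 bytes of state per flow (from the text), and one pointer
# per flow (4 bytes on 32-bit, 8 bytes on 64-bit) for the pointer overhead.

FLOW_STATE_BYTES = 64


def fq_codel_footprint(flows, pointer_bytes):
    """Return (flow_state_bytes, pointer_bytes_total) for a given flow count."""
    return flows * FLOW_STATE_BYTES, flows * pointer_bytes


for flows in (1024, 65536):
    for ptr in (4, 8):
        state, ptrs = fq_codel_footprint(flows, ptr)
        print(f"{flows:6d} flows, {ptr * 8}-bit pointers: "
              f"{state // 1024} KiB state + {ptrs // 1024} KiB pointers")
```

At 64k flows with unreduced per-flow state that is about 4 MiB, which is part of why the on-board versus off-board split matters.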

What of that needs to be on-board the fpga and what can live off-board
is a fairly good question. The sfq/codel queue management stuff sits
nicely in parallel with getting the packets, so that's an obvious
second bus/cache arch...

>> The only thing that is seriously serial about fq_codel is shooting the
>> biggest flow when the queue limit is exceeded, and that could be made
>> embarrassingly parallel with enough gates. There are no doubt other tricky
>> issues.
>
> Would it be better to do the fq work in the main CPU and let the FPGA grab

Well there are a few things that would benefit from moving directly
into hardware - the 5 tuple hash, for example.
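
A minimal sketch of what that hash stage does, assuming IPv4 and using CRC32 purely as a stand-in (the kernel code uses a different hash function; CRC32 is just something that is cheap in gates). The function name, the `perturb` salt parameter, and the 1024-bucket default are illustrative assumptions.

```python
import ipaddress
import struct
import zlib


def flow_bucket(src_ip, dst_ip, proto, src_port, dst_port,
                buckets=1024, perturb=0):
    """Map an IPv4 5-tuple to a flow bucket index in [0, buckets).

    CRC32 is a stand-in hash for illustration, not the hash the Linux
    qdisc uses. `perturb` is a hypothetical salt mixed into the key.
    """
    key = struct.pack(
        "!4s4sBHHI",                      # network byte order, no padding
        ipaddress.IPv4Address(src_ip).packed,
        ipaddress.IPv4Address(dst_ip).packed,
        proto, src_port, dst_port, perturb,
    )
    return zlib.crc32(key) % buckets


# Same 5-tuple always lands in the same bucket.
b = flow_bucket("10.0.0.1", "10.0.0.2", 6, 12345, 80)
print(b, b == flow_bucket("10.0.0.1", "10.0.0.2", 6, 12345, 80))
```

In hardware this would be computed as the header streams in, so the bucket index is ready before the payload has finished arriving.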

> packets from some shared data structure in memory?

The problem that I would like to beat is that TSO/GSO seem to be
necessary on the host processor to reduce the interrupt count to
sanity at 10GigE. A goal here would be to allow for TSO generation
(and GRO receive) to hand off to the board, but for the board to
interleave and aqm packets from there to the wire. Rather than a tx
descriptor ring you'd have a tx descriptor list and tx completion ring
so that you could send streams out of order.
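
To make the descriptor list plus completion ring idea concrete, here is a toy model. Everything in it is an assumption for illustration: the `Board` class, one segment per flow per round (plain round robin rather than DRR plus codel), and an unbounded completion queue standing in for a ring.

```python
from collections import deque


class Board:
    """Toy model: host submits large (TSO-like) descriptors, the board
    interleaves their segments across flows and reports completions in
    whatever order the descriptors actually finish."""

    def __init__(self):
        self.flows = {}             # flow id -> deque of [desc_id, segments_left]
        self.completions = deque()  # stand-in for the tx completion ring
        self.wire = []              # (flow, desc_id) for each segment sent

    def submit(self, desc_id, flow, segments):
        # segments must be >= 1
        self.flows.setdefault(flow, deque()).append([desc_id, segments])

    def run(self):
        # Round robin: one segment per active flow per pass.
        while self.flows:
            for flow in list(self.flows):
                queue = self.flows[flow]
                desc = queue[0]
                self.wire.append((flow, desc[0]))
                desc[1] -= 1
                if desc[1] == 0:
                    queue.popleft()
                    self.completions.append(desc[0])
                if not queue:
                    del self.flows[flow]


board = Board()
board.submit(0, "A", 4)   # big burst on flow A
board.submit(1, "B", 1)   # single packet on flow B
board.submit(2, "A", 1)   # queued behind descriptor 0 on flow A
board.run()
print(list(board.completions))
```

Descriptor 1 completes before descriptor 0 even though it was submitted later, which is exactly what a conventional in-order tx descriptor ring cannot express.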

> Can you work out a
> memory structure that doesn't need locks?

The enqueue and dequeue algorithms are entirely decoupled, with the
exception of the error handling phase (out of queue space). One
thought would be to track packet count on enqueue (this is more
"sfq"-like than fq_codel-like), which still needs a tiny lock...
:grumble:
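
A rough sketch of that overflow path, with the "shoot the biggest flow" search written as a pairwise max reduction to show why it parallelizes: each level's comparisons are independent, so n flows need about log2(n) comparator stages in gates. The class and method names are hypothetical, packet sizes are assumed positive, and the shared `count` is the bit that still needs the tiny lock.

```python
from collections import deque


class FlowQueues:
    def __init__(self, nflows, limit):
        self.queues = [deque() for _ in range(nflows)]
        self.backlog = [0] * nflows  # bytes queued per flow
        self.limit = limit           # total packet limit across all flows
        self.count = 0               # shared state touched by both paths

    def fattest(self):
        """Index of the flow with the largest backlog, found by a
        pairwise (tree) reduction instead of a serial scan."""
        idx = list(range(len(self.backlog)))
        while len(idx) > 1:
            nxt = []
            for i in range(0, len(idx) - 1, 2):
                a, b = idx[i], idx[i + 1]
                nxt.append(a if self.backlog[a] >= self.backlog[b] else b)
            if len(idx) % 2:
                nxt.append(idx[-1])  # odd one out advances unchanged
            idx = nxt
        return idx[0]

    def enqueue(self, flow, size):
        """Queue a packet; return the victim flow if one was dropped."""
        self.queues[flow].append(size)
        self.backlog[flow] += size
        self.count += 1
        if self.count > self.limit:
            return self.drop()
        return None

    def drop(self):
        """Drop the head packet of the fattest flow."""
        victim = self.fattest()
        size = self.queues[victim].popleft()
        self.backlog[victim] -= size
        self.count -= 1
        return victim


q = FlowQueues(nflows=4, limit=3)
q.enqueue(0, 1500)
q.enqueue(1, 100)
q.enqueue(0, 1500)
print(q.enqueue(2, 100))  # over the limit: flow 0 is the fattest
```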

>
>
> --
> These are my opinions.  I hate spam.

-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html