[Bloat] RED against bufferbloat

Dave Taht dave.taht at gmail.com
Thu Feb 26 12:04:52 EST 2015


On Thu, Feb 26, 2015 at 7:18 AM, MUSCARIELLO Luca IMT/OLN
<luca.muscariello at orange.com> wrote:
> On 02/26/2015 03:18 PM, Mikael Abrahamsson wrote:
>
> On Thu, 26 Feb 2015, MUSCARIELLO Luca IMT/OLN wrote:
>
> Done with the vendor itself, with the related NDA etc. It takes longer to
> set up the agreement than to code the system. The problem is that this
> process is not OK. An ISP cannot maintain someone else's product if it is
> closed.
>
>
> Do you have a requirement document that makes sense to the people
> programming these ASICs for vendors? When I try to explain what needs to be
> done I usually run into very frustrating discussions.
>
>
> I think there are people on this list who should be able to answer this
> question better than I can.
>
> AFAIK the process is complex because even vendors use network processors
> they don't build, and traffic management is developed by the chipco inside
> the chip. Especially for the segment we are considering here.
> In the end the dequeue process is always managed by someone else, and the
> mechanisms and their implementations are opaque.
> You can do testing on the equipment and do some reverse engineering. What a
> waste of time...
>
> This is why single queue AQM is preferred by vendors: because it does not
> affect current product lines and the enqueue is easier to code. FQ requires
> recoding the dequeue or shadowing the hardware dequeue.

OK, I need to dispel a few misconceptions.

First, everyone saying that fq_codel can't be done in hardware is
*wrong*. See my last point, far below - I know I write over-long
emails...

YES, at the lowest level of the hardware, where packets turn into
light or fuzzy electrons, or need to be tied up in an aggregate (as
in cable) and shipped onto the wire, you can't do it. BUT, as BQL has
shown, you CAN stick in just enough buffering on the final single
queue to *keep the device busy* - which on fiber might be a few
microseconds, on cable modems 2ms - and then do smarter things above
that portion of the device. I am perfectly willing to lose those 2ms
when you can cut off hundreds of milliseconds elsewhere.

Everybody saying that it can't be done doesn't understand how BQL works.
Demonstrably. On two+ dozen devices. Already. For 3 years now.
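
(For anyone who hasn't looked at how BQL behaves, here is a minimal,
self-contained sketch of the idea in C. It is *not* the kernel code - the
real thing lives in lib/dynamic_queue_limits.c plus the netdev_tx_*_queue()
driver hooks, and its limit adaptation is cleverer - the struct and function
names below are invented for illustration. The only point is that the driver
admits bytes to the hardware ring until a small, self-tuning byte limit is
hit, and releases them on TX completion, so everything above that tiny queue
is free to be smart.)

/* Hypothetical sketch of the BQL idea; names are invented, not the kernel API. */
#include <stdbool.h>
#include <stdint.h>

struct byte_queue_limit {
    uint32_t limit;      /* bytes allowed in flight to the hardware ring   */
    uint32_t inflight;   /* bytes handed to the ring, not yet completed    */
    bool     starved;    /* did the ring run dry while we were throttling? */
};

/* Transmit path: may we hand this packet to the hardware ring right now? */
static bool bql_may_send(struct byte_queue_limit *q, uint32_t pkt_bytes)
{
    return q->inflight + pkt_bytes <= q->limit;
}

static void bql_sent(struct byte_queue_limit *q, uint32_t pkt_bytes)
{
    q->inflight += pkt_bytes;
}

/* TX-completion interrupt: release completed bytes and adapt the limit -
 * grow it if the ring went empty while we were still holding traffic back
 * (we starved the device), otherwise probe gently downward. */
static void bql_completed(struct byte_queue_limit *q, uint32_t bytes, bool ring_empty)
{
    q->inflight -= bytes;
    if (ring_empty && q->starved)
        q->limit += bytes;        /* limit was too small: the device idled */
    else if (q->limit > bytes)
        q->limit -= bytes / 16;   /* otherwise shrink slowly               */
    q->starved = ring_empty;
}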

On the higher end than CPE, people that keep whinging about having
thousands of queues are not thinking about it correctly. This is
merely a change in memory access pattern, not anything physical at
all, and the overall reduction in queue lengths leads to much better
cache behavior. They should, um, at least try this stuff on a
modern Intel processor - and just go deploy that. 10GigE was totally
feasible on day one; ongoing work is getting up to 40GigE (but running
into trouble elsewhere on the rx path, which Jesper can speak to in
painful detail).

Again, if you do it right, on any architecture, all the smarts happen
long before you hit the wire. You are not - quite - dropping from the
head of the hardware queue, but we never have, and I really don't get
why people don't grok this.
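
(To make that concrete, here is a rough, hypothetical sketch - mine, not the
real fq_codel code, and with the control law heavily simplified - of where
the drop decision sits: CoDel timestamps each packet on enqueue and drops at
dequeue from the *software* queue, just before the packet is handed to the
few-packets-deep, BQL-limited ring. That is as close to the wire as the drop
ever needs to be.)

/* Illustrative sketch only: a CoDel-style dequeue sitting above a short,
 * BQL-limited device ring. Names and the control law are simplified;
 * real CoDel paces successive drops with an inverse-sqrt schedule. */
#include <stdint.h>
#include <stdlib.h>

struct pkt {
    uint64_t enqueue_ns;   /* timestamped when the packet entered the queue */
    struct pkt *next;
    /* payload omitted */
};

struct codel_q {
    struct pkt *head;
    uint64_t target_ns;    /* e.g. 5 ms: tolerable standing queue delay    */
    uint64_t interval_ns;  /* e.g. 100 ms: how long it must persist        */
    uint64_t first_above;  /* when sojourn time first exceeded the target  */
};

static struct pkt *q_pop(struct codel_q *q)
{
    struct pkt *p = q->head;
    if (p)
        q->head = p->next;
    return p;
}

/* Hand one packet to the device ring, shedding from the head of this
 * software queue when the standing delay has persisted for too long. */
struct pkt *codel_dequeue(struct codel_q *q, uint64_t now_ns)
{
    struct pkt *p = q_pop(q);
    if (!p)
        return NULL;

    uint64_t sojourn = now_ns - p->enqueue_ns;

    if (sojourn < q->target_ns) {        /* queue is draining fine        */
        q->first_above = 0;
        return p;
    }
    if (!q->first_above) {               /* start the grace interval      */
        q->first_above = now_ns;
        return p;
    }
    if (now_ns - q->first_above < q->interval_ns)
        return p;                        /* delay not yet persistent      */

    free(p);                             /* persistent: drop the head...  */
    return q_pop(q);                     /* ...and send the next packet   */
}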

EVERY benchmark I ever publish shows the intrinsic latency in the
system hovering around Xms due to the context switch overhead of
the processor and OS - and although I don't mind shaving that figure,
compared to the gains of seconds elsewhere that can come from using
these algorithms, I find getting those down more satisfying.
(Admittedly I have spent tons of time trying to shave off a few
hundred microseconds at that level too, as have Jonathon and many
others in the Linux community.)

I also don't give a hoot about core routers; I mostly care about the
edge. I *do care deeply* about FQ/AQMing interconnects between
providers, particularly in the event of a disaster like an earthquake
or tsunami, when suddenly 2/3 of the interconnects get drowned. What
will happen if an earthquake the same size as the one that hit Japan
hits California? It worries me. I live 500 meters from the
intersection of two major fault lines.

Secondly, I need to clarify a statement above:

"This is why single queue AQM is preferred by vendors *participating
publicly on the aqm mailing list*: because it does not affect current
product lines, and the enqueue is easier to code."

When fq_codel landed, the next morning, I said to myself, "OK, is it
time to walk down Sand Hill Road? We need this in switch chips, and
given the two monopolistic vendors left, it is ripe for disruption."

After wrestling with myself for a few weeks, I decided it would be
simpler and easier if I tried to persuade the chipmakers making packet
co-processing engines (like Octeon, Intel, Tilera) that this algorithm
and HTB-like rate control would be a valuable addition to their
products. Note - in *none* of their cases did it have to reduce to
gates: they have a specialized CPU co-processor that struggles to work
at line rate (with features like NAT offloads, etc.), with specialized
firmware that they write and hold proprietary, and there were *no*
hardware mods needed. Their struggle at line rate was not the point;
I wanted something that could work at an ISP's set rate, which is very
easy to do...
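
(The "ISP's set rate" part really is as easy as it sounds. The building
block is a token bucket - roughly what HTB assembles its class tree out of -
deciding when the next packet may leave, with the FQ/AQM machinery holding
the packets behind it. A hypothetical sketch, with invented names; overflow
handling and HTB's class hierarchy are omitted:)

/* Hypothetical token-bucket shaper: not HTB itself, just the core idea of
 * holding a queue to a configured rate in firmware or software. */
#include <stdbool.h>
#include <stdint.h>

struct shaper {
    uint64_t rate_bps;     /* the configured rate, e.g. the sold ISP speed */
    uint64_t burst_bytes;  /* how much credit may accumulate               */
    uint64_t tokens_bytes; /* current credit                               */
    uint64_t last_ns;      /* last refill time                             */
};

static void shaper_refill(struct shaper *s, uint64_t now_ns)
{
    uint64_t credit = (now_ns - s->last_ns) * s->rate_bps / 8 / 1000000000ull;
    if (!credit)
        return;            /* too little time has passed to credit a byte  */
    s->tokens_bytes += credit;
    if (s->tokens_bytes > s->burst_bytes)
        s->tokens_bytes = s->burst_bytes;
    s->last_ns = now_ns;
}

/* May this packet be released toward the device now?  If not, it stays
 * queued above, where fq_codel gets to schedule and drop intelligently. */
static bool shaper_admit(struct shaper *s, uint32_t pkt_bytes, uint64_t now_ns)
{
    shaper_refill(s, now_ns);
    if (s->tokens_bytes < pkt_bytes)
        return false;
    s->tokens_bytes -= pkt_bytes;
    return true;
}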

I talked to all these chipmakers (and a few more startup-like ones,
particularly in the high-speed trading market).

They told me there was no demand. So I went and talked to their customers...

And I am certain that more than one of the companies I have talked to
in the last 3 years is actually doing FQ now, and certain that codel
is also implemented - but I cannot reveal which ones, and for all I
know the ones that are not talking to me (anymore) are off doing it
themselves. And at least one of the companies doing it in their packet
co-processor was getting it wrong until I straightened 'em out, and
for all I know they didn't listen.

I figured whichever vendor shipped products first would have a market
advantage, and then everybody else would pile on; and that if I
focused on creating demand for the algorithm (as I did all over the
world - with ubnt in particular I went to the trouble of backporting
it to their EdgeRouter personally), demand would be created for better
firmware from the chipcos, and products would arrive.

And they have. Every 802.11ac router now has some sort of "better" QoS
system in it (and of course, OpenWrt and derivatives). There is a ton
of stuff in the pipeline.

The Streamboost folk were pretty effective in spreading their meme,
but I am mad at them about quite a few things in their implementation
and test regimes, so I'll save what they have done wrong for another
day, when I have more venom stored up and have acquired things I can
say publicly about their implementation via a bit more inspection of
their GPL drops and testing of the related products.

...

"FQ requires to recode the dequeue or to shadow the hardware dequeue."

Well this statement is not correct.
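
(What actually has to happen is much less dramatic than "recode the hardware
dequeue": the per-flow scheduling runs in the layer that *feeds* the short,
BQL-limited final queue, and the hardware keeps draining that one queue
exactly as it does today. A rough DRR-style sketch of that feeding step, with
invented names - flow classification, timestamps and the AQM are left out,
and real fq_codel keeps new/old flow lists rather than scanning an array:)

/* Illustrative DRR scheduler feeding an unchanged single hardware queue. */
#include <stdbool.h>
#include <stdint.h>

#define NQUEUES 64    /* number of flow queues                         */
#define QUANTUM 1514  /* one MTU-sized packet's worth of byte credit   */

struct pkt  { uint32_t len; struct pkt *next; };
struct flow { struct pkt *head; int32_t deficit; };

static struct flow flows[NQUEUES];

/* Pick the next packet to hand to the device ring. Nothing below this
 * point changes: the hardware still drains one FIFO. */
struct pkt *fq_pick(void)
{
    bool backlogged;
    do {
        backlogged = false;
        for (int i = 0; i < NQUEUES; i++) {
            struct flow *f = &flows[i];
            if (!f->head)
                continue;
            backlogged = true;
            if (f->deficit < (int32_t)f->head->len) {
                f->deficit += QUANTUM;        /* earn credit this round   */
                continue;
            }
            struct pkt *p = f->head;          /* spend the credit         */
            f->head = p->next;
            f->deficit -= (int32_t)p->len;
            if (!f->head)
                f->deficit = 0;               /* idle flows keep no credit */
            return p;
        }
    } while (backlogged);
    return NULL;                              /* all flow queues empty    */
}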

*Lastly*: IF you want to reduce things to gates, rather than use a
packet co-processor:

1) DRR in hardware is entirely doable. How do I know this? Because
it was done for the netfpga.org project *7* years ago. Here are the
project, paper, and *Verilog*:
https://github.com/NetFPGA/netfpga/wiki/DRRNetFPGA

It is a single define to synthesize a configurable number of queues,
and it worked on top of the GigE Xilinx Virtex-II Pro FPGA, which is
so low-end now that I am not even sure it is still made.
http://netfpga.org/2014/#/systems/4netfpga-1g/details/

They never got around to writing a five-tuple packet inspector/hasher
but that is straightforward.
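
(For what it's worth, the missing piece is small: hash the IP/port/protocol
five-tuple and mask it down to a queue index. A hypothetical C model of the
step that would be synthesized - the mixing constants are arbitrary, the
fields assume IPv4, and Linux's fq_codel leans on the kernel's existing flow
hashing rather than anything hand-rolled like this:)

/* Hypothetical five-tuple -> queue index hash; in hardware this is a few
 * XOR/multiply stages, modeled here in C. */
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

static uint32_t hash_queue(const struct five_tuple *t, uint32_t nqueues)
{
    /* Simple multiplicative mixing; any decent mixer spreads flows evenly. */
    uint32_t h = t->src_ip;
    h = h * 0x9e3779b1u ^ t->dst_ip;
    h = h * 0x9e3779b1u ^ ((uint32_t)t->src_port << 16 | t->dst_port);
    h = h * 0x9e3779b1u ^ t->proto;
    h ^= h >> 16;
    return h & (nqueues - 1);   /* nqueues assumed to be a power of two */
}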

2) Rate management in hardware is also entirely doable. Here are the
project, paper, and Verilog: https://github.com/gengyl08/SENIC

3) I long ago figured out how to make something fq_codel-like work (in
theory) in hardware with enough parallelization (and a bit of BQL).
The sticking points were a complete re-org of the Ethernet device and
device driver, and a whole lot of Xilinx IP I wanted to dump, and I am
really too busy to do any of the work, but:

Since I am fed up with the debate, I am backing this Kickstarter
project. I have had several discussions with the people doing it -
they are using all the same hardware I chose for my mental design -
and I urge others here to back it as well.

https://www.kickstarter.com/projects/onetswitch/onetswitch-open-source-hardware-for-networking

I am not entirely broke for a change, and plan to throw in 1k or so.
We need to find 25k from other people for them to make their initial
targets.

That board also meets all the needs for fixing wifi. They already have
shipping, more complex products that might be more right for the job,
as working out the number of gates needed is something that requires
actual work and simulation.

But I did like this:

https://github.com/MeshSr/wiki/wiki/ONetSwitch45

I will write a bit more about this (as negotiations continue) in a
shorter, more specialized mail in the coming weeks, and perhaps, as
so often happens around here (I am thinking of renaming this the
"bufferbloat/stone soup project"), some EEs will show up eager to do
something truly new and amazing as a summer project. If you have any
spare students, well, go to town.

I really, really like Chisel in particular,
https://chisel.eecs.berkeley.edu/ - and the OpenRISC folk could use a
better Ethernet device.

> My experience is not based on providing a requirements document - well, we
> tried that first - but on joint coding with the chipco, because you need to
> see a lot of chip internals.
>



-- 
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb


