[Bloat] Questions About Switch Buffering

Dave Taht dave.taht at gmail.com
Mon Jun 13 00:34:06 EDT 2016


On Sun, Jun 12, 2016 at 6:07 PM, Benjamin Cronce <bcronce at gmail.com> wrote:
> I'm not arguing that an AQM isn't useful, just that you can't have your cake
> and eat it too. I wouldn't spend much time going down this path without first
> talking to someone with a strong background in ASICs for network switches
> and asking if it's even feasible. Everything (very little) I know about
> ASICs and buffers is that buffering is a very hard problem that eats up a lot
> of transistors, and more transistors mean slower ASICs. There is almost
> always a trade-off between performance and buffer size. CoDel and especially
> Cake sound like they would not be ASIC friendly.

For much of two years I shopped around trying to get a basic proof of
concept done in an FPGA, going so far as to help fund onenetswitch's
softswitch, and trying to line up a university or other lab to tackle
the logic (given my predilection for wanting things to be open
source). Several shops were very enthusiastic for a while... then went
dark.

...

Codel would be very ASIC friendly - all you need to do is prepend a
timestamp to the packet on input and transport it across the internal
switch fabric. Getting the timestamp on ingress is essentially free.
Making the drop or mark decision is also cheap. Looping, as codel can
do on overload, is less so, but it isn't strictly necessary either.
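To make the "cheap decision" point concrete, here is a minimal sketch (in
Python, purely illustrative - a real version would be a few comparators in
gates) of the core codel dequeue check. It is a simplification of the full
codel state machine: the control-law sqrt interval scaling is omitted, and
the constants and names are my own, not from any implementation.

```python
# Simplified sketch of the codel drop/mark decision: each packet carries
# its ingress timestamp, and on dequeue the sojourn time alone drives
# the choice. NOT the full codel state machine (no control-law scaling).
TARGET = 0.005    # 5 ms acceptable standing-queue delay
INTERVAL = 0.100  # 100 ms grace window before dropping begins

class CodelState:
    def __init__(self):
        self.first_above_time = 0.0  # when delay first stayed above TARGET
        self.drop_count = 0

def should_drop(state, ingress_ts, now):
    """Return True if the packet should be dropped/marked on dequeue."""
    sojourn = now - ingress_ts  # time spent queued: the only input needed
    if sojourn < TARGET:
        # Queue is draining fine; reset all state.
        state.first_above_time = 0.0
        state.drop_count = 0
        return False
    if state.first_above_time == 0.0:
        # Delay just crossed TARGET: start the grace interval.
        state.first_above_time = now + INTERVAL
        return False
    if now >= state.first_above_time:
        # Standing queue persisted a full interval: drop/mark.
        state.drop_count += 1
        return True
    return False
```

The hardware-friendly part is visible here: the only per-packet state is the
ingress timestamp, and the decision is two subtractions and two compares.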

As for the "fq" portion - proofs of concept already exist for DRR in
the netfpga.org project's verilog:

https://github.com/NetFPGA/netfpga/wiki/DRRNetFPGA

There, the DPI required to create the hash would have to end up at the
end of the packet, and thus get bypassed by cut-through switching
(when the output port was empty); you would need at least one packet
queued up at the destination to be able to slide it into the right
virtual queue.
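The hashing step itself is simple; a sketch (illustrative names, and a
queue count I picked arbitrarily - real hardware would use a CRC block on
the parsed header fields):

```python
# Hypothetical sketch of the flow-hashing step: hash the 5-tuple to pick
# one of N virtual queues, as an fq scheduler (DRR/QFQ) would in hardware.
import zlib

NUM_QUEUES = 1024  # illustrative; real designs size this to their SRAM

def flow_queue(src_ip, dst_ip, proto, src_port, dst_port):
    """Map a packet's 5-tuple onto a virtual-queue index."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    # CRC32 stands in for whatever hash block the ASIC provides.
    return zlib.crc32(key) % NUM_QUEUES
```

All packets of one flow land in the same virtual queue, which is what lets
the scheduler isolate flows from each other.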

The stumbling blocks were:

A) Most shops were only interested in producing the next 100 Gbps
or faster chip, rather than trying to address 10GigE and lower. They
were also intimidated by the big players in the market. Even the big
players are intimidated - Broadcom just exited the wifi biz (selling
off that business to Cypress) - as one example.

B) The potential revenue from producing a revolutionary smart "gbit"
switch chip, using technologies that could just forward packets dumbly
at 10 gbit, was a lot of NRE for chips that sell for well under a few
dollars. Everybody already has a gbit switch chip and ethernet chip
paid for....

C) The building blocks for packet processing in hardware are hard to
license together except for a very few players, and what's out there
in open source is basically limited to OpenCores' and NetFPGA's work.

I have no doubt that eventually someone will produce implementations
of the codel, fq_codel and/or cake algorithms in gates, though I lean
more towards something QFQ-derived than DRR-derived. The best I could
estimate was that a DRR version would need at least a 3-packet
"pipeline" to hash and route, though I thought some interesting things
could be done by also having multiple CAMs to route the different
flows and handle cache misses better.
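For reference, the DRR scheduling loop the NetFPGA project implements is
tiny; a sketch (constants and names mine, one quantum of one MTU per round,
which is a common but not universal choice):

```python
# Illustrative deficit round robin: each active queue banks one QUANTUM
# of byte credit per round and sends packets while credit remains.
from collections import deque

QUANTUM = 1514  # one MTU-sized frame of credit per round (a common choice)

class DRRQueue:
    def __init__(self):
        self.packets = deque()  # (length_in_bytes, payload) pairs
        self.deficit = 0

def drr_round(queues):
    """One pass of deficit round robin over the queues; returns payloads sent."""
    sent = []
    for q in queues:
        if not q.packets:
            q.deficit = 0  # idle queues must not bank credit
            continue
        q.deficit += QUANTUM
        while q.packets and q.packets[0][0] <= q.deficit:
            length, payload = q.packets.popleft()
            q.deficit -= length
            sent.append(payload)
    return sent
```

DRR's appeal in gates is that this is O(1) per packet; QFQ adds near-perfect
fairness guarantees at the cost of a more involved virtual-time computation.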

In the interim, I kind of expect stuff derived from QCA's or
Cavium's specialized co-processors to gain some of these ideas. There
are also hugely parallel network processors around the corner based on
ARM architectures in Cavium's and AMD's roadmaps. Cisco has some
insanely parallel processors in their designs.

More than fixing the native DC switch market (where everybody
generally overprovisions anyway) or improving consumer switches, I
felt that ISP edge devices (DSLAMs/cable) were where someone would
innovate with a hardware assist to meet their business models (
http://jvimal.github.io/senic/ ), and maybe we'd see some assistance
for inbound rate limiting also arrive in consumer hardware with
offloads - a more limited version of SENIC, perhaps.
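The kind of primitive such an offload would expose is probably no more than
a token bucket per subscriber; a hedged sketch (rates and names are mine,
not any vendor's API):

```python
# Plain token-bucket policer: the simplest inbound rate-limit primitive
# a hardware offload could expose. Purely illustrative.
class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0   # refill rate in bytes per second
        self.burst = burst_bytes     # maximum banked credit
        self.tokens = burst_bytes    # start full
        self.last = 0.0              # timestamp of last update

    def allow(self, pkt_len, now):
        """Admit the packet if enough byte credit has accumulated."""
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if pkt_len <= self.tokens:
            self.tokens -= pkt_len
            return True
        return False
```

A shaper with an AQM behind it (the SENIC approach) behaves much better than
a bare policer like this, which just drops bursts on the floor.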

But it takes a long time to develop hardware, chip designers are
scarce, and without clear market demand... I dunno, who knows? Perhaps
what the OCP folk are doing will start feeding back into actual chip
designs one day.

It was a nice diversion for me to play with the Chisel language and
the RISC-V and Mill CPUs, at least.

Anybody up for repurposing some machine learning chips? These look like fun:

https://www.engadget.com/2016/04/28/movidius-fathom-neural-compute-stick/

Oh, yeah, Mellanox's latest programmable ethernet devices look promising also.

http://www.mellanox.com/page/programmable_network_adapters

The ironic thing is that the biggest problem in 10GigE+ is on input,
not output. In fact, on much hardware, even at lower rates, we tend to
be dropping on input more than enough to balance out the potential
bufferbloat problems there. Moving the timestamp and hash, and having
parallel memory channels, is sort of happening on multiple newer chips
on the rx path...


> On Sun , Jun 12, 2016 at 5:01 PM, Jesper Louis Andersen
> <jesper.louis.andersen at gmail.com> wrote:
>>
>> This *is* commonly a problem. Look up "TCP incast".
>>
>> The scenario is exactly as you describe. A distributed database makes
>> queries over the same switch to K other nodes in order to verify the
>> integrity of the answer. Data is served from memory and thus access times
>> are roughly the same on all the K nodes. If the data response is sizable,
>> then the switch output port is overwhelmed with traffic, and it drops
>> packets. TCP's congestion control algorithm comes into play.
>>
>> It is almost like resonance in engineering. At the wrong "frequency", the
>> bridge/switch will resonate and make everything go haywire.
>>
>>
>> On Sun, Jun 12, 2016 at 11:24 PM, Steinar H. Gunderson
>> <sgunderson at bigfoot.com> wrote:
>>>
>>> On Sun, Jun 12, 2016 at 01:25:17PM -0500, Benjamin Cronce wrote:
>>> > Internal networks rarely have bandwidth issues and congestion only
>>> > happens
>>> > when you don't have enough bandwidth.
>>>
>>> I don't think this is true. You might not have an aggregate bandwidth
>>> issue,
>>> but given the burstiness of TCP and the typical switch buffer depth
>>> (64 frames is a typical number), it's very very easy to lose packets in
>>> your
>>> switch even on a relatively quiet network with no downconversion.
>>> (Witness
>>> the rise of DCTCP, made especially for internal traffic on this kind of
>>> network.)
>>>
>>> /* Steinar */
>>> --
>>> Homepage: https://www.sesse.net/
>>> _______________________________________________
>>> Bloat mailing list
>>> Bloat at lists.bufferbloat.net
>>> https://lists.bufferbloat.net/listinfo/bloat
>>
>>
>>
>>
>> --
>> J.
>>
>>
>
>
>



-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

