Cake - FQ_codel the next generation
* [Cake] are anyone playing with dpdk and vpp?
@ 2016-04-27 18:57 Dave Taht
  2016-04-27 19:28 ` [Cake] [Bloat] " Aaron Wood
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Taht @ 2016-04-27 18:57 UTC
  To: bloat, cake

https://fd.io/technology seems to have come a long way.

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

* Re: [Cake] [Bloat] are anyone playing with dpdk and vpp?
  2016-04-27 18:57 [Cake] are anyone playing with dpdk and vpp? Dave Taht
@ 2016-04-27 19:28 ` Aaron Wood
  2016-04-27 19:32   ` Stephen Hemminger
  0 siblings, 1 reply; 7+ messages in thread
From: Aaron Wood @ 2016-04-27 19:28 UTC
  To: Dave Taht; +Cc: bloat, cake

I'm looking at DPDK for a project, but I think I can make substantial gains
with just AF_PACKET + FANOUT and SO_REUSEPORT.  It's not clear to me yet
how much DPDK is going to gain over those (and those can go a long way on
higher-powered platforms).
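
As a minimal sketch of the SO_REUSEPORT half of that idea (the AF_PACKET fanout half is not shown, since it needs root and a raw socket), the following opens several UDP sockets on one port so the kernel can spread incoming datagrams across them. The helper name `open_reuseport_udp`, the loopback address, and the worker count of four are illustrative assumptions, not anything from this thread; it assumes Linux 3.9 or later.

```python
import socket


def open_reuseport_udp(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Open a UDP socket that may share its port with other SO_REUSEPORT sockets."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # SO_REUSEPORT has to be set before bind(). On Linux the kernel then
    # distributes incoming datagrams across every socket bound this way.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    return s


# Port 0 lets the kernel pick a free port; the remaining sockets join it.
first = open_reuseport_udp(0)
port = first.getsockname()[1]

# Hypothetical layout: one socket per worker. In a real design each socket
# would be read by its own thread or process, ideally pinned to a core.
workers = [first] + [open_reuseport_udp(port) for _ in range(3)]
```

Each worker then calls `recvfrom()` on its own socket, so no lock is shared on the receive path.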

On lower-end systems, I'm more suspicious of the memory bus (and the cache
in particular) than I am of the raw CPU power.

-Aaron


* Re: [Cake] [Bloat] are anyone playing with dpdk and vpp?
  2016-04-27 19:28 ` [Cake] [Bloat] " Aaron Wood
@ 2016-04-27 19:32   ` Stephen Hemminger
  2016-04-27 19:37     ` Mikael Abrahamsson
  2016-04-27 19:45     ` Dave Taht
  0 siblings, 2 replies; 7+ messages in thread
From: Stephen Hemminger @ 2016-04-27 19:32 UTC
  To: Aaron Wood; +Cc: Dave Taht, cake, bloat

DPDK gets impressive performance on large systems (like 14M packets/sec per
core), but I'm not convinced on smaller systems.
Performance depends on having a good CPU cache. I get poor performance on
Atom etc.
Also driver support is limited (mostly 10G and above)
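
For scale, a figure around 14M packets/sec is close to 10GbE line rate with minimum-size frames. The sketch below works through that arithmetic, assuming standard Ethernet framing: a 64-byte frame plus 20 bytes of per-frame wire overhead (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap). The function name `max_pps` is hypothetical.

```python
# Per-frame bytes on the wire that are not part of the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def max_pps(link_bps: float, frame_bytes: int = 64) -> float:
    """Theoretical packets per second at line rate for a given frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


pps_10g = max_pps(10e9)              # 10GbE, 64-byte frames
ns_per_packet = 1e9 / pps_10g        # time budget per packet on one core

print(f"{pps_10g / 1e6:.2f} Mpps, {ns_per_packet:.1f} ns per packet")
# prints "14.88 Mpps, 67.2 ns per packet"
```

A single trip to RAM can cost a large share of that per-packet budget, which is one way to see why cache behaviour matters so much at these rates.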


* Re: [Cake] [Bloat] are anyone playing with dpdk and vpp?
  2016-04-27 19:32   ` Stephen Hemminger
@ 2016-04-27 19:37     ` Mikael Abrahamsson
  2016-04-27 19:45     ` Dave Taht
  1 sibling, 0 replies; 7+ messages in thread
From: Mikael Abrahamsson @ 2016-04-27 19:37 UTC
  To: Stephen Hemminger; +Cc: cake, bloat

On Wed, 27 Apr 2016, Stephen Hemminger wrote:

> DPDK gets impressive performance on large systems (like 14M packets/sec per
> core), but not convinced on smaller systems.
> Performance depends on having good CPU cache. I get poor performance on

As soon as you can't find information in cache and have to go to RAM to 
get it (and you need it to proceed), you've lost the impressive 
performance.

VPP is all about prefetching (telling the memory subsystem to fetch
information you will probably need in the near future into cache). It
actually reminds me of the demo programming on the C64/Amiga that I was
involved in during the '80s. Lots of small optimisations are needed to
yield these results.

So yes, cache is extremely important for VPP.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

* Re: [Cake] [Bloat] are anyone playing with dpdk and vpp?
  2016-04-27 19:32   ` Stephen Hemminger
  2016-04-27 19:37     ` Mikael Abrahamsson
@ 2016-04-27 19:45     ` Dave Taht
  2016-04-27 19:50       ` Dave Taht
  2016-04-27 22:40       ` Aaron Wood
  1 sibling, 2 replies; 7+ messages in thread
From: Dave Taht @ 2016-04-27 19:45 UTC
  To: Stephen Hemminger; +Cc: Aaron Wood, cake, bloat

On Wed, Apr 27, 2016 at 12:32 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> DPDK gets impressive performance on large systems (like 14M packets/sec per
> core), but not convinced on smaller systems.

My take on DPDK has been mostly that it's a great way to heat data
centers. Still, I would really like to see these advanced algorithms
(cake, pie, fq_codel, htb) tested on it at these higher speeds.

And I still have great hope for cheap, FPGA-assisted designs that
could one day be turned into ASICs, but not as much as I did last year
when I first started fiddling with the meshsr onenetswitch. I really
wish I could find a few good EEs to tackle making something
fq_codel-like work on the netfpga project; the proof-of-concept Verilog
already exists for DRR and AQM technologies.

> Performance depends on having good CPU cache. I get poor performance on Atom
> etc.

I had hoped that the Rangeley-class Atoms would do better on DPDK, as
they do I/O direct to cache. I am no longer sure which processors
actually have that feature.

> Also driver support is limited (mostly 10G and above)

Well, as we push end-user class devices to 1GigE, we are having issues
with overuse of offloads to get there, and in terms of PPS, pushing
small packets is certainly becoming a problem on both Ethernet and
wifi. I would like to see a 100 dollar router that could do full PPS
at that speed, feeding fiber and going over 802.11ac, and we are quite
far from there. I see, for example, that Meraki is using Click (I
think) to push more processing into userspace.

Also, the time for a packet to transit Linux from read to write is
"interesting". Last I looked it was something like 42 function calls
in the path to "get there", and some of my benchmarks on both the c2
and apu2 are showing that that time is significant enough for fq_codel
to start kicking in to compensate. (It is kind of cool to see the
packet processing adapt to the cpu load, actually, and I still long
for timestamping on rx directly to adapt ever better.)

I have also acquired a mild dislike for seeing stuff like this, where
the tx and rx rings are cleaned up in the same thread and there is
only one interrupt line for both:

  51:         18      59244     253350     314273   PCI-MSI 1572865-edge      enp3s0-TxRx-0
  52:          5     484274     141746     197260   PCI-MSI 1572866-edge      enp3s0-TxRx-1
  53:          9     152225      29943     436749   PCI-MSI 1572867-edge      enp3s0-TxRx-2
  54:         22      54327     299670     360356   PCI-MSI 1572868-edge      enp3s0-TxRx-3
  56:     525343     513165    2355680     525593   PCI-MSI 2097152-edge      ath10k_pci

And the ath10k uses only one interrupt. Maybe I'm wrong in my
assumptions, but I'd think that in today's multi-core environment
processing tx and rx separately might be a win. (?)
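
To make the distribution in that listing easier to read, here is a small sketch that parses rows in the /proc/interrupts layout and reports, for each interrupt source, the total count and which CPU handled the most. It assumes four CPU columns, as in the listing above; the names `parse_interrupts` and `SAMPLE` are hypothetical, and the sample rows are copied from the listing with the wrapped lines rejoined.

```python
SAMPLE = """\
  51:         18      59244     253350     314273   PCI-MSI 1572865-edge      enp3s0-TxRx-0
  52:          5     484274     141746     197260   PCI-MSI 1572866-edge      enp3s0-TxRx-1
  53:          9     152225      29943     436749   PCI-MSI 1572867-edge      enp3s0-TxRx-2
  54:         22      54327     299670     360356   PCI-MSI 1572868-edge      enp3s0-TxRx-3
  56:     525343     513165    2355680     525593   PCI-MSI 2097152-edge      ath10k_pci
"""


def parse_interrupts(text: str, ncpus: int) -> dict:
    """Map each interrupt name (last column) to its list of per-CPU counts."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with "<irq>:"; this skips blank and header lines.
        if not fields or not fields[0].endswith(":"):
            continue
        rows[fields[-1]] = [int(x) for x in fields[1:1 + ncpus]]
    return rows


rows = parse_interrupts(SAMPLE, ncpus=4)
for name, counts in rows.items():
    total = sum(counts)
    busiest = max(range(len(counts)), key=counts.__getitem__)
    share = counts[busiest] / total
    print(f"{name:16s} total={total:8d} busiest=CPU{busiest} ({share:.0%})")
```

On a live system the same function could be fed the contents of /proc/interrupts, with `ncpus` set to the number of CPU columns in its header line.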

I keep hoping for on-board assist for routing table lookups on
something - your classic CAM, for example. I saw today that there has
been some work on getting source-specific routing into DPDK, which
makes me happy -

https://www.ietf.org/proceedings/95/slides/slides-95-hackathon-18.pdf

which is, incidentally, where I found the reference to the vpp stuff.

https://www.ietf.org/blog/author/jari/





-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

* Re: [Cake] [Bloat] are anyone playing with dpdk and vpp?
  2016-04-27 19:45     ` Dave Taht
@ 2016-04-27 19:50       ` Dave Taht
  2016-04-27 22:40       ` Aaron Wood
  1 sibling, 0 replies; 7+ messages in thread
From: Dave Taht @ 2016-04-27 19:50 UTC
  To: Stephen Hemminger; +Cc: Aaron Wood, cake, bloat

Probably not really relevant to this thread, but this was a very good
article on scaling Linux to many cores:

https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/

I still like the idea of making single-threaded CPUs better, but only
the millcomputer even comes close to trying, effectively.





-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

* Re: [Cake] [Bloat] are anyone playing with dpdk and vpp?
  2016-04-27 19:45     ` Dave Taht
  2016-04-27 19:50       ` Dave Taht
@ 2016-04-27 22:40       ` Aaron Wood
  1 sibling, 0 replies; 7+ messages in thread
From: Aaron Wood @ 2016-04-27 22:40 UTC
  To: Dave Taht; +Cc: Stephen Hemminger, cake, bloat

> where the tx and rx rings are cleaned up in the same thread and there
> is only one interrupt line for both.
>
>   51:         18      59244     253350     314273   PCI-MSI 1572865-edge      enp3s0-TxRx-0
>   52:          5     484274     141746     197260   PCI-MSI 1572866-edge      enp3s0-TxRx-1
>   53:          9     152225      29943     436749   PCI-MSI 1572867-edge      enp3s0-TxRx-2
>   54:         22      54327     299670     360356   PCI-MSI 1572868-edge      enp3s0-TxRx-3
>   56:     525343     513165    2355680     525593   PCI-MSI 2097152-edge      ath10k_pci
>
> and the ath10k only uses one interrupt. Maybe I'm wrong on my
> assumptions, I'd think in today's multi-core environment that
> processing tx and rx separately might be a win. (?)
>

The TX interrupt is used to free the SKB after the DMA from memory to the
NIC, correct? (hard_start_xmit()?)

-Aaron
