General list for discussing Bufferbloat
 help / color / mirror / Atom feed
* [Bloat] Hardware upticks
@ 2016-01-05  6:37 Jonathan Morton
  2016-01-05 17:42 ` Aaron Wood
  0 siblings, 1 reply; 19+ messages in thread
From: Jonathan Morton @ 2016-01-05  6:37 UTC (permalink / raw)
  To: bloat

This looks potentially interesting:  http://www.theregister.co.uk/2016/01/05/broadcom_pimps_iot_router_chip/

Even if that particular device turns out to be hard to work with in an open-source manner, it looks like hardware in general might be about to improve.

 - Jonathan Morton


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05  6:37 [Bloat] Hardware upticks Jonathan Morton
@ 2016-01-05 17:42 ` Aaron Wood
  2016-01-05 18:27   ` Jonathan Morton
  0 siblings, 1 reply; 19+ messages in thread
From: Aaron Wood @ 2016-01-05 17:42 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: bloat

[-- Attachment #1: Type: text/plain, Size: 450 bytes --]

'5Gbps system throughput "without taxing the CPU,"'

Lots of offloads?

On Mon, Jan 4, 2016 at 10:37 PM, Jonathan Morton <chromatix99@gmail.com>
wrote:

> This looks potentially interesting:
> http://www.theregister.co.uk/2016/01/05/broadcom_pimps_iot_router_chip/
>
> Even if that particular device turns out to be hard to work with in an
> open-source manner, it looks like hardware in general might be about to
> improve.
>
>  - Jonathan Morton
>

[-- Attachment #2: Type: text/html, Size: 1281 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 17:42 ` Aaron Wood
@ 2016-01-05 18:27   ` Jonathan Morton
  2016-01-05 18:57     ` Dave Täht
  0 siblings, 1 reply; 19+ messages in thread
From: Jonathan Morton @ 2016-01-05 18:27 UTC (permalink / raw)
  To: Aaron Wood; +Cc: bloat

[-- Attachment #1: Type: text/plain, Size: 104 bytes --]

Undoubtedly.  But that beefy quad-core CPU should be able to handle it
without them.

- Jonathan Morton

[-- Attachment #2: Type: text/html, Size: 147 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 18:27   ` Jonathan Morton
@ 2016-01-05 18:57     ` Dave Täht
  2016-01-05 19:29       ` Steinar H. Gunderson
  2016-01-05 21:36       ` Benjamin Cronce
  0 siblings, 2 replies; 19+ messages in thread
From: Dave Täht @ 2016-01-05 18:57 UTC (permalink / raw)
  To: bloat



On 1/5/16 10:27 AM, Jonathan Morton wrote:
> Undoubtedly.  But that beefy quad-core CPU should be able to handle it
> without them.

Sigh. It's not just the CPU that matters: context switch time, memory
bus and I/O bus architecture, the intelligence (or lack thereof) of
the network interface, and so on all play a part.

To give a real-world case of stupidity in a hardware design: the Armada
385 in the Linksys platform connects tx and rx packet-related interrupts
to a single interrupt line, requiring that tx and rx ring buffer cleanup
(in particular) be executed on a single cpu, *at the same time, in a
dedicated thread*.

Saving a single pin (which doesn't even exist off-chip) serializes
tx and rx processing. DUMB. (I otherwise quite like much of the Marvell
ethernet design and am looking forward to the Turris Omnia very much.)

...

Context switch time is probably one of the biggest hidden nightmares in
modern OoO CPU architectures - they only go fast in a straight line. I'd
love to see a 1 GHz processor that could context switch in 5 cycles.

Having 4 cores responding to interrupts masks this latency somewhat
when multiple interrupt sources are contending... (but see above -
you need dedicated interrupt lines per major source of interrupts for
it to work)

and the inherent context switch latency is still always there. (see
Cheshire's rant)

That general-purpose "mainstream" processors no longer handle
interrupts well is one of the market drivers towards specialized
co-processors.

...

Knowing Broadcom, there are probably so many invasive offloads, bugs
and errata in this new chip that 90% of the features will never be
used. But as for "forwarding un-inspected, un-firewalled packets in
massive bulk so as to win a benchmark race" - ok, I'm happy they are
trying.

Maybe they'll even publish a data sheet worth reading.

> 
> - Jonathan Morton
> 
> 
> 
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 18:57     ` Dave Täht
@ 2016-01-05 19:29       ` Steinar H. Gunderson
  2016-01-05 19:37         ` Dave Täht
  2016-01-05 21:36       ` Benjamin Cronce
  1 sibling, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2016-01-05 19:29 UTC (permalink / raw)
  To: bloat

On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:
> Context switch time is probably one of the biggest hidden nightmares in
> modern OOO cpu architectures - they only go fast in a straight line. I'd
> love to see a 1ghz processor that could context switch in 5 cycles.

It's called hyperthreading? ;-)

Anyway, the biggest cost of a context switch isn't necessarily the time used
to set up registers and such. It's increased L1 pressure; your CPU is now
running different code and looking at (largely) different data.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 19:29       ` Steinar H. Gunderson
@ 2016-01-05 19:37         ` Dave Täht
  2016-01-05 20:13           ` David Collier-Brown
                             ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Dave Täht @ 2016-01-05 19:37 UTC (permalink / raw)
  To: bloat



On 1/5/16 11:29 AM, Steinar H. Gunderson wrote:
> On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:
>> Context switch time is probably one of the biggest hidden nightmares in
>> modern OOO cpu architectures - they only go fast in a straight line. I'd
>> love to see a 1ghz processor that could context switch in 5 cycles.
> 
> It's called hyperthreading? ;-)
> 
> Anyway, the biggest cost of a context switch isn't necessarily the time used
> to set up registers and such. It's increased L1 pressure; your CPU is now
> running different code and looking at (largely) different data.

+10.

An L1/L2 icache dedicated to interrupt-processing code could make a
great deal of difference, if only CPU makers and benchmarkers would
make context-switch time something we valued.

Dcache, not so much - except for the Intel architectures, which are now
doing DMA direct to cache. (Are any ARMs doing that?)

> /* Steinar */
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 19:37         ` Dave Täht
@ 2016-01-05 20:13           ` David Collier-Brown
  2016-01-05 20:27           ` Stephen Hemminger
  2016-01-05 23:17           ` Steinar H. Gunderson
  2 siblings, 0 replies; 19+ messages in thread
From: David Collier-Brown @ 2016-01-05 20:13 UTC (permalink / raw)
  To: bloat

The SPARC T5 is surprisingly good here, with a very short path to
cache and a moderate number of threads with hot cache lines.  Cache
performance was one of the surprises when the slowish early T-machines
came out; it surprised a smarter colleague and me, who had apps
bottlenecking on cold cache lines on what were nominally much faster
processors.

I'd love to have a T5-1 on an experimenter board, or perhaps even in my
laptop (I used to own a SPARC laptop), but that's not where Snoracle is
going.

--dave

On 05/01/16 02:37 PM, Dave Täht wrote:
>
> On 1/5/16 11:29 AM, Steinar H. Gunderson wrote:
>> On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:
>>> Context switch time is probably one of the biggest hidden nightmares in
>>> modern OOO cpu architectures - they only go fast in a straight line. I'd
>>> love to see a 1ghz processor that could context switch in 5 cycles.
>> It's called hyperthreading? ;-)
>>
>> Anyway, the biggest cost of a context switch isn't necessarily the time used
>> to set up registers and such. It's increased L1 pressure; your CPU is now
>> running different code and looking at (largely) different data.
> +10.
>
> A L1/L2 Icache dedicated to interrupt processing code could make a great
> deal of difference, if only cpu makers and benchmarkers would make
> CS time something we valued.
>
> Dcache, not so much, except for the intel architectures which are now
> doing DMA direct to cache. (any arms doing that?)
>
>> /* Steinar */
>>
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat


-- 
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 19:37         ` Dave Täht
  2016-01-05 20:13           ` David Collier-Brown
@ 2016-01-05 20:27           ` Stephen Hemminger
  2016-01-05 21:10             ` Jonathan Morton
  2016-01-05 23:20             ` Steinar H. Gunderson
  2016-01-05 23:17           ` Steinar H. Gunderson
  2 siblings, 2 replies; 19+ messages in thread
From: Stephen Hemminger @ 2016-01-05 20:27 UTC (permalink / raw)
  To: Dave Täht; +Cc: bloat

On Tue, 5 Jan 2016 11:37:02 -0800
Dave Täht <dave@taht.net> wrote:

> 
> 
> On 1/5/16 11:29 AM, Steinar H. Gunderson wrote:
> > On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:
> >> Context switch time is probably one of the biggest hidden nightmares in
> >> modern OOO cpu architectures - they only go fast in a straight line. I'd
> >> love to see a 1ghz processor that could context switch in 5 cycles.
> > 
> > It's called hyperthreading? ;-)
> > 
> > Anyway, the biggest cost of a context switch isn't necessarily the time used
> > to set up registers and such. It's increased L1 pressure; your CPU is now
> > running different code and looking at (largely) different data.
> 
> +10.
> 
> A L1/L2 Icache dedicated to interrupt processing code could make a great
> deal of difference, if only cpu makers and benchmarkers would make
> CS time something we valued.
> 
> Dcache, not so much, except for the intel architectures which are now
> doing DMA direct to cache. (any arms doing that?)
> 
> > /* Steinar */
> > 
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat

Intel has some new Cache QoS stuff that allows configuring how much
cache is allowed per context.  But of course it is only on the
newest/latest/unobtainium.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 20:27           ` Stephen Hemminger
@ 2016-01-05 21:10             ` Jonathan Morton
  2016-01-05 23:20             ` Steinar H. Gunderson
  1 sibling, 0 replies; 19+ messages in thread
From: Jonathan Morton @ 2016-01-05 21:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: bloat, Dave Täht

[-- Attachment #1: Type: text/plain, Size: 649 bytes --]

Yes, Intel is the master of market segmentation here.  I don't believe for
a second that most of their best features just happen to have a high defect
rate that warrants setting the kill bit on all the cheaper badges slapped
on the common die.

A few years ago, I got a killer deal from AMD: the Phenom II X2 555 BE.
In the right motherboard, it would happily attempt to turn the two
missing cores back on, and if successful, you had a Phenom II X4 955
BE.  Mine was.  It's still a pretty nice beast - a shame it's locked
away in storage for the moment.

Intel doesn't allow such nice tricks.  They'd lose too much money from it.

- Jonathan Morton

[-- Attachment #2: Type: text/html, Size: 753 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 18:57     ` Dave Täht
  2016-01-05 19:29       ` Steinar H. Gunderson
@ 2016-01-05 21:36       ` Benjamin Cronce
  2016-01-06  0:01         ` Steinar H. Gunderson
  1 sibling, 1 reply; 19+ messages in thread
From: Benjamin Cronce @ 2016-01-05 21:36 UTC (permalink / raw)
  To: Dave Täht; +Cc: bloat

[-- Attachment #1: Type: text/plain, Size: 3342 bytes --]

On Tue, Jan 5, 2016 at 12:57 PM, Dave Täht <dave@taht.net> wrote:

>
>
> On 1/5/16 10:27 AM, Jonathan Morton wrote:
> > Undoubtedly.  But that beefy quad-core CPU should be able to handle it
> > without them.
>
> Sigh. It's not just the CPU that matters. Context switch time, memory
> bus and I/O bus architecture, the intelligence or lack thereof of the
> network interface, and so on.
>
> To give a real world case of stupidity in a hardware design - the armada
> 385 in the linksys platform connects tx and rx packet related interrupts
> to a single interrupt line, requiring that tx and rx ring buffer cleanup
> (in particular) be executed on a single cpu, *at the same time, in a
> dedicated thread*.
>
> Saving a single pin (which doesn't even exist off chip) serializes
> tx and rx processing. DUMB. (I otherwise quite like much of the marvel
> ethernet design and am looking forward to the turris omnia very much)
>
> ...
>
> Context switch time is probably one of the biggest hidden nightmares in
> modern OOO cpu architectures - they only go fast in a straight line. I'd
> love to see a 1ghz processor that could context switch in 5 cycles.
>

Seeing that most modern CPUs take thousands to tens of thousands of
cycles to switch, 5 is similar to saying "instantly". Some of that
overhead is shooting down the TLB, plus many layers of cache misses.
You can't have separate virtual memory spaces and not take some large
switching overhead without devoting a lot of transistors to massive
caches. And the larger the caches, the higher the latency.

Modern PC hardware can use soft interrupts to reduce hardware
interrupts and context switching. My Intel i350 issues a steady ~150
interrupts per second per core regardless of the network load, while
maintaining tens-of-microsecond ping times.

I'm not sure what they could do with custom architectures; there will
always be an issue with context-switching overhead, but they may be
able to cache a few specific contexts, knowing that the embedded system
will rarely have more than a few contexts doing the bulk of the work.


>
> Having 4 cores responding to interrupts masks this latency somewhat
> when having multiple sources of interrupt contending... (but see above -
> you need dedicated interrupt lines per major source of interrupts for
> it to work)
>
> and the inherent context switch latency is still always there. (sound
> cheshire's rant)
>
> The general purpose "mainstream" processors not handling interrupts well
> anymore is one of the market drivers towards specialized co-processors.
>
> ...
>
> Knowing broadcom, there's probably so many invasive offloads, bugs
> and errata in this new chip that 90% of the features will never be
> used. But "forwarding in-inspected, un-firewalled, packets in massive
> bulk so as to win a benchmark race", ok, happy they are trying.
>
> Maybe they'll even publish a data sheet worth reading.
>
> >
> > - Jonathan Morton
> >
> >
> >
> > _______________________________________________
> > Bloat mailing list
> > Bloat@lists.bufferbloat.net
> > https://lists.bufferbloat.net/listinfo/bloat
> >
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>

[-- Attachment #2: Type: text/html, Size: 4394 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 19:37         ` Dave Täht
  2016-01-05 20:13           ` David Collier-Brown
  2016-01-05 20:27           ` Stephen Hemminger
@ 2016-01-05 23:17           ` Steinar H. Gunderson
  2 siblings, 0 replies; 19+ messages in thread
From: Steinar H. Gunderson @ 2016-01-05 23:17 UTC (permalink / raw)
  To: bloat

On Tue, Jan 05, 2016 at 11:37:02AM -0800, Dave Täht wrote:
>> Anyway, the biggest cost of a context switch isn't necessarily the time used
>> to set up registers and such. It's increased L1 pressure; your CPU is now
>> running different code and looking at (largely) different data.
> A L1/L2 Icache dedicated to interrupt processing code could make a great
> deal of difference, if only cpu makers and benchmarkers would make
> CS time something we valued.
> 
> Dcache, not so much, except for the intel architectures which are now
> doing DMA direct to cache. (any arms doing that?)

Note that I'm saying L1 pressure, not “bad choice of what to keep in L1”.
If you dedicate L1i space to interrupt processing code (which includes,
presumably, large parts of your TCP/IP stack?), you will have less for
your normal userspace, and I'd like to see some very hard data on that being
a win before I'll believe it at face value.

In a sense, if you tie your interrupts to a dedicated core, you get exactly
this, though.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 20:27           ` Stephen Hemminger
  2016-01-05 21:10             ` Jonathan Morton
@ 2016-01-05 23:20             ` Steinar H. Gunderson
  1 sibling, 0 replies; 19+ messages in thread
From: Steinar H. Gunderson @ 2016-01-05 23:20 UTC (permalink / raw)
  To: bloat

On Tue, Jan 05, 2016 at 12:27:03PM -0800, Stephen Hemminger wrote:
> Intel has some new Cache QoS stuff that allows configuring how much
> cache is allowed per context.  But of course it is only on the newest/latest/unoptinium

Note that this is for improving fairness between applications, not total
throughput of the machine. It will gain you nothing if you run a single
server; what it gains you is that you can take your low-priority batch job
and run it next to your mission-critical web server, without having to worry
it will eat up all your L3.

Needless to say, it is for advanced users.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-05 21:36       ` Benjamin Cronce
@ 2016-01-06  0:01         ` Steinar H. Gunderson
  2016-01-06  0:06           ` Stephen Hemminger
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2016-01-06  0:01 UTC (permalink / raw)
  To: bloat

On Tue, Jan 05, 2016 at 03:36:10PM -0600, Benjamin Cronce wrote:
> You can't have different virtual memory space and not take some large
> switching overhead without devoting a lot of transistors to massive caches.
> And the larger the caches, the higher the latency.

I'm sure you already know this, but just to add to what you're saying:
Modern CPUs actually have cache-line tagging tricks so that they don't have
to blow the entire L1 just because you do a context switch. It would be too
expensive.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-06  0:01         ` Steinar H. Gunderson
@ 2016-01-06  0:06           ` Stephen Hemminger
  2016-01-06  0:22             ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2016-01-06  0:06 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: bloat

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

The expensive part is often having to save and restore all the state in
registers and other bits on context switch.


On Tue, Jan 5, 2016 at 4:01 PM, Steinar H. Gunderson <sgunderson@bigfoot.com
> wrote:

> On Tue, Jan 05, 2016 at 03:36:10PM -0600, Benjamin Cronce wrote:
> > You can't have different virtual memory space and not take some large
> > switching overhead without devoting a lot of transistors to massive
> caches.
> > And the larger the caches, the higher the latency.
>
> I'm sure you already know this, but just to add to what you're saying:
> Modern CPUs actually have cache-line tagging tricks so that they don't have
> to blow the entire L1 just because you do a context switch. It would be too
> expensive.
>
> /* Steinar */
> --
> Homepage: https://www.sesse.net/
> _______________________________________________
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>

[-- Attachment #2: Type: text/html, Size: 1631 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-06  0:06           ` Stephen Hemminger
@ 2016-01-06  0:22             ` Steinar H. Gunderson
  2016-01-06  0:53               ` Stephen Hemminger
  2016-01-06  6:18               ` Jonathan Morton
  0 siblings, 2 replies; 19+ messages in thread
From: Steinar H. Gunderson @ 2016-01-06  0:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: bloat

On Tue, Jan 05, 2016 at 04:06:03PM -0800, Stephen Hemminger wrote:
> The expensive part is often having to save and restore all the state in
> registers and other bits on context switch.

Are you sure? There's not really all that much state to save, and all I've
been taught before says the opposite.

Also, I've never ever seen the actual context switch turn up high in a perf
profile.  Is this because of some sampling artifact?

/* Steinar */
-- 
Homepage: https://www.sesse.net/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-06  0:22             ` Steinar H. Gunderson
@ 2016-01-06  0:53               ` Stephen Hemminger
  2016-01-06  0:55                 ` Steinar H. Gunderson
  2016-01-06  6:18               ` Jonathan Morton
  1 sibling, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2016-01-06  0:53 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: bloat

On Wed, 6 Jan 2016 01:22:13 +0100
"Steinar H. Gunderson" <sgunderson@bigfoot.com> wrote:

> On Tue, Jan 05, 2016 at 04:06:03PM -0800, Stephen Hemminger wrote:
> > The expensive part is often having to save and restore all the state in
> > registers and other bits on context switch.
> 
> Are you sure? There's not really all that much state to save, and all I've
> been taught before says the opposite.
> 
> Also, I've never ever seen the actual context switch turn up high in a perf
> profile.  Is this because of some sampling artifact?

Yes, especially with Intel processors getting more and more SSE/floating point
registers.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-06  0:53               ` Stephen Hemminger
@ 2016-01-06  0:55                 ` Steinar H. Gunderson
  2016-01-06  1:22                   ` Stephen Hemminger
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2016-01-06  0:55 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: bloat

On Tue, Jan 05, 2016 at 04:53:14PM -0800, Stephen Hemminger wrote:
>> Also, I've never ever seen the actual context switch turn up high in a perf
>> profile.  Is this because of some sampling artifact?
> Yes, especially with Intel processors getting more and more SSE/floating point
> registers.

But those are not saved on context switch to the kernel, no? (Thus the rule
of no floating-point in the kernel.) Only if you switch between userspace
processes.

/* Steinar */
-- 
Homepage: https://www.sesse.net/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-06  0:55                 ` Steinar H. Gunderson
@ 2016-01-06  1:22                   ` Stephen Hemminger
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2016-01-06  1:22 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: bloat

On Wed, 6 Jan 2016 01:55:37 +0100
"Steinar H. Gunderson" <sgunderson@bigfoot.com> wrote:

> On Tue, Jan 05, 2016 at 04:53:14PM -0800, Stephen Hemminger wrote:
> >> Also, I've never ever seen the actual context switch turn up high in a perf
> >> profile.  Is this because of some sampling artifact?
> > Yes, especially with Intel processors getting more and more SSE/floating point
> > registers.
> 
> But those are not saved on context switch to the kernel, no? (Thus the rule
> of no floating-point in the kernel.) Only if you switch between userspace
> processes

Right, that just punts the work to the kernel, which pays it when it
context-switches to the next process.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Bloat] Hardware upticks
  2016-01-06  0:22             ` Steinar H. Gunderson
  2016-01-06  0:53               ` Stephen Hemminger
@ 2016-01-06  6:18               ` Jonathan Morton
  1 sibling, 0 replies; 19+ messages in thread
From: Jonathan Morton @ 2016-01-06  6:18 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: Stephen Hemminger, bloat


> On 6 Jan, 2016, at 02:22, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> 
> On Tue, Jan 05, 2016 at 04:06:03PM -0800, Stephen Hemminger wrote:
>> The expensive part is often having to save and restore all the state in
>> registers and other bits on context switch.
> 
> Are you sure? There's not really all that much state to save, and all I've
> been taught before says the opposite.
> 
> Also, I've never ever seen the actual context switch turn up high in a perf
> profile.  Is this because of some sampling artifact?

ARM has dedicated register banks for several interrupt levels for exactly this reason.  Simple interrupt handlers can operate in these without spilling *any* userspace registers.  This gives ARM quite good interrupt latency, especially in the simpler implementations.

That doesn’t help for an actual context switch of course.  What does help is “lazy FPU state switching”, where on a context switch the FPU is simply marked as unavailable.  Only if/when the process attempts to *use* the FPU, this gets trapped and the trap handler restores the correct state before returning an enabled FPU to userspace.  The same goes for SIMD register banks, of course.

Lazy context switching is a kernel feature.  It’s used on all architectures that have a runtime disable-able FPU, AFAIK.  For a context switch to kernel and back to the same process, the FPU & SIMD are never actually switched, so there is almost no overhead.

 - Jonathan Morton


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-01-06  6:18 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-05  6:37 [Bloat] Hardware upticks Jonathan Morton
2016-01-05 17:42 ` Aaron Wood
2016-01-05 18:27   ` Jonathan Morton
2016-01-05 18:57     ` Dave Täht
2016-01-05 19:29       ` Steinar H. Gunderson
2016-01-05 19:37         ` Dave Täht
2016-01-05 20:13           ` David Collier-Brown
2016-01-05 20:27           ` Stephen Hemminger
2016-01-05 21:10             ` Jonathan Morton
2016-01-05 23:20             ` Steinar H. Gunderson
2016-01-05 23:17           ` Steinar H. Gunderson
2016-01-05 21:36       ` Benjamin Cronce
2016-01-06  0:01         ` Steinar H. Gunderson
2016-01-06  0:06           ` Stephen Hemminger
2016-01-06  0:22             ` Steinar H. Gunderson
2016-01-06  0:53               ` Stephen Hemminger
2016-01-06  0:55                 ` Steinar H. Gunderson
2016-01-06  1:22                   ` Stephen Hemminger
2016-01-06  6:18               ` Jonathan Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox