[Bloat] [Cerowrt-devel] Fwd: Will Edwards to give Mill talk in Estonia on 12/10/2014

Fri Dec 5 15:15:46 EST 2014

On Fri, Dec 5, 2014 at 11:33 AM,  <Valdis.Kletnieks at vt.edu> wrote:
> On Fri, 05 Dec 2014 11:18:57 -0800, Dave Taht said:
>> The Mill is an extremely wide-issue VLIW design, able to issue 30+
>> MIMD operations per cycle.  The Mill is inherently a vector machine
>> and can vectorize and pipeline almost all loops in general purpose
>> code.
>
> The big question is whether we know more about writing compilers for VLIW
> machines than we did when the Itanium came out.  That was hard enough to
> get just 3 instructions packed per word (of course, the fact that it wasn't
> 3 generic instructions, but 2 of one flavor and 1 of another, didn't help).

Well, in this case half the instructions are one flavor and the other half another.

But it's the belt concept in the "mill" that is key. Basically, having
tons and tons of fixed addressable registers doesn't work well (as in the
Itanium, SPARC, and other arches) for a variety of reasons...

Taking a classic smaller register set, such as in the x86_64, and
trying to add all these superscalar and out-of-order features to it
has hit a brick wall

... and the best we see in ARMs and MIPS (with way more registers)
is typically two out-of-order ops, total.

Stack machines overly serialize operations and tend to bottleneck
on local cache (see the transputer T800 for the last decent example).

Aside from a bunch of genuinely weirder architectures (see for example
the Propeller, or Dave May's xcore stuff, or Parallella), the Mill's
"belt" idea - temporal register addressing - is the first new idea
I've seen in CPU design for a very, very long time. (Perhaps it
was tried in some other architecture?)
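
For anyone who hasn't run into "temporal addressing" before, here is a
rough toy sketch in Python. It is only a model of the general idea as
described above (results land on the front of a fixed-length queue, and
operands are named by how long ago they were produced rather than by a
fixed register number). It is not the Mill's actual instruction set: the
Belt class, its method names, and the belt length of 4 are all made up
for illustration.

```python
from collections import deque


class Belt:
    """Toy belt: operands are addressed by age, not by register name."""

    def __init__(self, length=4):
        # 'length' is an assumed, illustrative size, not a Mill parameter.
        self._slots = deque(maxlen=length)

    def drop(self, value):
        """Put a new result at the front; the oldest value falls off the end."""
        self._slots.appendleft(value)

    def get(self, position):
        """Read the value produced 'position' results ago (0 = most recent)."""
        return self._slots[position]

    def op(self, fn, *positions):
        """Fetch operands by belt position, apply fn, and drop the result."""
        result = fn(*(self.get(p) for p in positions))
        self.drop(result)
        return result

    def contents(self):
        """Snapshot of the belt, newest value first."""
        return list(self._slots)


def demo():
    belt = Belt(length=4)
    belt.drop(3)                        # belt: [3]
    belt.drop(4)                        # belt: [4, 3]
    belt.op(lambda a, b: a * b, 0, 1)   # 4 * 3 = 12  -> [12, 4, 3]
    belt.op(lambda a, b: a + b, 0, 2)   # 12 + 3 = 15 -> [15, 12, 4, 3]
    belt.drop(99)                       # belt is full, so 3 falls off
    return belt.contents()


if __name__ == "__main__":
    print(demo())
```

Note that no instruction ever names a destination: every result just
lands at position 0, and anything not consumed in time simply ages off
the far end. In this toy, that is where the bookkeeping goes instead of
register allocation.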

Even if the mill can't get to 32 ops/cycle generally (and some of those ops
are overhead in maintaining the belt, but not as much as you might
think), I do think it can get to quite a few, even in branchy code,
and the lower end versions of the arch are comparable in ops/cycle
to the best we can do today with computers running at much faster
basic clock rates.

and context switch/subroutine call overhead! 4 cycles. Wow. :)

I certainly have quibbles with the presos I've read so far, edge cases
like floating point ops, and other seemingly nice-to-have but not
critical to the core architecture feature(s)...

but I long for an FPGA version, at least, to play with. I've spent a lot
of time trying to come up with a microarchitecture that could do
fq_codel at 10GigE+ speeds (prototyping in the parallella's FPGA),
and kept dreaming of something like the "propeller" at a really
high clock rate...

... then I stumbled over this. Sure, it's years out, but, like wow.

Well worth an initial hour to read/think/watch about.

-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks