From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dave.taht@gmail.com>
Received: from mail-oi0-x22b.google.com (mail-oi0-x22b.google.com
	[IPv6:2607:f8b0:4003:c06::22b])
	(using TLSv1 with cipher RC4-SHA (128/128 bits))
	(Client CN "smtp.gmail.com",
	Issuer "Google Internet Authority G2" (verified OK))
	by huchra.bufferbloat.net (Postfix) with ESMTPS id 0F24521F565;
	Fri,  5 Dec 2014 12:15:47 -0800 (PST)
Received: by mail-oi0-f43.google.com with SMTP id a3so983650oib.30
	for <multiple recipients>; Fri, 05 Dec 2014 12:15:46 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	bh=WV3PkjOfvmgaFqowD38w8wk5z0gltCuINSCIaXZPvHA=;
	b=DSJMBBMg8+1MRAPdrvlHm6H62PY6kBqZXmJrw3n63/JHyAMEu7VtLKN9SnCl6JDhxZ
	7bO9VqlF8nWFRzHrEHQDmiWNanG61Oc3LqHqwzLzNhWB5JPZ1ENBLgU3gETR+9GwKTKG
	N71UVSGG+Fnab/9fOZcPHWO2lc/dJTbmSfoa6G5iyXj3fxeTZNh7VxIO5ldF9o3A7O8N
	+5+keU0P+2ZqII10rO2JJuDtWjl+QtQnyT7oaJbnRo4+TjDi7lMGJb6XgtUKmWEBWVTH
	geh6fMO1hAka5kBmlW2u8+vI6oYDsYoyAD8UIDRMIIqBesfWTD7WcmAmIDR50WSBhf35
	4ihA==
MIME-Version: 1.0
X-Received: by 10.182.241.133 with SMTP id wi5mr11693823obc.10.1417810546819; 
	Fri, 05 Dec 2014 12:15:46 -0800 (PST)
Received: by 10.202.227.77 with HTTP; Fri, 5 Dec 2014 12:15:46 -0800 (PST)
In-Reply-To: <14518.1417808038@turing-police.cc.vt.edu>
References: <daeaa4982a7a1c6b864c6b557d27ac1acc3.20141205184544@mail158.atl21.rsgsv.net>
	<CAA93jw7Bd+hvm88==0fcgeB-mS3TDfqrCr0baV2uy7B==3pS5w@mail.gmail.com>
	<14518.1417808038@turing-police.cc.vt.edu>
Date: Fri, 5 Dec 2014 12:15:46 -0800
Message-ID: <CAA93jw4zMJGmoARkOqvZ2eapU7f5OoaF4hiTDJ-AMmP5x3dMzA@mail.gmail.com>
From: Dave Taht <dave.taht@gmail.com>
To: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: "cerowrt-devel@lists.bufferbloat.net"
	<cerowrt-devel@lists.bufferbloat.net>, bloat <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] [Cerowrt-devel] Fwd: Will Edwards to give Mill talk in
	Estonia on 12/10/2014
X-BeenThere: bloat@lists.bufferbloat.net
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: General list for discussing Bufferbloat <bloat.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/bloat>
List-Post: <mailto:bloat@lists.bufferbloat.net>
List-Help: <mailto:bloat-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/bloat>,
	<mailto:bloat-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Fri, 05 Dec 2014 20:16:16 -0000

On Fri, Dec 5, 2014 at 11:33 AM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Fri, 05 Dec 2014 11:18:57 -0800, Dave Taht said:
>> The Mill is an extremely wide-issue VLIW design, able to issue 30+
>> MIMD operations per cycle.  The Mill is inherently a vector machine
>> and can vectorize and pipeline almost all loops in general purpose
>> code.
>
> The big question is whether we know more about writing compilers for VLIW
> machines than we did when the Itanium came out.  That was hard enough to
> get just 3 instructions packed per word (of course, the fact that it wasn't
> 3 generic instructions, but 2 of one flavor and 1 of another, didn't help).

Well, in this case half the instructions are one flavor and the other
half another.

But it's the belt concept in the "mill" that is key. Basically, having
tons and tons of fixed addressable registers doesn't work well (as in
the itanium, sparc, and other arches) for a variety of reasons...

Taking a classic smaller register set, such as in the x86_64, and
trying to add all these superscalar and out of order features to it
has hit a brick wall

... and the best we see in arms and mips (with way more registers) is
typically two out of order ops, total.

Stack machines overly serialize operations and tend to bottleneck
on local cache (see the transputer T800 for the last decent example).

Aside from a bunch of genuinely weirder architectures (see for example
the propeller, or David May's xcore stuff, or parallella), the mill's
"belt" idea - temporal register addressing - is the first new idea
I've seen in cpu design for a very, very long time. (Perhaps it
was tried in some other architecture?)
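
To make "temporal register addressing" a bit more concrete, here is a
toy model in Python. It is only a sketch under stated assumptions: the
class name Belt, its methods, and the belt length of 4 are all made up
for illustration, and none of this reflects the Mill's real instruction
encoding. The one idea it tries to show is that results land at the
front of a fixed-length queue and operands are named by how recently
they were produced, not by a fixed register number.

```python
from collections import deque


class Belt:
    """Toy belt: a fixed-length queue of recent results (hypothetical model)."""

    def __init__(self, length=4):
        # With maxlen set, appendleft() on a full deque discards the oldest
        # value from the far end, which mimics values falling off the belt.
        self.slots = deque(maxlen=length)

    def drop(self, value):
        # Every result is dropped at the front and becomes position 0.
        self.slots.appendleft(value)
        return value

    def get(self, pos):
        # Operands are addressed by age: 0 is the newest value on the belt.
        return self.slots[pos]

    def add(self, a, b):
        # Read two belt positions, then drop the sum at the front.
        return self.drop(self.get(a) + self.get(b))

    def mul(self, a, b):
        # Read two belt positions, then drop the product at the front.
        return self.drop(self.get(a) * self.get(b))


belt = Belt(length=4)
belt.drop(2)          # belt: [2]
belt.drop(3)          # belt: [3, 2]
belt.add(0, 1)        # 3 + 2 = 5,  belt: [5, 3, 2]
belt.mul(0, 1)        # 5 * 3 = 15, belt: [15, 5, 3, 2]
belt.drop(7)          # belt is full, so the oldest value (2) falls off
print(list(belt.slots))   # [7, 15, 5, 3]
```

Note that no operation ever names a destination: the result always goes
to the front, so the same position number refers to a different value
after every drop.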

Even if the mill can't get to 32 ops/cycle generally (and some of those ops
are overhead in maintaining the belt, but not as much as you might
think), I do think it can get to quite a few, even in branchy code,
and the lower end versions of the arch are comparable in ops/cycle
to the best we can do today with computers running at much faster
basic clock rates.

and context switch/subroutine call overhead! 4 cycles. Wow. :)

I certainly have quibbles with the presos I've read so far, edge cases
like floating point ops, and other seemingly nice-to-have but not
critical to the core architecture feature(s)...

but I long for an FPGA version, at least, to play with. I've spent a lot
of time trying to come up with a microarchitecture that could do
fq_codel at 10GigE+ speeds (prototyping in the parallella's FPGA),
and kept dreaming of something like the "propeller" at a really
high clock rate...

... then I stumbled over this. Sure, it's years out, but, like wow.

Well worth an initial hour to read/think/watch about.

-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks