Hub Execution Model Thread (split from blog)

Ramon · 2013-12-04 01:24

Bill Henning wrote: »

Chip, Go back to sleep, and re-read this thread when you are clear headed.

Go to sleep, but do not come back to this forum anymore, because it is getting crazy ... (We are all guilty here! ... me too).

Cut your ethernet cable, break your cell phone. Ask someone to change your email account password and forget it. No more interruptions to finish P2.

I recommend a book to all of you guys: "The Supermen: The Story of Seymour Cray and the Technical Wizards Behind the Supercomputer", by Charles J. Murray. In the 70s, Cray and his team went to a small town with limited phone calls and limited transportation to make their next project.

BTW, the last project of Seymour Cray was to make a GaAs processor just to reduce latency. I hope P2 has enough success to make P3 the first GaAs microcontroller (no 64 bits, no segments, no MMU, no Linux or any other legacy problem. Just a few GHz of pure raw performance!).

Heater. · 2013-12-04 01:37

Yes, but if that GaAs Propeller used a Four Dimensional Mobius Quantum Entangled Hypertransport Bus between the 32GByte HUB RAM and the 1024 COGs then they could all execute LMM code at 9GHz instead of just 8.

We have to tell Chip about this!

Ramon · 2013-12-04 01:51

Yes Heater, you got the point ! (Crazy, ... Crazy )

cgracey · 2013-12-04 02:14

Ramon wrote: »

Go to sleep, but do not come back to this forum anymore, because it is getting crazy ... (We are all guilty here! ... me too).

Cut your ethernet cable, break your cell phone. Ask someone to change your email account password and forget it. No more interruptions to finish P2.

I recommend a book to all of you guys: "The Supermen: The Story of Seymour Cray and the Technical Wizards Behind the Supercomputer", by Charles J. Murray. In the 70s, Cray and his team went to a small town with limited phone calls and limited transportation to make their next project.

BTW, the last project of Seymour Cray was to make a GaAs processor just to reduce latency. I hope P2 has enough success to make P3 the first GaAs microcontroller (no 64 bits, no segments, no MMU, no Linux or any other legacy problem. Just a few GHz of pure raw performance!).

It's a funny thing that bulk CMOS processes, due to relentless shrinkage, have practically eclipsed every exotic alternative process chemistry that's ever been developed. Even if the others are faster, they don't have the density, which translates to speed in another way. At least, this is my observation. If there were faster processes, Altera would be using them, but Intel, Altera, and everyone else is using CMOS for logic.

Many of you heard this, but when we were last synthesizing our Prop2 core, I asked the synthesis guy to run it in a 40nm process, just to see how fast it would go. It closed timing, no problem, at 1GHz, and the cell area was only 0.89 square mm. I think this current Prop2, with all we've added, would probably close timing in 28nm at nearly 2GHz. It's just a matter of a few $100k to prove it, and then about $3M to get it into production (the really expensive part).

ozpropdev · 2013-12-04 02:27

cgracey wrote: »

It's a funny thing that bulk CMOS processes, due to relentless shrinkage, have practically eclipsed every exotic alternative process chemistry that's ever been developed. Even if the others are faster, they don't have the density, which translates to speed in another way. At least, this is my observation. If there were faster processes, Altera would be using them, but Intel, Altera, and everyone else is using CMOS for logic.

Many of you heard this, but when we were last synthesizing our Prop2 core, I asked the synthesis guy to run it in a 40nm process, just to see how fast it would go. It closed timing, no problem, at 1GHz, and the cell area was only 0.89 square mm. I think this current Prop2, with all we've added, would probably close timing in 28nm at nearly 2GHz. It's just a matter of a few $100k to prove it, and then about $3M to get it into production (the really expensive part).

I remember reading somewhere that the Altera Stratix 10 uses Intel's 14nm technology.
I hate to think how much that FPGA demo board is!
IIRC the same article said something about Intel building a new 7nm fab facility!

cgracey · 2013-12-04 02:40

ozpropdev wrote: »

I remember reading somewhere that the Altera Stratix 10 uses Intel's 14nm technology.
I hate to think how much that FPGA demo board is!
IIRC the same article said something about Intel building a new 7nm fab facility!

I thought you were mistaken, at first, because I thought Stratix V was their fastest. They took a real leap in naming here. These are supposed to run designs at GHz speeds. I bet Prop2 would run at around 700MHz on this FPGA. They make it sound VERY intriguing:

ozpropdev · 2013-12-04 03:32

cgracey wrote: »

I thought you were mistaken, at first, because I thought Stratix V was their fastest. They took a real leap in naming here. These are supposed to run designs at GHz speeds. I bet Prop2 would run at around 700MHz on this FPGA. They make it sound VERY intriguing:

I wonder if we can get a box of SAMPLE dev boards out of them for evaluation purposes. LOL

ersmith · 2013-12-04 05:07

Bill Henning wrote: »

With respect, I disagree.

I agree that it is not an hour's work, but it would take far less time than CMM took for example.

Actually I don't think so. CMM was pretty straightforward (the CMM instruction set is a superset of the LMM one). Changing to an AUX stack is a much more difficult undertaking. As far as I can tell, by the time memory references get to the machine independent back-end they all look the same (with the special exception of push/pop; but those aren't used for variable access). If you know of a way to distinguish stack and other memory accesses in GCC please let us know, it would allow us to speed up XMM stack memory accesses even in P1.

If the stack model limited arguments + variables, the new indexed stack addressing mode can be used. It would be somewhat wasteful, as everything would have to be longs, but it would work.

Yes, but then it's not really C any more. Which is fair enough -- the AUX stack model could be useful for some languages, and certainly for PASM programmers. But for general purpose languages like C I don't think it will work.

ersmith · 2013-12-04 05:26

Bill Henning wrote: »

To help anyone who wants to try, here is how I would do it:

- limiting the total number of arguments and local variables to the index range from SPA is important
- use longs even for byte/word local variables
- do not store arrays on the AUX stack, use space on a hub stack
- either forbid taking the address of local variables, OR
- any local variables that have their address taken should live on a hub stack (easier fix of two), OR
- (non-trivial gcc change) add an 'aux' attribute to arguments and local variables that the code generator can use to emit SPA indexed references
- for debugging runs, use the WC/WZ flags I described to check for stack over/under flow

All those changes create a language that is no longer "C" or "C++". Even modifying Spin to have those restrictions seems like it would be a lot of work.

Note, in an absolute sense, it would not be "easy", however it would be far easier than a VLIW-style RDQUAD LMM would be.

I actually think the VLIW RDQUAD LMM would be easier. GCC already supports VLIW processors, and that support is pretty robust. There are quite a few VLIW processors on the market already, so there's a lot of experience with them. The GCC support for split address spaces is still very tentative and untried, and I don't know of any processors that use it (except possibly for the Cell SPE, but each one of those has 256KB rather than 2KB, so there's a lot more room for local variables).

Eric

David Betz · 2013-12-04 07:09

I've been thinking about the problem of loading 32 bit constants and issues with the pipeline and was wondering if we could free up enough opcode space for two new instructions like the current REPS instruction.

----		1111110 10 n nnnn nnnnnnnnn nnniiiiii		REPS	#1..$20000,#1..64

The idea would be to have an instruction to load the low 16 bits of a register and another to load the high 16 bits. These would take up the same amount of space as the current proposal to use a RDLONGC reg, PTRA+ instruction followed by the 32 bit constant but would not have any pipeline issues. I guess the trick would be to make space for a 9 bit destination field as well as a 16 bit immediate field for a total of 25 bits. The REPS instruction has 23 bits of arguments. I guess that would be enough if we're willing to limit the D field to only 7 bits so it could only address the first 128 COG locations.

Anyway, the instructions would be as follows:

    LDLO D, #$nnnn
    LDHI D, #$nnnn

Is there any way to fit these into the instruction set bit encodings?
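A minimal sketch of how the proposed LDLO/LDHI pair might behave, written in Python purely for illustration. The proposal above only says each instruction loads one 16 bit half of a register; the assumption here (not stated in the thread) is that each one leaves the other half untouched, so the pair can be issued in either order.

```python
def ldlo(reg, imm16):
    """Hypothetical LDLO: replace the low 16 bits, keep the high 16 bits."""
    return (reg & 0xFFFF0000) | (imm16 & 0xFFFF)


def ldhi(reg, imm16):
    """Hypothetical LDHI: replace the high 16 bits, keep the low 16 bits."""
    return (reg & 0x0000FFFF) | ((imm16 & 0xFFFF) << 16)


def load_const32(value):
    """Build a 32 bit constant with one LDLO followed by one LDHI."""
    reg = 0
    reg = ldlo(reg, value & 0xFFFF)
    reg = ldhi(reg, (value >> 16) & 0xFFFF)
    return reg
```

With the 23 argument bits mentioned above, 16 go to the immediate and 7 remain for D, which is where the limit of 128 COG locations comes from.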

Bill Henning · 2013-12-04 08:36

ersmith wrote: »

Actually I don't think so. CMM was pretty straightforward (the CMM instruction set is a superset of the LMM one).

Makes sense.

You have convinced me that CMM was easier to add to GCC than the AUX stack model.

ersmith wrote: »

Changing to an AUX stack is a much more difficult undertaking. As far as I can tell, by the time memory references get to the machine independent back-end they all look the same (with the special exception of push/pop; but those aren't used for variable access). If you know of a way to distinguish stack and other memory accesses in GCC please let us know, it would allow us to speed up XMM stack memory accesses even in P1.

Using the top two bits of a 32 bit internal address should work. I think you are using something similar already for cog vs. hub.

00 - cog address
01 - aux address
10 - hub address
11 - ext address

This should survive to the machine independent back end, and the code generator should be able to emit different code based on the top two bits of the address. It does limit the address space for hub / xmm to 1GB, but I don't think that is an issue for the P2.
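As a rough illustration of the tagging scheme, here is a small Python sketch that packs and unpacks the two tag bits. The function names are hypothetical, and nothing here reflects how GCC actually represents addresses; it only shows why 30 bits (1GB) remain for the offset.

```python
# Tag values in the order given above: 00 cog, 01 aux, 10 hub, 11 ext.
SPACES = ("cog", "aux", "hub", "ext")
OFFSET_BITS = 30
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def tag_address(space, offset):
    """Combine a memory space name and an offset into one 32 bit value."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in 30 bits")
    return (SPACES.index(space) << OFFSET_BITS) | offset


def split_address(addr):
    """Recover (space, offset) from a tagged 32 bit address."""
    return SPACES[(addr >> OFFSET_BITS) & 0b11], addr & OFFSET_MASK
```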

ersmith wrote: »

Yes, but then it's not really C any more. Which is fair enough -- the AUX stack model could be useful for some languages, and certainly for PASM programmers. But for general purpose languages like C I don't think it will work.

It would be a subset, useful for driver/control apps that needed the most speed, not useful for compiling large apps (gcc, emacs et al)

Good discussion.

Bill Henning · 2013-12-04 08:40

ersmith wrote: »

All those changes create a language that is no longer "C" or "C++". Even modifying Spin to have those restrictions seems like it would be a lot of work.

The language would be C, with some limitations.

On other microcontrollers C is often limited to 1KB-4KB of ram total, however it would be a single address space, therefore easier on gcc.

Would the limitations rule out compiling large, normal PC style applications with an AUX stack?

Absolutely.

But the point is to make fast driver / control apps.

ersmith wrote: »

I actually think the VLIW RDQUAD LMM would be easier. GCC already supports VLIW processors, and that support is pretty robust. There are quite a few VLIW processors on the market already, so there's a lot of experience with them. The GCC support for split address spaces is still very tentative and untried, and I don't know of any processors that use it (except possibly for the Cell SPE, but each one of those has 256KB rather than 2KB, so there's a lot more room for local variables).

Eric

I think the VLIW would be more work, mostly due to scheduling issues, however I hope you are right about VLIW being not too hard to implement, in case Chip does not add the hub exec mode, as then the VLIW approach would be the only one to really speed gcc up.

Bill Henning · 2013-12-04 08:41

David, that is an EXCELLENT idea!

16 data bits + 9 register address bits might barely fit if WC,WZ,I,CCCC were used to hold part of the register address.

TTTTTTTdddddddddxxxxxxxxxxxxxxxx

where T = opcode, d = dest reg, x = data
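The layout above can be sketched as a pack/unpack pair in Python. The field widths (7 + 9 + 16 = 32) come from the pattern above; the function names and the opcode value used in any call are placeholders, not real P2 encodings.

```python
OPCODE_BITS, DEST_BITS, DATA_BITS = 7, 9, 16


def pack(opcode, dest, data):
    """Pack T (7 bits), d (9 bits) and x (16 bits) into one 32 bit word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 7 bits")
    if not 0 <= dest < (1 << DEST_BITS):
        raise ValueError("dest does not fit in 9 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in 16 bits")
    return (opcode << (DEST_BITS + DATA_BITS)) | (dest << DATA_BITS) | data


def unpack(word):
    """Split a 32 bit word back into (opcode, dest, data)."""
    opcode = (word >> (DEST_BITS + DATA_BITS)) & ((1 << OPCODE_BITS) - 1)
    dest = (word >> DATA_BITS) & ((1 << DEST_BITS) - 1)
    data = word & ((1 << DATA_BITS) - 1)
    return opcode, dest, data
```

Because the three fields fill all 32 bits, nothing is left for the WC, WZ, I and CCCC bits, which is why they would have to be given up for this encoding.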

David Betz wrote: »
I've been thinking about the problem of loading 32 bit constants and issues with the pipeline and was wondering if we could free up enough opcode space for two new instructions like the current REPS instruction.

----		1111110 10 n nnnn nnnnnnnnn nnniiiiii		REPS	#1..$20000,#1..64

The idea would be to have an instruction to load the low 16 bits of a register and another to load the high 16 bits. These would take up the same amount of space as the current proposal to use a RDLONGC reg, PTRA+ instruction followed by the 32 bit constant but would not have any pipeline issues. I guess the trick would be to make space for a 9 bit destination field as well as a 16 bit immediate field for a total of 25 bits. The REPS instruction has 23 bits of arguments. I guess that would be enough if we're willing to limit the D field to only 7 bits so it could only address the first 128 COG locations.

Anyway, the instructions would be as follows:

    LDLO D, #$nnnn
    LDHI D, #$nnnn

Is there any way to fit these into the instruction set bit encodings?

David Betz · 2013-12-04 09:11

Bill Henning wrote: »

David, that is an EXCELLENT idea!

16 data bits + 9 register address bits might barely fit if WC,WZ,I,CCCC were used to hold part of the register address.

TTTTTTTdddddddddxxxxxxxxxxxxxxxx

where T = opcode, d = dest reg, x = data

I guess another possibility might be to let the MOVS instruction get used to select the destination register and load the low 9 bits. Then, the following instruction could be encoded like REPS and load the remaining 23 bits into the same register that was used by the preceding MOVS instruction. That would only require a single new instruction and it wouldn't have to have a D field. It would also require a 9 bit internal register (one for each thread) to remember the D field of the previous MOVS instruction.

David Betz · 2013-12-04 09:18

David Betz wrote: »

I guess another possibility might be to let the MOVS instruction get used to select the destination register and load the low 9 bits. Then, the following instruction could be encoded like REPS and load the remaining 23 bits into the same register that was used by the preceding MOVS instruction. That would only require a single new instruction and it wouldn't have to have a D field. It would also require a 9 bit internal register (one for each thread) to remember the D field of the previous MOVS instruction.

Actually, even better would be to have the new instruction be a prefix instruction that just loaded 23 bits into an internal register. Then, the immediate field of the following instruction would use its own 9 bit immediate value along with the value in the 23 bit internal register to form a full 32 bit immediate operand. Then you could use 32 bit immediates with any instruction.

Edit: Eric will recognize this as the "big" prefix we had in the MPE processor back in VM Labs days. :-)
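A small Python sketch of how the prefix could form a 32 bit operand. The post does not say which end the 23 prefix bits go on; this assumes they become the upper 23 bits and the following instruction's own 9 bit immediate supplies the low 9 bits. Function names are hypothetical.

```python
PREFIX_BITS = 23
IMM_BITS = 9


def big_extend(prefix23, imm9):
    """Form a 32 bit immediate from the 23 bit prefix and a 9 bit field."""
    return ((prefix23 & ((1 << PREFIX_BITS) - 1)) << IMM_BITS) | (
        imm9 & ((1 << IMM_BITS) - 1)
    )


def split_for_big(value):
    """Split a 32 bit constant into (prefix23, imm9) for the two instructions."""
    return (value >> IMM_BITS) & ((1 << PREFIX_BITS) - 1), value & (
        (1 << IMM_BITS) - 1
    )
```

Under this assumption a prefix of zero leaves an ordinary 9 bit immediate unchanged, so instructions without a prefix would behave exactly as they do today.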

Bill Henning · 2013-12-04 09:25

Both of those would also work!

LOL - I like "BIG"

David Betz · 2013-12-04 09:26

Bill Henning wrote: »

Both of those would also work!

LOL - I like "BIG"

The BIG instruction would allow you to extend any S field to 32 bits which means you could read or write any location in hub memory with a two instruction sequence.

Bill Henning · 2013-12-04 09:39

I like it.

I think the cycle after it is used it should be cleared to zero in order to avoid unexpected results.

David Betz wrote: »

The BIG instruction would allow you to extend any S field to 32 bits which means you could read or write any location in hub memory with a two instruction sequence.

David Betz · 2013-12-04 09:40

Bill Henning wrote: »

I like it.

I think the cycle after it is used it should be cleared to zero in order to avoid unexpected results.

Yes, that would be required. Any instruction other than the BIG instruction should clear the BIG register after executing.

Bill Henning · 2013-12-04 09:43

The ideal opcode for BIG would be 0000000

It would then execute as a NOP (other than loading the hidden BIG register), and could be referenced as just another cog variable from any code.

Basically, this means NOP turns into BIG, and no need to free up an opcode!

David Betz wrote: »

Yes, that would be required. Any instruction other than the BIG instruction should clear the BIG register after executing.

Bill Henning · 2013-12-04 09:45

argh,

Can't use NOP... could cause problems with tasks (unless there was a separate BIG hidden register for each task)

David Betz · 2013-12-04 09:48

Bill Henning wrote: »

argh,

Can't use NOP... could cause problems with tasks (unless there was a separate BIG hidden register for each task)

Yes, I think we would need a separate BIG register for each thread. I wonder if 23 x 4 is a deal breaker?
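To make the per-thread idea concrete, here is a hedged Python model with one hidden 23 bit register per task, cleared after any non-BIG instruction as discussed above. The class, method names and the "MOV" placeholder are illustrative only, and the model again assumes the prefix supplies the upper 23 bits of the operand.

```python
class BigRegisters:
    """Toy model: one hidden BIG register for each of the four tasks."""

    NUM_TASKS = 4
    BIG_BITS = 23

    def __init__(self):
        self.big = [0] * self.NUM_TASKS

    def execute(self, task, op, operand):
        """Run one instruction for a task.

        BIG loads that task's hidden register and returns None. Any other
        instruction returns its effective 32 bit immediate and then clears
        that task's hidden register.
        """
        if op == "BIG":
            self.big[task] = operand & ((1 << self.BIG_BITS) - 1)
            return None
        value = (self.big[task] << 9) | (operand & 0x1FF)
        self.big[task] = 0
        return value
```

In this model an instruction from another task that runs between a BIG and its consumer does not disturb the pending prefix, at a cost of 23 x 4 = 92 bits of state.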

ersmith · 2013-12-04 09:53

Bill Henning wrote: »

Using the top two bits of a 32 bit internal address should work. I think you are using something similar already for cog vs. hub.

The C compiler never sees addresses, only symbols. It's the linker that actually assigns addresses, and by the time we get to the linker the instructions have already been chosen.

The flow is: GCC outputs assembly code (using symbolic constants). The assembler takes this and turns it into an object file, and then the linker puts the object files and libraries together to make the final binary.

ersmith · 2013-12-04 09:55

David:

I really like your idea of having some way to encode a 32 bit constant in two instructions (either with a BIG prefix or LDLO/LDHI instructions). It would make a lot of LMM related work much easier, since in LMM mode we don't have the option of just putting the constant into a register.

Eric

Bill Henning · 2013-12-04 10:04

Adding a prefix, or postfix, should work then.

variablename_t

where t is

a - for aux
c - for cog
h - for hub
x - for xmm

or something similar - the above was just an example.

I thought there were also some attribute bits for every symbol, for static / volatile / ...

ersmith wrote: »

The C compiler never sees addresses, only symbols. It's the linker that actually assigns addresses, and by the time we get to the linker the instructions have already been chosen.

The flow is: GCC outputs assembly code (using symbolic constants). The assembler takes this and turns it into an object file, and then the linker puts the object files and libraries together to make the final binary.

Bill Henning · 2013-12-04 10:10

It would also help with a hub exec model, as there would be no need for

RDLONGC reg, ptra++
long value

it would be replaced by

BIG value
movi reg,#msb's ' this way the value can be right-justified

ersmith wrote: »

David:

I really like your idea of having some way to encode a 32 bit constant in two instructions (either with a BIG prefix or LDLO/LDHI instructions). It would make a lot of LMM related work much easier, since in LMM mode we don't have the option of just putting the constant into a register.

Eric

Bill Henning · 2013-12-04 10:19

I am really liking BIG...

BIG hubaddr
RDLONG reg, #0 <-- immediate bit, or C bit can be used to select the hidden BIG register as the source of the address

Now this would also be great for LMM and hub exec...

Simpler than keeping track of physical hub cache slot addresses and using those directly.

ersmith · 2013-12-04 10:53

Bill Henning wrote: »

Adding a prefix, or postfix, should work then.
[...]
I thought there were also some attribute bits for every symbol, for static / volatile / ...

That would work for globals and statics, but stack allocated local variables don't even get symbols, just offsets to the stack. By the time we see them in the back end the machine independent front end has already done the address calculation and put the result into a register, so it looks just like a pointer reference (we get an RTL expression that looks like "*r1"). For P2 we could change the compiler to get something like *(offset + sp), as long as the offset fit. But the offset permitted in RDAUX is pretty small, and it wouldn't take much to overflow it (even just saving the whole set of registers for subroutine calls could do it), and then we'd be back to getting what looks like a generic pointer dereference.

David Betz · 2013-12-04 10:58

ersmith wrote: »

That would work for globals and statics, but stack allocated local variables don't even get symbols, just offsets to the stack. By the time we see them in the back end the machine independent front end has already done the address calculation and put the result into a register, so it looks just like a pointer reference (we get an RTL expression that looks like "*r1"). For P2 we could change the compiler to get something like *(offset + sp), as long as the offset fit. But the offset permitted in RDAUX is pretty small, and it wouldn't take much to overflow it (even just saving the whole set of registers for subroutine calls could do it), and then we'd be back to getting what looks like a generic pointer dereference.

Do you know if LLVM has any more flexibility in handling different address spaces? I'm not suggesting that we switch at this point but just wondering for future reference.

Bill Henning · 2013-12-04 11:20

Thanks, good discussion!

Basically the difficulty comes from GCC assuming a certain memory hierarchy, and it being difficult to work around the built-in assumptions of GCC.

Would it be possible for the front end to limit the number of locals + arguments, and map all types to longs in an AUX model? If that works, the small indexes would work...

I can definitely see why initially supporting AUX may not be worth it, but it is still worth continuing this discussion for potentially adding it later.

As long as AUX return stack instructions are available, Spin, VMs and other compilers can use it when appropriate (small/medium non-recursive programs) for a good performance gain and memory savings.

It may also be easier to support for the existing gcc cog-only mode, and it would allow larger/faster drivers in that mode.

ersmith wrote: »

That would work for globals and statics, but stack allocated local variables don't even get symbols, just offsets to the stack. By the time we see them in the back end the machine independent front end has already done the address calculation and put the result into a register, so it looks just like a pointer reference (we get an RTL expression that looks like "*r1"). For P2 we could change the compiler to get something like *(offset + sp), as long as the offset fit. But the offset permitted in RDAUX is pretty small, and it wouldn't take much to overflow it (even just saving the whole set of registers for subroutine calls could do it), and then we'd be back to getting what looks like a generic pointer dereference.
