Chip, go back to sleep, and re-read this thread when you are clear-headed.
Go to sleep, but do not come back to this forum anymore, because it is getting crazy ... (We are all guilty here! ... me too).
Cut your ethernet cable, break your cell phone. Ask someone to change your email account password and forget it. No more interruptions until P2 is finished.
I recommend a book to all of you guys: "The Supermen: The Story of Seymour Cray and the Technical Wizards Behind the Supercomputer", by Charles J. Murray. In the 70s, Cray and his team went to a small town with limited phone service and limited transportation to work on their next project.
BTW, the last project of Seymour Cray was to make a GaAs processor just to reduce latency. I wish P2 enough success to make P3 the first GaAs microcontroller (no 64 bits, no segments, no MMU, no Linux or any other legacy problem; just a few GHz of pure raw performance!).
Yes, but if that GaAs Propeller used a Four Dimensional Mobius Quantum Entangled Hypertransport Bus between the 32GByte HUB RAM and the 1024 COGs, then they could all execute LMM code at 9GHz instead of just 8.
It's a funny thing that bulk CMOS processes, due to relentless shrinkage, have practically eclipsed every exotic alternative process chemistry that's ever been developed. Even if the others are faster, they don't have the density, which translates to speed in another way. At least, this is my observation. If there were faster processes, Altera would be using them, but Intel, Altera, and everyone else is using CMOS for logic.
Many of you heard this, but when we were last synthesizing our Prop2 core, I asked the synthesis guy to run it in a 40nm process, just to see how fast it would go. It closed timing, no problem, at 1GHz, and the cell area was only 0.89 square mm. I think this current Prop2, with all we've added, would probably close timing in 28nm at nearly 2GHz. It's just a matter of a few $100k to prove it, and then about $3M to get it into production (the really expensive part).
I remember reading somewhere that the Altera Stratix 10 uses Intel's 14nm technology.
I hate to think how much that FPGA demo board costs!
IIRC the same article said something about Intel building a new 7nm fab facility!
I thought you were mistaken, at first, because I thought Stratix V was their fastest. They took a real leap in naming here. These are supposed to run designs at GHz speeds. I bet Prop2 would run at around 700MHz on this FPGA. They make it sound VERY intriguing:
I wonder if we can get a box of SAMPLE dev boards out of them for evaluation purposes. LOL
I agree that it is not an hour's work, but it would take far less time than CMM took, for example.
Actually I don't think so. CMM was pretty straightforward (the CMM instruction set is a superset of the LMM one). Changing to an AUX stack is a much more difficult undertaking. As far as I can tell, by the time memory references get to the machine independent back-end they all look the same (with the special exception of push/pop; but those aren't used for variable access). If you know of a way to distinguish stack and other memory accesses in GCC please let us know, it would allow us to speed up XMM stack memory accesses even in P1.
If the stack model limited arguments + variables, the new indexed stack addressing mode can be used. It would be somewhat wasteful, as everything would have to be longs, but it would work.
Yes, but then it's not really C any more. Which is fair enough -- the AUX stack model could be useful for some languages, and certainly for PASM programmers. But for general purpose languages like C I don't think it will work.
To help anyone who wants to try, here is how I would do it:
- limiting the total number of arguments and local variables to the index range from SPA is important
- use longs even for byte/word local variables
- do not store arrays on the AUX stack, use space on a hub stack
- either forbid taking the address of local variables, OR
- any local variables that have their address taken should live on a hub stack (the easier fix of the two), OR
- (non-trivial gcc change) add an 'aux' attribute to arguments and local variables that the code generator can use to emit SPA indexed references
- for debugging runs, use the WC/WZ flags I described to check for stack over/under flow
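To make the restrictions above concrete, here is a hedged C sketch of which functions would and would not fit the proposed AUX stack model. The function names and bodies are invented for illustration; nothing here is actual PropGCC output.

```c
#include <stdint.h>

/* Fits the proposed AUX model: a handful of scalar locals, all
   promotable to longs, no arrays, no address-of on any local. */
int32_t aux_ok(int32_t a, int32_t b) {
    int32_t sum = a + b;                 /* would live in an SPA-indexed slot */
    int16_t half = (int16_t)(sum >> 1);  /* stored as a full long anyway */
    return half;
}

/* Forces a hub stack under the rules above: the array and the
   address-of would both have to live in hub RAM. */
int32_t aux_not_ok(int32_t n) {
    int32_t buf[8];            /* arrays stay on a hub stack */
    int32_t *p = &buf[0];      /* taking an address rules out AUX */
    for (int i = 0; i < 8; i++)
        p[i] = n + i;
    return buf[7];
}
```

The point of the split is that everything in `aux_ok` can be reached with small SPA-indexed accesses, while `aux_not_ok` degrades to ordinary hub pointer dereferences.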
All those changes create a language that is no longer "C" or "C++". Even modifying Spin to have those restrictions seems like it would be a lot of work.
Note, in an absolute sense, it would not be "easy", however it would be far easier than a VLIW-style RDQUAD LMM would be.
I actually think the VLIW RDQUAD LMM would be easier. GCC already supports VLIW processors, and that support is pretty robust. There are quite a few VLIW processors on the market already, so there's a lot of experience with them. The GCC support for split address spaces is still very tentative and untried, and I don't know of any processors that use it (except possibly for the Cell SPE, but each one of those has 256KB rather than 2KB, so there's a lot more room for local variables).
I've been thinking about the problem of loading 32 bit constants and issues with the pipeline and was wondering if we could free up enough opcode space for two new instructions like the current REPS instruction.
---- 1111110 10 n nnnn nnnnnnnnn nnniiiiii REPS #1..$20000,#1..64
The idea would be to have an instruction to load the low 16 bits of a register and another to load the high 16 bits. These would take up the same amount of space as the current proposal to use a RDLONGC reg, PTRA+ instruction followed by the 32 bit constant but would not have any pipeline issues. I guess the trick would be to make space for a 9 bit destination field as well as a 16 bit immediate field for a total of 25 bits. The REPS instruction has 23 bits of arguments. I guess that would be enough if we're willing to limit the D field to only 7 bits so it could only address the first 128 COG locations.
Anyway, the instructions would be as follows:
LDLO D, #$nnnn
LDHI D, #$nnnn
Is there any way to fit these into the instruction set bit encodings?
Using the top two bits of a 32 bit internal address should work. I think you are using something similar already for cog vs. hub.
00 - cog address
01 - aux address
10 - hub address
11 - ext address
This should survive to the machine independent back end, and the code generator should be able to emit different code based on the top two bits of the address. It does limit the address space for hub / xmm to 1GB, but I don't think that is an issue for the P2.
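As an illustration only (the 00/01/10/11 assignment is the example from this post, not anything the toolchain actually implements), the tagging could be modeled like this in C:

```c
#include <stdint.h>

/* Top two bits of a 32-bit internal address select the memory
   space, leaving 30 bits of offset within that space. */
enum space { SPACE_COG = 0, SPACE_AUX = 1, SPACE_HUB = 2, SPACE_EXT = 3 };

enum space addr_space(uint32_t addr) {
    return (enum space)(addr >> 30);
}

uint32_t addr_offset(uint32_t addr) {
    return addr & 0x3FFFFFFFu;
}

uint32_t make_addr(enum space s, uint32_t offset) {
    return ((uint32_t)s << 30) | (offset & 0x3FFFFFFFu);
}
```

A code generator seeing such an address could dispatch on `addr_space()` to pick RDAUX vs. RDLONG vs. an XMM access sequence.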
It would be a subset, useful for driver/control apps that needed the most speed, not useful for compiling large apps (gcc, emacs et al)
The language would be C, with some limitations.
On other microcontrollers C is often limited to 1KB-4KB of RAM total; however, that is a single address space, and therefore easier on gcc.
Would the limitations rule out compiling large, normal PC style applications with an AUX stack?
Absolutely.
But the point is to make fast driver / control apps.
I think the VLIW would be more work, mostly due to scheduling issues; however, I hope you are right about VLIW not being too hard to implement, in case Chip does not add the hub exec mode, since then the VLIW approach would be the only one that could really speed gcc up.
16 data bits + 9 register address bits might barely fit if WC,WZ,I,CCCC were used to hold part of the register address.
TTTTTTTdddddddddxxxxxxxxxxxxxxxx
where T = opcode, d = dest reg, x = data
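To make the proposed pair concrete, here is a hedged C sketch of the register semantics. The names LDLO/LDHI come from the proposal above; the assumption that LDLO clears the upper half (so a fresh register needs no prior clearing) is my own guess, not anything specified.

```c
#include <stdint.h>

/* Hypothetical semantics for the proposed pair:
   LDLO writes the low 16 bits of D and clears the high half (assumed);
   LDHI writes the high 16 bits and leaves the low half alone. */
uint32_t ldlo(uint32_t d, uint16_t imm) {
    (void)d;                   /* assumed: LDLO replaces the whole register */
    return (uint32_t)imm;
}

uint32_t ldhi(uint32_t d, uint16_t imm) {
    return (d & 0x0000FFFFu) | ((uint32_t)imm << 16);
}

/* Loading $DEADBEEF would then be the two-instruction sequence:
     LDLO D, #$BEEF
     LDHI D, #$DEAD */
uint32_t load32(uint16_t lo, uint16_t hi) {
    uint32_t d = ldlo(0, lo);
    return ldhi(d, hi);
}
```

Either ordering would work if LDHI also preserved the low half on its own, but the clear-on-LDLO assumption makes the pair order-sensitive.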
I guess another possibility might be to let the MOVS instruction get used to select the destination register and load the low 9 bits. Then, the following instruction could be encoded like REPS and load the remaining 23 bits into the same register that was used by the preceding MOVS instruction. That would only require a single new instruction, and it wouldn't have to have a D field. It would also require a 9 bit internal register (one for each thread) to remember the D field of the previous MOVS instruction.
Actually, even better would be to have the new instruction be a prefix instruction that just loaded 23 bits into an internal register. Then, the immediate field of the following instruction would use its own 9 bit immediate value along with the value in the 23 bit internal register to form a full 32 bit immediate operand. Then you could use 32 bit immediates with any instruction.
Edit: Eric will recognize this as the "big" prefix we had in the MPE processor back in VM Labs days. :-)
The BIG instruction would allow you to extend any S field to 32 bits which means you could read or write any location in hub memory with a two instruction sequence.
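As a sketch of the mechanics (the 23-high/9-low split follows the description above; the names and the hidden per-task register are illustrative, not a real encoding):

```c
#include <stdint.h>

/* Hidden 23-bit prefix register; the real design would need
   one of these per hardware task. */
static uint32_t big_reg;

/* BIG parks 23 bits for the next instruction to pick up. */
void big(uint32_t imm23) {
    big_reg = imm23 & 0x7FFFFFu;
}

/* What the following instruction would see as its extended S
   operand: its own 9-bit immediate supplies the low bits. */
uint32_t extend_s(uint32_t imm9) {
    return (big_reg << 9) | (imm9 & 0x1FFu);
}
```

So the two-instruction sequence `BIG #high23` followed by `RDLONG reg, #low9` would synthesize any 32-bit hub address.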
Using the top two bits of a 32 bit internal address should work. I think you are using something similar already for cog vs. hub.
The C compiler never sees addresses, only symbols. It's the linker that actually assigns addresses, and by the time we get to the linker the instructions have already been chosen.
The flow is: GCC outputs assembly code (using symbolic constants). The assembler takes this and turns it into an object file, and then the linker puts the object files and libraries together to make the final binary.
I really like your idea of having some way to encode a 32 bit constant in two instructions (either with a BIG prefix or LDLO/LDHI instructions). It would make a lot of LMM related work much easier, since in LMM mode we don't have the option of just putting the constant into a register.
Adding a prefix, or postfix, should work then.
[...]
I thought there were also some attribute bits for every symbol, for static / volatile / ...
That would work for globals and statics, but stack allocated local variables don't even get symbols, just offsets to the stack. By the time we see them in the back end the machine independent front end has already done the address calculation and put the result into a register, so it looks just like a pointer reference (we get an RTL expression that looks like "*r1"). For P2 we could change the compiler to get something like *(offset + sp), as long as the offset fit. But the offset permitted in RDAUX is pretty small, and it wouldn't take much to overflow it (even just saving the whole set of registers for subroutine calls could do it), and then we'd be back to getting what looks like a generic pointer dereference.
Do you know if LLVM has any more flexibility in handling different address spaces? I'm not suggesting that we switch at this point, just wondering for future reference.
Basically the difficulty comes from GCC assuming a certain memory hierarchy, and it being difficult to work around GCC's built-in assumptions.
Would it be possible for the front end to limit the number of locals + arguments, and map all types to longs in an AUX model? If that works, the small indexes would work...
I can definitely see why initially supporting AUX may not be worth it, but it is still worth continuing this discussion for potentially adding it later.
As long as AUX return stack instructions are available, Spin, VMs, and other compilers can use it when appropriate (small/medium non-recursive programs) for a good performance gain and memory savings.
It may also be easier to support for the existing gcc cog-only mode, and it would allow larger/faster drivers in that mode.
Comments
We have to tell Chip about this!
Makes sense.
You have convinced me that CMM was easier to add to GCC than the AUX stack model.
Good discussion.
LOL - I like "BIG"
I think that, on the cycle after it is used, it should be cleared to zero in order to avoid unexpected results.
It would then execute as a NOP (other than loading the hidden BIG register), and could be referenced as just another cog variable from any code.
Basically, this means NOP turns into BIG, and no need to free up an opcode!
Can't use NOP... could cause problems with tasks (unless there was a separate BIG hidden register for each task)
variablename_t
where t is
a - for aux
c - for cog
h - for hub
x - for xmm
or something similar - the above was just an example.
Instead of:
RDLONGC reg, ptra++
long value
it would be replaced by:
BIG value
movi reg, #msb's ' this way the value can be right-justified
And a hub access would become:
BIG hubaddr
RDLONG reg, #0 ' the immediate bit, or the C bit, can be used to select the hidden BIG register as the source of the address
Now this would also be great for LMM and hub exec...
Simpler than keeping track of physical hub cache slot addresses and using those directly.