How does GCC- and/or Catalina-style compressed code compare to Spin bytecodes? CMM, or whatever it is called?
Given that there is no Spin bytecode interpreter in ROM on the PII, one could imagine adopting something completely different from the old bytecodes.
Is there any merit in deciding, on an object-by-object basis, whether to compile to byte/compressed code or to native PASM? Then no new syntax needs inventing, like CPUB/CPRI/CDAT/CVAR. Nothing would change in an object's source code to move it from being compiled to bytecodes to being compiled to native code. I'm not sure where or how we would specify the compile mode, though.
@Dave: The idea sounds good, but since then GCC has gained support for directing selected code to be compiled for a COG, I think (and it has inline PASM as well).
Would it make sense to look at the controls in GCC and apply the same or similar semantics to Spin, so there is not too much culture shock moving from one to the other?
I'm using the Nano with the latest jic. Is there a list of commands that were excluded to squeeze the build into the Nano?
I saw speculation on the required changes but no confirmed list.
Fantastic news Chip. Life will be so much simpler this way!
This is great stuff Chip. Having such regularity in behaviour is a huge win.
Instead of a CALLR instruction maybe it would be useful to have an instruction that sets a register location that is written to when any of the CALL instructions are used. It would be something like "SETRETREG register_number". This would cause all of the CALL instructions to write the return address to the designated register instead of writing it to the return stack. Does this make sense, or would it cause confusion? I think this would satisfy the requirement for the C compiler. David Betz, what do you think?
EDIT: I thought about it a bit more, and this might cause some problems when trying to mix code that expects to use the stack for the return address. So a CALLR instruction might be better as long as it works with both COG addresses and hub addresses.
Sorry, that would be confusing, and might interfere with the other modes of CALL operation - the intent is to NOT interfere with them, but provide something David/Eric/you need.
I think Chip will either pick $1F1 as the link register, or provide a SETLR (or as you called it, SETRETREG) for the CALLR instruction.
Bill, I agree, it would be confusing. I was just thinking out loud. So will the CALLR instruction allow for calling 9-bit and 16-bit constant addresses, and also calling indirectly through a register? It seems like that might require 2, or maybe 3 different instructions.
I was thinking of Sapieha's request for a CALL-equivalent to the new list. Something like that could be useful for VMs, and even more so for operating systems and libraries.
CALLVECT D,#n
CALLVECT D,S
CALLLIST looked weird with 3 L's, so I changed it to VECT for the example
D holds the base address of a WORD table in the hub
n or S are the index
This way it could dispatch to 512 system/VM routines. I think only the 4-level hardware stack version would be needed.
Comments
Early benchmarks indicated that CMM compared to Spin was about 1:1 on size and 2:1 on speed (source).
GCC can support compiling an entire cog (direct), inlining LMM/CMM code, and inline code for the FCACHE (effectively inline cog code).
Rich, See here
Brian
REPS
<spacer instruction>
<REPS block>
REPD
<spacer instruction>
<spacer instruction>
<spacer instruction>
<REPD block>
These should be working, as well, after the current compile completes:
JMPD/CALLD/RETD
<trailing instruction>
<trailing instruction>
<trailing instruction>
These cog changes will make all code execute the same way, no matter the task mix, so that everything can be written for what was single-task mode. Now we'll write all future code using these spacer/trailer rules and it will run optimally in single-task mode, but still work in multi-task mode without any pipeline issues.
I hope to have an update out tonight.
Thanks for your patience.
I am hoping to have some quality DE2-115 time in a couple of days
Thanks
Looks good.
Now the P2 will be user friendly (for compilers, too).
The only other pipeline issue regarding spacer instructions that I can think of is the write-before-execute issue, which must be coded with two spacers to work in single-task mode. In multi-task mode, zero, one, or two spacers may be needed, with two working for every single- and multi-task case:
ADD :i,#1
<spacer instruction>
<spacer instruction>
:i MOV OUTA,0
So, if these cases are coded with two spacers, they will work under any circumstance. I think that does it for unifying the coding rules across task mixes.
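As a rough illustration of the two-spacer rule above, here is a minimal sketch of the kind of sanity check an assembler or lint pass could run. Everything in it is an assumption made for illustration: the tuple layout, the names `find_hazards` and `REQUIRED_SPACERS`, and the simplification that any instruction whose D operand names a later label counts as a write to that instruction. It only looks at forward targets and does not model the pipeline itself.

```python
REQUIRED_SPACERS = 2  # from the post: two spacers work in every task mix


def find_hazards(program):
    """Flag write-before-execute cases with too few spacers.

    program is a list of (label_or_None, mnemonic, dest_operand) tuples
    (a hypothetical representation, not a real assembler format).
    Returns a list of (writer_index, target_index, spacers_found).
    """
    # Map each label to the index of the instruction it marks.
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    hazards = []
    for i, (_, _, dest) in enumerate(program):
        target = labels.get(dest)
        # Only forward writes to a labelled instruction are checked here.
        if target is None or target <= i:
            continue
        spacers = target - i - 1
        if spacers < REQUIRED_SPACERS:
            hazards.append((i, target, spacers))
    return hazards


# The example from the post, with two spacers (NOP stands in for any spacer).
good = [
    (None, "ADD", ":i"),
    (None, "NOP", None),
    (None, "NOP", None),
    (":i", "MOV", "OUTA"),
]

# Same code with only one spacer: fine in some task mixes, not in all.
bad = [
    (None, "ADD", ":i"),
    (None, "NOP", None),
    (":i", "MOV", "OUTA"),
]

print(find_hazards(good))
print(find_hazards(bad))
```

The point of coding to the worst case is that one check like this would cover single-task and multi-task use alike.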
Thanks for sticking around, Guys. I appreciate your help and enthusiasm.
Nice work!
Did you manage to squeeze in the 3 additional REPx blocks?
Brian
Great
I can see a safe mnemonic form for the first two (borrowing from how other DSPs manage this opcode),
like this
A matching 'safe' version of delayed exit is also worth having.
Perhaps something like this ?
In operation, the assembler does a simple sanity check on BlockStartAdr and ActualDepartureAdr and flags an error if they are wrong for that opcode (just like the other checks for an Adr out of range).
Ah, yes. I forgot to mention that each task has its own REPS/REPD circuit now.
Nice. How much added logic did that cost? IIRC, earlier comments suggested it was not insignificant.
It added over 1,000 flip-flops to the chip, which already has probably 60,000. Not a big deal.
Is it time to freeze things a bit? That next shuttle run is coming fast, isn't it?
I think things are getting very near to "done".
Next, I want to add a CALLR instruction which writes the return address to a register, instead of a stack. The C compiler guys really want this for leaf functions. Spin could use it, too. It's handy for pulling arguments right after the CALLR, then jumping to the final register value.
I have sort of a general coding question regarding Verilog. You have added loads and loads of opcodes, they need arguments, and you have huge fan-outs and muxes for the results. How is it that it is so fast (I mean 80+ MHz)?
I got the idea of having more than one "opcode" register, so to speak, so that each has a smaller fan-out. Probably nothing new...
Thanks.
Edit: Maybe the FPGAs are just that fast... Let's see... I'll try to compile my (un-optimized and sub-par) 6809 for the Cyclone V and see (it can do 40 MHz in the MachXO2 and 67 MHz in the Spartan3E; it only has 8- and 16-bit paths, but many muxes).
Edit: It can do 90 MHz on the Cyclone V (5CEFA2F23C8N).
Is there still a possibility of adding the non-hub flags and increasing the locks?
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1236830&viewfull=1#post1236830
Thanks,
Chris Wardell
The consistent operation will be a big help!
So, before the first P2 developer boards are made, is there going to be a signature sheet passed around so that anyone who contributed a feature suggestion or helped with the development and testing can sign it? How cool would it be to have an autographed mask on the board with Chip's signature and all the contributors' signatures?
That's fantastic! Having as much of the instruction set be task-agnostic in its usage is super helpful. I can imagine this resulting in a library of tight and simple background task code that can be quickly mixed in with larger cog programs, or used on its own.
CALLR D/#16bitconst
That would allow addressing all 256KB (64K longs) of hub; the other CALLx instructions distinguish between cog and hub addresses as 0-511 = cog, 512+ = hub.
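To make the arithmetic and the address split concrete, here is a small sketch. It assumes the 16-bit constant is a long index (which is what "64KLONG" suggests) and that targets are told apart purely by value, as described for the other CALLx instructions. The names `COG_LIMIT`, `hub_span_bytes` and `classify_target` are made up for this example and are not part of any real tool.

```python
COG_LIMIT = 512        # 0-511 = cog, 512+ = hub, per the post
BYTES_PER_LONG = 4


def hub_span_bytes(address_bits=16):
    """Bytes reachable when an address_bits-wide constant indexes longs."""
    return (1 << address_bits) * BYTES_PER_LONG


def classify_target(addr, address_bits=16):
    """Return 'cog' or 'hub' for a call target, using the 0-511 / 512+ split."""
    if not 0 <= addr < (1 << address_bits):
        raise ValueError("address does not fit in the constant field")
    return "cog" if addr < COG_LIMIT else "hub"


print(hub_span_bytes())        # 64K longs * 4 bytes
print(classify_target(0x1F1))  # a cog register address
print(classify_target(0x200))  # first address treated as hub
```

With 16 bits, 65,536 longs times 4 bytes gives 262,144 bytes, which is the 256KB figure quoted above.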
It just keeps getting better and better!
It bears thinking on, even if only for P3.
If the addresses in the word table are relative to their position, this would get us DLLs.
The calling task/cog would place the start of the DLL in the register "dllbase"
Then it could call any routine in the DLL with
CALLVECT dllbase,#routineindex ' (0..511)<<2
A dll would be:
WORD @function0
WORD @function1
....
WORD @functionN-1
' local DLL data area
function0:
function1:
...
As relative addresses would be used, there is no need for a relocating loader.
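Here is a minimal simulation of why relative table entries make the relocating loader unnecessary. It is only a sketch under stated assumptions: hub memory is modelled as a `bytearray`, table entries are 16-bit little-endian words, and each entry holds an offset measured from the start of the table (one possible reading of "relative to their position"). `build_dll`, `load` and `resolve` are hypothetical helpers, and CALLVECT itself is only a proposed instruction.

```python
import struct

WORD_BYTES = 2  # assumed size of one table entry


def build_dll(function_offsets):
    """Pack a vector table of 16-bit offsets, each relative to the table base."""
    return struct.pack("<%dH" % len(function_offsets), *function_offsets)


def load(hub, dllbase, image):
    """Copy the DLL image into hub memory at dllbase, with no fix-ups."""
    hub[dllbase:dllbase + len(image)] = image


def resolve(hub, dllbase, index):
    """Absolute address a CALLVECT-style dispatch would end up calling."""
    (offset,) = struct.unpack_from("<H", hub, dllbase + index * WORD_BYTES)
    return dllbase + offset


hub = bytearray(4096)
image = build_dll([0x10, 0x24, 0x40])  # function0..function2 offsets

# The same unmodified image loaded at two different base addresses.
load(hub, 0x200, image)
load(hub, 0x800, image)

print(hex(resolve(hub, 0x200, 1)))
print(hex(resolve(hub, 0x800, 1)))
```

The image bytes are identical at both load addresses; only the value held in dllbase changes, which is the property the post is after.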