When in the cog, all registers are long, with their addresses being contiguous integers. The PC steps by 1.
When in the hub, instructions take 4 bytes. The PC steps by 4.
To bridge the two contexts, two simple things are done:
The 9-bit-constant relative branches DJNZ/DJZ/TJZ/... encode the -256..+255 instruction range into their S field. When in cog exec, that value is sign-extended and added to the PC. When in hub exec, it is shifted left two bits and used the same way. This way, both cog and hub contexts get the max use out of these instructions and maintain binary compatibility.
The 20-bit-constant relative branches JMP/CALL/CALLA/... are encoded for hub exec as you would imagine, where they track byte offsets. When the cog uses these branches, it shifts the offset right two bits to get a cog-relative value, so in cog code the assembler stores the offset pre-multiplied by 4. These instructions are therefore binary compatible between cog/lut and hub code.
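The two rules above can be sketched as a small Python model. This is only an illustration of the arithmetic as described in this post, not the actual hardware logic; the function names are made up, and it returns just the amount added to the PC without modelling which PC value the hardware uses as the base.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def branch_delta_9bit(s_field, hub_exec):
    """PC delta for DJNZ-style branches (hypothetical helper name).

    The S field holds an instruction count in -256..+255.
    Cog exec adds it as-is (PC counts longs); hub exec shifts it
    left two bits first (PC counts bytes).
    """
    offset = sign_extend(s_field, 9)
    return offset << 2 if hub_exec else offset


def branch_delta_20bit(field, hub_exec):
    """PC delta for JMP/CALL-style branches (hypothetical helper name).

    The field holds a byte offset. Hub exec uses it directly;
    cog exec shifts it right two bits to get a long offset.
    """
    offset = sign_extend(field, 20)
    return offset if hub_exec else offset >> 2


# Same encoded bits, interpreted in each context:
print(branch_delta_9bit(0x1FF, hub_exec=False))     # one instruction back, in longs
print(branch_delta_9bit(0x1FF, hub_exec=True))      # one instruction back, in bytes
print(branch_delta_20bit(0xFFFF8, hub_exec=True))   # two instructions back, in bytes
print(branch_delta_20bit(0xFFFF8, hub_exec=False))  # two instructions back, in longs
```

In both cases the same bit pattern moves the PC by the same number of instructions, which is the binary compatibility being claimed.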
REP now works in hub exec by forcing a jump during the last instruction in the repeat block. It didn't take much logic to implement and it works just as you'd expect. Even though it's slow in hub exec, because of the branching on each iteration, it is a convenient instruction to have for doing simple loops.
The assembler generates the same code for relative branches and REP in both cog/lut exec and hub exec contexts.
I will have updated FPGA files done tomorrow. I just finished the Prop123-A7 compile and now I need to make the DE2-115 version.
Here's what the all_cogs_blink program looks like now. Note the ORGH and the REP:
dat
        orgh    $400
' launch 15 cogs (cog 0 falls through and runs 'blink', too)
' any cogs missing from the FPGA won't blink
        loc     x,@blink
        rep     @repend,#15
        coginit #16,x
repend
blink   cogid   x               'which cog am I?
        setb    dirb,x          'make that pin an output
        notb    outb,x          'flip its output state
        add     x,#16           'add to my id
        shl     x,#18           'shift up to make it big
        waitx   x               'wait that many clocks
        jmp     @blink          'do it again
        org
x       res     1               'variable at cog register 8
Looks good, except I find it odd that the 9-bit immediate addresses get treated as long addresses but the 20-bit addresses get treated as byte addresses. Seems like it would be better if they were both shifted left by 2 to get hub addresses, for consistency. Then all immediate address fields are treated as long addresses or long offsets.
This is precisely what I have been pushing for without success. Makes for a standard instruction model of all long addresses. The consequence of this is all instructions must be long aligned, which IMHO is fine.
Yes, I realize I didn't originate this idea. I just thought I'd give it one more push! :-)
Chip doesn't want to give up unaligned code in HUB. It does allow for simpler mixed code and data, as in his example with strings.
I think the only real argument here is that because of this it's still possible to write HUB code that is not binary compatible to also run in COG/LUT. His example with mixed data and code will not work if copied into a cog. Of course, even if he made hub instructions long aligned only, his example wouldn't work in cog anyway because it needs byte access to the string data.
The only real complete solution would actually be to go in the other direction and make it so COG/LUT could run unaligned code and also have byte/word data access (rdxxxx/wrxxxx would need to be able to read COG/LUT addresses, and the memory map would have to change so hub would not start at 0). I don't think this is feasible to do for P2's timeframe, if at all with this architecture.
ok, here's a conundrum, you guys can all agree upon:)
dat
        orgh    1
' launch cog 1 (cog 0 falls through and runs 'blink', too)
        coginit #1,#blink
blink   cogid   x               'which cog am I?
        setb    dirb,x          'make that pin an output
        notb    outb,x          'flip its output state
        waitx   myval           'wait that many clocks
        jmp     @blink          'do it again
        org
myval   long    $2FAF080        '50_000_000
x       res     1
2 issues:
1. When cog0 gets to waitx ... it doesn't get myval, effectively executing the following:
waitx #0
And the cog0 LED stays on for about a minute and 24 seconds... presumably until waitx rolls over its counter.
2. When cog1 gets to waitx, it doesn't get myval either, but it does get a value: a pretty big one, and apparently always the same value (or nearly), but not the one I am trying to send it (myval).
What am I doing wrong??
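The rollover guess in issue 1 can be sanity-checked with a little arithmetic. This sketch assumes a 32-bit wait counter and a 50 MHz clock (the clock rate is only inferred from the 50_000_000 value being used as a delay; neither figure is confirmed in this thread), and the function name is made up.

```python
def waitx_rollover_seconds(clock_hz, counter_bits=32):
    """Seconds for a free-running counter of `counter_bits` bits to wrap once."""
    return (1 << counter_bits) / clock_hz


# Assumed 50 MHz clock; adjust to the actual FPGA clock if it differs.
seconds = waitx_rollover_seconds(50_000_000)
minutes, remainder = divmod(seconds, 60)
print(f"{int(minutes)} min {remainder:.1f} s")
```

Under those assumptions a full wrap takes roughly 86 seconds, which is in the same range as the observed "about a minute and 24 seconds".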
You are executing in hub exec mode. But you are treating myval like you are running in cog exec mode.
The problem is that data elements are not automatically copied up into the COG, so your code is reading address $10 in the COG memory for myval, but it's not being initialized to your 50 million value. You need to copy the value into the COG register yourself in code.
con
        'myvalue = 50_000_000
dat
        orgh    1
' launch cog 1 (cog 0 falls through and runs 'blink', too)
        coginit #1,#blink
myvalue long    $2FAF080        '50_000_000
blink   cogid   x               'which cog am I?
        setb    dirb,x          'make that pin an output
        notb    outb,x          'flip its output state
        rdlong  mycogvalue,##myvalue
        waitx   mycogvalue      'wait that many clocks
        jmp     @blink          'do it again
        org
x       res     1
mycogvalue res  1
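The difference between the failing and working versions can be pictured with a toy Python model in which hub RAM and the cog register file are separate memories. Everything here is a simplified assumption for illustration: the addresses, the register index, the stale register content of 0, and the helper names are all hypothetical.

```python
COG_REGISTERS = 512


def make_cog(stale=0):
    """A cog register file that nothing has loaded yet.

    `stale` stands in for whatever the registers happen to hold;
    the real content is not specified in this thread.
    """
    return [stale] * COG_REGISTERS


def rdlong(cog, dest_reg, hub, hub_addr):
    """Copy one little-endian long from hub RAM into a cog register."""
    cog[dest_reg] = int.from_bytes(hub[hub_addr:hub_addr + 4], "little")


# Hub RAM holds the assembled image, including the data long.
hub = bytearray(64)
MYVALUE_ADDR = 8     # hypothetical hub byte address of 'myvalue'
hub[MYVALUE_ADDR:MYVALUE_ADDR + 4] = (50_000_000).to_bytes(4, "little")

MYCOGVALUE = 9       # hypothetical cog register index of 'mycogvalue'
cog = make_cog()

# Hub exec without a copy: the register was never written.
before = cog[MYCOGVALUE]

# The fix: explicitly copy the hub long into the cog register first.
rdlong(cog, MYCOGVALUE, hub, MYVALUE_ADDR)
after = cog[MYCOGVALUE]

print(before, after)
```

The point of the model is only that a register operand reads the cog's own memory, so a value that lives in hub RAM has to be copied across before it can be used that way.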
Doesn't Intel use non-aligned code? Those instructions are variable length, and that's pretty common.
The way Chip has it now we get the best of both options. I understand it is compelling to just pick hub or cog, but the reality is the two have basic differences that make doing that largely impractical.
Additionally, we are back to cog code being simple and fun. This is important because cog code being easy and fun helps with learning, drivers, and/or getting the max performance.
Assembly language is looking fun now with these latest decisions.
That is a design goal guys. Higher level tool considerations are important too. And gcc, etc... will be just fine with what has been done.
Finally, as it stands right now, an on chip dev system has great potential. I want to see that happen. All of that is the "is fun" part of our design spec. Why not? How often is that part of the discussion? Never! So let's maximize that part right along with the practicalities.
Guys, do you think we should give COGINIT an option for loading a cog's RAM and then JMPing to it at $008? It would make it much easier to start up small programs. Having cogs start up in hub exec is like making everybody jump into the deep end of the pool.
Having everything start as hub exec is exactly how the Propeller 1 works. As far as I know there is no way to write a pure PASM program for the P1 without at least a few Spin byte codes to get it started.
Luckily, back in the dark ages, when the Prop Tool was Windows-only and before there was a BST, SimpleIDE, HomeSpun, OpenSpin, Catalina, Prop GCC, etc., there was at least Cliff Biffle's propasm https://github.com/cbiffle/propasm which did exactly that.
Btw, I like your solution re/ addressing. Personally, I plan on using the first 4k in the hub for mailboxes and system wide data.
Keep the short immediate address jumps in longs, and the branches in +/- longs. Be green. Don't waste two address bits on 0's.
It would be a terrible waste to limit the 9 bit cog # branches to the first 128 cog addresses, and the 9 bit relative addresses to +/- 64 longs - it would be taking purism too far.
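The numbers in that last point follow from how many longs a 9-bit field can reach when its unit is a byte versus a long. A quick Python check (helper names are made up for this sketch):

```python
LONG_BYTES = 4


def absolute_reach(field_bits, unit_bytes):
    """Number of longs addressable by an unsigned field counted in `unit_bytes`."""
    return (1 << field_bits) * unit_bytes // LONG_BYTES


def relative_reach(field_bits, unit_bytes):
    """(lowest, highest) offset in longs for a signed field counted in `unit_bytes`."""
    half = 1 << (field_bits - 1)
    return (-half * unit_bytes // LONG_BYTES,
            (half - 1) * unit_bytes // LONG_BYTES)


# 9-bit field counted in bytes: the "purist" option being argued against.
print(absolute_reach(9, 1), relative_reach(9, 1))   # 128 (-64, 63)

# 9-bit field counted in longs: what the current scheme keeps.
print(absolute_reach(9, 4), relative_reach(9, 4))   # 512 (-256, 255)
```

Counting the field in bytes spends two of the nine bits on values that are always zero for long-aligned targets, which is where the factor of four goes.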
Choice is always good. It should be clear which mode a user is starting in.
Is there a cost to this option ?
Instead of doing cogram COGINIT in hardware, what if you modify COGINIT to take a SETQ x, where x initializes the new cog's ptra, and then put a subroutine in hubram like this:
DAT
                orgh
coginit_cogram  setq    #$1FF
                rdlong  0, ptra
                jmp     #8
Then, to load a cog P1-style, you would just do:
                setq    ##cog_code
                coginit #16, ##coginit_cogram   ' is augs+setq+augs+instr like this legal?
                ...
DAT
                org
cog_code        long    0[8]    ' special registers
                ' cogexec code
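What that hub-resident loader is meant to do can be sketched in Python. This assumes that SETQ #$1FF followed by RDLONG copies $200 consecutive longs starting at ptra into cog registers 0 upward, which is how the proposal reads but is not confirmed here. Hub RAM is simplified to a list of longs indexed by long, and the function name mirrors the label in the proposal purely for readability.

```python
COG_LONGS = 0x200   # full cog register file
ENTRY = 0x008       # first register after the 8 'special registers' slots


def coginit_cogram(hub_longs, ptra_index):
    """Model of the proposed loader: block-copy a cog image, then jump to $008.

    Returns the new cog's register contents and its starting PC.
    """
    image = hub_longs[ptra_index:ptra_index + COG_LONGS]
    if len(image) < COG_LONGS:
        raise ValueError("cog image runs past the end of hub memory")
    return list(image), ENTRY


# Toy hub memory where each long simply holds its own index.
hub = list(range(2048))
cog, pc = coginit_cogram(hub, ptra_index=100)
print(len(cog), cog[ENTRY], pc)
```

The eight padding longs at the start of cog_code exist so that the first real instruction lands at register $008 after the copy.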
Another bit for cog launching and 2..3 more cog ROM instructions. I'm thinking of supporting LUT load after cog load, too. It's about an hour of work. I'll do this.
Yes, provided it does not take up too much space.
However, this could also be a routine in hub loaded from ROM that the user could jump to via coginit.
...BTW wouldn't a cog start at $010 make more sense?
No. If interrupt vectors are present from $00A..$00F, they will look like NOPs. Otherwise, code can start at $008. A few instructions can always go at $008..$009. Being able to automatically load interrupt vectors without having to manually set them would be nice.
Unless we provide an option to protect the RAM copy of internal ROM, it's probably best we don't put routines and such in there that people would need to depend on.
waitx ##myvalue
but I still don't understand what my code is doing wrong.
Incidentally, this sort of mistake is going to be a common occurrence. Not sure what can be done about it, though.
This works
Future geeks will thank us.
Yes!
I would like that option on COGINIT, but I really like the start in HUBEXEC mode too, so I hope it stays having both options.
Also, will the initial startup still have the first cog start in HUBEXEC? I assume so, right?