Recursive/Reentrant code in COG?

We are all familaiar with the fact that the Propeller COG does not have a call stack. Rather it uses an old computer technique of writing the return address for subroutine calls to some known location where it can be picked up at the end of the subroutine (JMPRET).
This is not normally an issue for COG code as it's not often one want's to put a recursive algorithm in COG. (Has any one ever done so? Even the recursive Fibo benchmark?).
However we now have the possibility of running multiple threads in a COG. That means that different threads may want to call the same subroutine. Rather than having multiple copies of the same code in COG we migh need to make it a subroutine(s) to fit things in the space available.
What happens then?
Using normal COG CALL/RET won't work. Thread A can call a subroutine, which places the return address for thread A at the end of the routine (normally). Thread B migh call the same routine before thread A has returned. Ooops we have just now overwritten A's return address. Thread A could eaily end up running thread B's code or some such horror.
What we need is a stack based call return mechanism for thread safe calls.
Anyone attempted this yet?
This is not normally an issue for COG code as it's not often one want's to put a recursive algorithm in COG. (Has any one ever done so? Even the recursive Fibo benchmark?).
However we now have the possibility of running multiple threads in a COG. That means that different threads may want to call the same subroutine. Rather than having multiple copies of the same code in COG we migh need to make it a subroutine(s) to fit things in the space available.
What happens then?
Using normal COG CALL/RET won't work. Thread A can call a subroutine, which places the return address for thread A at the end of the routine (normally). Thread B migh call the same routine before thread A has returned. Ooops we have just now overwritten A's return address. Thread A could eaily end up running thread B's code or some such horror.
What we need is a stack based call return mechanism for thread safe calls.
Anyone attempted this yet?
Comments
That's a tricky one!
-Phil
See CALLA, CALLB, RETA, RETB
Andy
Using CALLA,RETA doesn't get around the multi thread issue of who called it first?
For example
Thread 1 calls routine CALLA #XYZ.
Thread 2 then does the same, CALLA #XYZ
When thread 1 completes call it hits a RETA which returns to thread 2's PC not thread 1's.
I think I've got that right?
Cheers
Brian
If so, all that is needed is for a call, which is actually a JMPRET to place the return address in some place unique to the thread where it can be picked up by another JMPRET at return time. Rather than placing the return address in the RETURN instruction as usual.
This would not allow recursive calls within a thread but would allow reentrant calls between threads as the return addresses get stored in different places.
No stack required.
Edit: Oops, sorry Phil that's what you said:
Yes you are right.
I think there are anyway other problems beside the return address. Also all the registers you use in the subroutine can get written by one thread and read by another. You will also need local registers on the stack and a stackpointer per thread.
But maybe Prop 2's register remapping can help here? I don't really understand that yet...
Andy
I think using SETMAP could get around the register overwrite issue and the return address storage issues.
The problem is the mechanism to retrieve the return address related to the current thread.
I'm dreaming here but if we had SPC and SPD as well as SPA and SB, we could have a CALLC and CALLD.
This in addition to a RETX for example that pops a return address from the stack based on thread number. Thread 0 pops SPA,1 pops SPB etc.
That would be nice. Just dreaming....(Sorry Chip, Ken, Beau and others..)
Cheers
Brian
In the back of my mind was the idea that every thread is working with it's own variable, via "register remapping" or what ever it is called. It was only the return addresses that woke me up in the middle of the night nagging that they are in the shared instruction space normally.
Now, where is that example of threads using common code and separate variables?
Seems to me the same technique of making calls and returns with JMPRET could be used by a thread such that it's return address is stored in it's thread local storage. Of course every subroutine you call would have to have it's own return address storage location defined.
We should not need to request any new instructions in the PII.
dat org 'longs are like nop's, get skipped pin long 0 'task 0 data count long 1 delay long 0 extra long 0 long 1 'task 1 data long 5 long 0 long 0 long 2 'task 2 data long 13 long 0 long 0 long 3 'task 3 data long 29 long 0 long 0 setmap #%1_010_010 'remap registers by task, 4 sets, 4 registers each settask #%%3210 'enable all tasks jmptask #loop,#%1111 'before any newly-started tasks get to execute stage, jump all tasks to loop loop notp pin 'toggle task x pin mov delay,count 'get task x delay djnz delay,#$ 'count down delay jmp #loop 'loop (count + 3 clocks)
Bingo, that's just what I was thinking of. Thanks.
So, can we do reentrant, thread safe, subroutines as simply as this:
1) Make the subroutine call with CALL as normal.
2) Ensure the subroutine_ret lable and return instruction are in the threads data area.
3) To return from the subroutine just jump there.
Like so:
pin long 0 'task 0 data count long 1 delay long 0 subroutine_ret ret ' Return address for thread safe calls of subroutine. long 1 'task 1 data long 5 long 0 long 0 long 2 'task 2 data long 13 long 0 long 0 long 3 'task 3 data long 29 long 0 long 0 setmap #%1_010_010 'remap registers by task, 4 sets, 4 registers each settask #%%3210 'enable all tasks jmptask #thread, #%1111 'before any newly-started tasks get to execute stage, jump all tasks to loop thread notp pin 'toggle task x pin mov delay,count 'get task x delay djnz delay,#$ 'count down delay call #subroutine 'Jump to subroutine ' saves return address in ' thread local subroutine_ret. jmp #loop 'loop (count + 3 clocks) subroutine 'bla bla add count, #1 'tweaks thread local count. 'bla bla jmp #subroutine_ret 'Returns via thread local return address. 'jmp subroutine_ret 'Or can we do like this?
Anyone got their FPGA set up to test this?
Wouldn't the RET instruction cause a problem when the cog is launched?
My understanding is that the variables are treated as NOP's and are ignored.
This relies on the values not being decoded as instructions?
Edit: Only 18 bit values can be used so there not decoded.
Brian
OK can't we just reserve a space with zero in it to store return addresses and and just do a JMP through that location (no #) for our returns? Like so:
pin long 0 'task 0 data count long 1 delay long 0 subroutine_ret long 0 ' Space for return address for thread safe calls of subroutine. long 1 'task 1 data long 5 long 0 long 0 long 2 'task 2 data long 13 long 0 long 0 long 3 'task 3 data long 29 long 0 long 0 setmap #%1_010_010 'remap registers by task, 4 sets, 4 registers each settask #%%3210 'enable all tasks jmptask #thread, #%1111 'before any newly-started tasks get to execute stage, jump all tasks to loop thread notp pin 'toggle task x pin mov delay,count 'get task x delay djnz delay,#$ 'count down delay call #subroutine 'Jump to subroutine ' saves return address in ' thread local subroutine_ret. jmp #loop 'loop (count + 3 clocks) subroutine 'bla bla add count, #1 'tweaks thread local count. 'bla bla jmp subroutine_ret 'Return from subroutine through thread local address.
Here's a program I just tested on my new DE2 board.
Running 4 tasks calling the same routine.
Both work fine. Hooray!
con {{ Chip answered my question on SETMAP as follows: SETMAP #%1_010_010 Those two 3-bit sub-fields can be %000..%111 to represent 1/2/4/8/16/32/64/128 sets or registers. For task-switched remapping, it makes sense to have 4 sets (bits 5..3 = %010), but the set size (number of registers in a set, selected by bits 2..0) can be any power of 2 you'd like, up to 128 }} _clkmode = xinput _xinfreq = 60_000_000 led0 = 32 led2 = 34 led4 = 36 led6 = 38 dat orgh $e80 org pin long led0 time long 0 delay long 60_000_000 flash_ret long 0 long led2 long 0 long 30_000_000 flash_ret2 long 0 long led4 long 0 long 15_000_000 flash_ret3 long 0 long led6 long 0 long 7_500_000 flash_ret4 long 0 mov flash_ret,jmp_inst mov flash_ret2,jmp_inst mov flash_ret3,jmp_inst mov flash_ret4,jmp_inst setmap #%1_010_010 'remap registers by task, 4 sets, 4 registers each jmptask #task1,#%0010 jmptask #task2,#%0100 jmptask #task3,#%1000 settask #%%3210 task0 jmpret flash_ret,#flash jmp #task0 task1 jmpret flash_ret,#flash jmp #task1 task2 jmpret flash_ret,#flash jmp #task2 task3 jmpret flash_ret,#flash jmp #task3 flash getcnt time add time,delay passcnt time notp pin jmp flash_ret jmp_inst jmp #0
I'm getting tired but isn't:
jmpret flash_ret,#flash
the same as:call #flash
And what is that "jmp_inst" business about?
Not sure I see why we need it.
"mov flash_ret, jmp_inst" puts a jump to zero instruction at the return address location.
The subroutine calls will over write 9 bits of that with their return addresses.
But when we do the return with "jmp flash_ret" we are not using the JMP opcode we put in their (or are we?)
Am I missing some points here?
Flash_ret(,2,3,4) holds the task-specific return address in the lower 9 bits. I don't think it also needs to hold the jump instruction. If you get rid of jmp_inst and the 4 lines that reference it, the code still works, right?
Thanks for a demonstration of some cool techniques!
mov flash_ret,jmp_inst mov flash_ret2,jmp_inst mov flash_ret3,jmp_inst mov flash_ret4,jmp_inst
-Phil
That "call #flash" is an assmbler pseudo op that should build the exact same "jmpret flash_ret,#flash" instruction bit for bit. At least it does for P1.
And as you say we don't need the "jmp" opcode bits in those flash_retx slots.
I just found out my DE0 nano has been trodden on, don't ask, and many of it's header pins are severely bent. Hope they don't fall off when I straighten them out and it still works.
Our posts crossed over. I may be right but has anyone tested it yet? As I said above I'm having some technical hitches here....
You are correct!
It was very late for me and looking at it this morning ...huh?
For some reason I thought the JMP instruction needed to be there?
Here's the corrected code that has been tested.
Cheers
Brian
con _clkmode = xinput _xinfreq = 60_000_000 led0 = 32 led2 = 34 led4 = 36 led6 = 38 dat orgh $e80 org pin long led0 time long 0 delay long 60_000_000 flash_ret long 0 long led2 long 0 long 30_000_000 long 0 long led4 long 0 long 15_000_000 long 0 long led6 long 0 long 7_500_000 long 0 setmap #%1_010_010 'remap registers by task, 4 sets, 4 registers each jmptask #task1,#%0010 jmptask #task2,#%0100 jmptask #task3,#%1000 settask #%%3210 task0 call #flash jmp #task0 task1 call #flash jmp #task1 task2 call #flash jmp #task2 task3 call #flash jmp #task3 flash getcnt time add time,delay passcnt time notp pin jmp flash_ret
Fantastic, thanks for that confirmation.
That is so clean and elegant.
This problem turned out to be much easier than I was imagining yesterday.