Recursive/Reentrant code in COG?

Heater. · 2013-09-28 23:19

We are all familaiar with the fact that the Propeller COG does not have a call stack. Rather it uses an old computer technique of writing the return address for subroutine calls to some known location where it can be picked up at the end of the subroutine (JMPRET).

This is not normally an issue for COG code as it's not often one want's to put a recursive algorithm in COG. (Has any one ever done so? Even the recursive Fibo benchmark?).

However we now have the possibility of running multiple threads in a COG. That means that different threads may want to call the same subroutine. Rather than having multiple copies of the same code in COG we migh need to make it a subroutine(s) to fit things in the space available.

What happens then?

Using normal COG CALL/RET won't work. Thread A can call a subroutine, which places the return address for thread A at the end of the routine (normally). Thread B migh call the same routine before thread A has returned. Ooops we have just now overwritten A's return address. Thread A could eaily end up running thread B's code or some such horror.

What we need is a stack based call return mechanism for thread safe calls.

Anyone attempted this yet?

ozpropdev · 2013-09-28 23:36

Hi Heater

That's a tricky one!

Phil Pilgrim (PhiPi) · 2013-09-28 23:42

If there were a block of RAM with a common address range, but with different physcial memory for each thread, the return addresses could be stored there.

-Phil

Ariba · 2013-09-29 00:06

There is a stack based call and return mechanism in Prop 2 It even saves the C and Z flag.
See CALLA, CALLB, RETA, RETB

Andy

ozpropdev · 2013-09-29 00:27

Hi Andy

Using CALLA,RETA doesn't get around the multi thread issue of who called it first?

For example

Thread 1 calls routine CALLA #XYZ.
Thread 2 then does the same, CALLA #XYZ
When thread 1 completes call it hits a RETA which returns to thread 2's PC not thread 1's.

I think I've got that right?

Cheers
Brian

Heater. · 2013-09-29 00:36

I haven't really looked into this but isn't there some weird memory tricks going on with threads such that, say, four threads all running the same code all see different memory for some variables? Sure I saw an example like that float by once.

If so, all that is needed is for a call, which is actually a JMPRET to place the return address in some place unique to the thread where it can be picked up by another JMPRET at return time. Rather than placing the return address in the RETURN instruction as usual.

This would not allow recursive calls within a thread but would allow reentrant calls between threads as the return addresses get stored in different places.

No stack required.

Edit: Oops, sorry Phil that's what you said:

If there were a block of RAM with a common address range, but with different physcial memory for each thread, the return addresses could be stored there.

Ariba · 2013-09-29 00:41

Brian
Yes you are right.
I think there are anyway other problems beside the return address. Also all the registers you use in the subroutine can get written by one thread and read by another. You will also need local registers on the stack and a stackpointer per thread.

But maybe Prop 2's register remapping can help here? I don't really understand that yet...

Andy

ozpropdev · 2013-09-29 00:55

Hi Guys

I think using SETMAP could get around the register overwrite issue and the return address storage issues.
The problem is the mechanism to retrieve the return address related to the current thread.

I'm dreaming here but if we had SPC and SPD as well as SPA and SB, we could have a CALLC and CALLD.
This in addition to a RETX for example that pops a return address from the stack based on thread number. Thread 0 pops SPA,1 pops SPB etc.
That would be nice. Just dreaming....(Sorry Chip, Ken, Beau and others..)

Cheers
Brian

Heater. · 2013-09-29 00:56

Ah, yes, those pesky local variables.

In the back of my mind was the idea that every thread is working with it's own variable, via "register remapping" or what ever it is called. It was only the return addresses that woke me up in the middle of the night nagging that they are in the shared instruction space normally.

Now, where is that example of threads using common code and separate variables?

Heater. · 2013-09-29 01:04

ospropdev,

I think using SETMAP could get around the register overwrite issue and the return address storage issues.

That sounds like the thing I am looking for to get thread local variables. Which could include space for return addresses.

The problem is the mechanism to retrieve the return address related to the current thread.

This should be simple. I have to look again but in Full Duplex Serial don't we have two coroutines calling each other with JMPRET, the return address being stored in some variable somewhere rather than in the normal return instruction location?

Seems to me the same technique of making calls and returns with JMPRET could be used by a thread such that it's return address is stored in it's thread local storage. Of course every subroutine you call would have to have it's own return address storage location defined.

We should not need to request any new instructions in the PII.

ozpropdev · 2013-09-29 01:08

Register remapping in the unofficial docs

dat
       org                        'longs are like nop's, get skipped

pin    long        0                'task 0 data
count  long        1
delay  long        0
extra  long        0

      long        1                'task 1 data
      long        5
      long        0
      long        0

      long        2                'task 2 data
      long        13
      long        0
      long        0

      long        3                'task 3 data
      long        29
      long        0
      long        0

      setmap        #%1_010_010        'remap registers by task, 4 sets, 4 registers each
      settask        #%%3210                'enable all tasks
      jmptask        #loop,#%1111        'before any newly-started tasks get to execute stage, jump all tasks to loop

loop   notp        pin                'toggle task x pin
      mov        delay,count        'get task x delay
      djnz        delay,#$        'count down delay
      jmp        #loop                'loop (count + 3 clocks)

Heater. · 2013-09-29 02:13

ozpropdev,

Bingo, that's just what I was thinking of. Thanks.

So, can we do reentrant, thread safe, subroutines as simply as this:
1) Make the subroutine call with CALL as normal.
2) Ensure the subroutine_ret lable and return instruction are in the threads data area.
3) To return from the subroutine just jump there.

Like so:

pin             long        0     'task 0 data
count           long        1
delay           long        0
subroutine_ret  ret               ' Return address for thread safe calls of subroutine.

       long        1              'task 1 data
       long        5
       long        0
       long        0

       long        2              'task 2 data
       long        13
       long        0
       long        0

       long        3              'task 3 data
       long        29
       long        0
       long        0

       setmap      #%1_010_010     'remap registers by task, 4 sets, 4 registers each
       settask     #%%3210         'enable all tasks
       jmptask     #thread, #%1111 'before any newly-started tasks get to execute stage, jump all tasks to loop

thread notp        pin             'toggle task x pin
       mov         delay,count     'get task x delay
       djnz        delay,#$        'count down delay

       call        #subroutine     'Jump to subroutine
                                   '  saves return address in
                                   '  thread local subroutine_ret.

       jmp         #loop           'loop (count + 3 clocks)

subroutine
       'bla bla
       add count,  #1              'tweaks thread local count.
       'bla bla
       jmp         #subroutine_ret 'Returns via thread local return address.

       'jmp        subroutine_ret   'Or can we do like this?

Anyone got their FPGA set up to test this?

ozpropdev · 2013-09-29 02:26

Hi Heater

Wouldn't the RET instruction cause a problem when the cog is launched?
My understanding is that the variables are treated as NOP's and are ignored.
This relies on the values not being decoded as instructions?

Edit: Only 18 bit values can be used so there not decoded.

Brian

Heater. · 2013-09-29 03:47

Brian,

OK can't we just reserve a space with zero in it to store return addresses and and just do a JMP through that location (no #) for our returns? Like so:

pin             long        0     'task 0 data
count           long        1
delay           long        0
subroutine_ret  long        0     ' Space for return address for thread safe calls of subroutine.

       long        1              'task 1 data
       long        5
       long        0
       long        0

       long        2              'task 2 data
       long        13
       long        0
       long        0

       long        3              'task 3 data
       long        29
       long        0
       long        0

       setmap      #%1_010_010     'remap registers by task, 4 sets, 4 registers each
       settask     #%%3210         'enable all tasks
       jmptask     #thread, #%1111 'before any newly-started tasks get to execute stage, jump all tasks to loop

thread notp        pin             'toggle task x pin
       mov         delay,count     'get task x delay
       djnz        delay,#$        'count down delay

       call        #subroutine     'Jump to subroutine
                                   '  saves return address in
                                   '  thread local subroutine_ret.

       jmp         #loop           'loop (count + 3 clocks)

subroutine
       'bla bla
       add count,  #1              'tweaks thread local count.
       'bla bla
       jmp        subroutine_ret   'Return from subroutine through thread local address.

ozpropdev · 2013-09-29 04:51

Ok, I think we've got it now.
Here's a program I just tested on my new DE2 board.
Running 4 tasks calling the same routine.
Both work fine. Hooray!

con

{{
Chip answered my question on SETMAP as follows:


	SETMAP #%1_010_010

Those two 3-bit sub-fields can be %000..%111 to represent 1/2/4/8/16/32/64/128 sets
or registers. For task-switched remapping, it makes sense to have
4 sets (bits 5..3 = %010), but the set size (number of registers in a set, selected
by bits 2..0) can be any power of 2 you'd like, up to 128
}}


	_clkmode = xinput
	_xinfreq = 60_000_000


	led0 = 32
	led2 = 34
	led4 = 36
	led6 = 38

dat
		orgh	$e80

		org

pin		long	led0
time		long	0
delay		long	60_000_000
flash_ret	long	0 

		long	led2
		long	0
		long	30_000_000
flash_ret2	long	0

		long	led4
		long	0
		long	15_000_000
flash_ret3	long	0

		long	led6
		long	0
		long	7_500_000
flash_ret4	long	0


		mov	flash_ret,jmp_inst
		mov	flash_ret2,jmp_inst
		mov	flash_ret3,jmp_inst
		mov	flash_ret4,jmp_inst

		setmap	#%1_010_010        'remap registers by task, 4 sets, 4 registers each
		jmptask	#task1,#%0010
		jmptask	#task2,#%0100
		jmptask	#task3,#%1000
		settask	#%%3210


task0		jmpret	flash_ret,#flash
		jmp	#task0

task1		jmpret	flash_ret,#flash
		jmp	#task1


task2		jmpret	flash_ret,#flash
		jmp	#task2


task3		jmpret	flash_ret,#flash
		jmp	#task3


flash		getcnt	time
		add	time,delay
		passcnt	time
		notp	pin
		jmp	flash_ret

jmp_inst	jmp	#0

Heater. · 2013-09-29 06:02

Cool!

I'm getting tired but isn't:

        jmpret  flash_ret,#flash

the same as:

       call  #flash

And what is that "jmp_inst" business about?

Not sure I see why we need it.

"mov flash_ret, jmp_inst" puts a jump to zero instruction at the return address location.
The subroutine calls will over write 9 bits of that with their return addresses.
But when we do the return with "jmp flash_ret" we are not using the JMP opcode we put in their (or are we?)

Am I missing some points here?

timx8 · 2013-09-29 13:17

I'm also curious if the call instruction would do the same thing. If "call #flash" is smart enough to follow the register remapping, to know where to write the return address, then it should be the same.

Flash_ret(,2,3,4) holds the task-specific return address in the lower 9 bits. I don't think it also needs to hold the jump instruction. If you get rid of jmp_inst and the 4 lines that reference it, the code still works, right?

Thanks for a demonstration of some cool techniques!

Phil Pilgrim (PhiPi) · 2013-09-29 13:26

Heater is right. The following is unnecessary:

		mov	flash_ret,jmp_inst
		mov	flash_ret2,jmp_inst
		mov	flash_ret3,jmp_inst
		mov	flash_ret4,jmp_inst

-Phil

Heater. · 2013-09-29 13:29

Yes. We need to know!

That "call #flash" is an assmbler pseudo op that should build the exact same "jmpret flash_ret,#flash" instruction bit for bit. At least it does for P1.

And as you say we don't need the "jmp" opcode bits in those flash_retx slots.

I just found out my DE0 nano has been trodden on, don't ask, and many of it's header pins are severely bent. Hope they don't fall off when I straighten them out and it still works.

Heater. · 2013-09-29 13:31

Thanks Phil,

Our posts crossed over. I may be right but has anyone tested it yet? As I said above I'm having some technical hitches here....

ozpropdev · 2013-09-29 15:03

Hi Guys

You are correct!

It was very late for me and looking at it this morning ...huh?
For some reason I thought the JMP instruction needed to be there?
Here's the corrected code that has been tested.

Cheers
Brian

con

	_clkmode = xinput
	_xinfreq = 60_000_000


	led0 = 32
	led2 = 34
	led4 = 36
	led6 = 38

dat
		orgh	$e80

		org

pin		long	led0
time		long	0
delay		long	60_000_000
flash_ret	long	0 

		long	led2
		long	0
		long	30_000_000
		long	0

		long	led4
		long	0
		long	15_000_000
		long	0

		long	led6
		long	0
		long	7_500_000
		long	0


		setmap	#%1_010_010        'remap registers by task, 4 sets, 4 registers each
		jmptask	#task1,#%0010
		jmptask	#task2,#%0100
		jmptask	#task3,#%1000
		settask	#%%3210


task0		call	#flash
		jmp	#task0

task1		call	#flash
		jmp	#task1


task2		call	#flash
		jmp	#task2


task3		call	#flash
		jmp	#task3


flash		getcnt	time
		add	time,delay
		passcnt	time
		notp	pin
		jmp	flash_ret

Heater. · 2013-09-29 15:27

ozpropdev,

Fantastic, thanks for that confirmation.

That is so clean and elegant.

This problem turned out to be much easier than I was imagining yesterday.

Recursive/Reentrant code in COG?

Comments