Example Software

78rpm · 2015-10-28 14:36

I know some of you are developing software to test, support, explore, or demonstrate the P2, but I am unsure of its extent or which areas it covers. To avoid duplicated effort, I would like to lay claim to a pre-emptive, multithreaded, multicog (code-wise) demonstration, provided nobody is already undertaking such a task. I would like to start on this once I have finished the rdlong / wrlong, etc. testing, which is progressing nicely. I am also happy to maintain a list of these packages in a forum post, even this one, linking to your posts from this first comment. This would help direct people to the appropriate place for a given field of interest.

pjv · 2015-10-28 16:44

Hi 78rpm;

I fully laud your intent in this direction as I have the same interest.

As you may (or not) know, I have over the years developed such a scheme for the P1, allowing some 8 simultaneous threads to run in each cog. In fact, for trivial "led blinking" tasks, I have had over 100 simultaneous threads running in one chip. A task switch takes about 1 to 2 uSec, depending on how many threads a cog has.... 8 is a practical maximum.

All works very well, and with the enhanced capabilities of the P2, I expect to regurgitate and enhance that software to make considerable speed and other performance improvements.

At this point I have not yet embraced developing with the P2 while everything is still in such flux, but I'm getting closer. Perhaps my time will come when all the dust has settled on the instruction set, and there is stable documentation and some understanding of the SmartPin capabilities.

Could be soon, I hope. In the meantime I will follow your development with eagerness, and see how closely your design aligns with my P1 experience. I'm sure I will learn a bunch.

Good on you!

Cheers,

Peter (pjv)

78rpm · 2015-10-29 02:39

Hello Peter,

Although I am a recent poster, I have been reading about the development of the P2 off and on over the past 6 or 7 years, and dipping into the other forums. It is possible I have seen your posts over the years, or they may have appeared during a period when my reading here was light, for one reason or another. Do you recall the thread title or approx month / year?

I think the P2 is easily capable of re-entrant code, especially in the hub-exec mode. Execution and a stack in hub ram may be slower, but it allows for function local variables on the stack. I already have a simple piece of code which manages to look elegant. Fortunately PNut permits immediate constants to define stack offsets:

' passing arguments on the stack test

jmp #over_my_data

CON
  STK_LOCAL_1	   	=  0
  STK_TOS_ON_ENTRY	= -1		'ptrb here
  STK_RET_ADDR     	= -2
  STK_ARG_3_DIV_RESULT  = -3	
  STK_ARG_2_DIVISOR     = -4
  STK_ARG_1_NUMERATOR	= -5
  STK_RESULT            = -6
DAT

pass_on_stack
		pusha	ptrb
		pusha	ptra
		popa	ptrb			' now addressing stack
	' make space for locals by adding to ptra, 
	' accessed by rdlong x,ptrb[ 1 ] etc
		pusha	r1
		pusha	r2

		' divide takes values in registers, so we must load them
		' instead of doing a pusha scratch
		rdlong	scratch, ptrb[ STK_ARG_1_NUMERATOR ]
		mov	r1, scratch
		rdlong	scratch, ptrb[ STK_ARG_2_DIVISOR ]
		mov	r2, scratch

		calla	#divide32b32_inCog		' returns Q:R in r2:r1, r1 / r2

		pusha	r1			' Q
		rdlong	scratch, ptrb[ STK_ARG_3_DIV_RESULT ]		
		cmp	r1, scratch  wc,  wz
	if_e	jmp	#.L10

		' error - we may wish to return an error code or 
		' null value here
		wrlong	scratch, ptrb[ STK_RESULT ]

		loc 	adrb, #pass_on_stack_div_error
		calla 	#send_msg
		pusha	ptrb
		popa	adrb			' address of result
		add	adrb, ##STK_ARG_3_DIV_RESULT * 4 ' longs
		calla	#send_dec
		loc 	adrb, #newline
		calla 	#send_msg
		jmp	#.L20
.L10		
		' result is correct
		wrlong	scratch, ptrb[ STK_RESULT ]
.L20
		popa	scratch			' discard r1
		popa	r2
		popa	r1
		popa	ptrb
		reta
' pass_on_stack end



stack_args_result byte	"Result from passing arguments on the stack = ", 0
pass_on_stack_div_error	byte	"Error dividing values passed on stack", 0

		alignl

over_my_data
		pusha	##0			' return value
		pusha	##$76543210		' arg 1
		pusha	##$1cedcafe		' arg 2
		pusha	##$76543210/$1cedcafe	' arg 3
		calla	#pass_on_stack
		sub	ptra, #3 * 4		' 3 args of longs
		popa	r8			' return value
		
		loc 	adrb, #stack_args_result
		calla 	#send_msg
		loc	adrb,#r8
		calla	#send_dec
		loc 	adrb, #newline
		calla 	#send_msg

Local constants would be handy, or an ability to undefine (can unconstant? (nuke) ) them, but I think I will manage even with abbreviated names.
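
For readers without P2 hardware, the frame layout behind those STK_ constants can be sketched in Python. This is only a model, not PASM: it assumes the stack grows upward one long per push, that ptrb ends up pointing at the first free long after the prologue, and it leaves out the saved r1 / r2 registers. The function name call_pass_on_stack and the placeholder strings are made up for illustration.

```python
# Model of the stack frame used by pass_on_stack (illustrative only).
# Offsets are in longs relative to ptrb, as in the CON block.
STK_LOCAL_1 = 0
STK_TOS_ON_ENTRY = -1
STK_RET_ADDR = -2
STK_ARG_3_DIV_RESULT = -3
STK_ARG_2_DIVISOR = -4
STK_ARG_1_NUMERATOR = -5
STK_RESULT = -6


def call_pass_on_stack(numerator, divisor, expected):
    """Return (result slot value, whether the quotient matched arg 3)."""
    stack = []  # one entry per long; len(stack) plays the role of ptra

    # Caller side: result slot first, then the three arguments.
    for value in (0, numerator, divisor, expected):
        stack.append(value)
    stack.append("return address")  # stands in for what calla pushes

    # Callee prologue: save the old ptrb, then take the frame base.
    stack.append("saved ptrb")
    ptrb = len(stack)  # assumption: ptrb = first free long

    def slot(offset):
        return ptrb + offset

    quotient = stack[slot(STK_ARG_1_NUMERATOR)] // stack[slot(STK_ARG_2_DIVISOR)]
    stack.append(quotient)  # first local lands at ptrb[STK_LOCAL_1]
    matched = stack[slot(STK_LOCAL_1)] == stack[slot(STK_ARG_3_DIV_RESULT)]
    stack[slot(STK_RESULT)] = quotient  # simplification: always store the quotient

    # Callee epilogue: drop locals and saved ptrb, then the return address.
    del stack[slot(STK_TOS_ON_ENTRY):]
    stack.pop()

    # Caller side: discard the three arguments, pop the result.
    del stack[-3:]
    return stack.pop(), matched


if __name__ == "__main__":
    print(call_pass_on_stack(0x76543210, 0x1CEDCAFE, 0x76543210 // 0x1CEDCAFE))
```

The point of the model is that every argument, the return address and the result slot are reachable at a fixed negative offset from one base pointer, which is what makes the routine re-entrant.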

The main problem with multithreading is the number of 'CPU' registers which must be saved as context. Writing your own demonstration, you get to choose how the registers are allocated. If the 'kernel' is included as part of, say, C, then you are locked in to however many registers the software and the porting team implement. That is what will take the time, though we have the luxury of much more RAM to play with.

For my demo I was thinking of using a very simple priority which can be boosted for more slices, and however many threads it takes to give a simple and good demo to show what is possible. Hopefully someone will build on it in the future. I also intend to use Parallax Serial Terminal with the demo, or possibly convert the output so that standard ANSI escape sequences can be generated for updating screen positions, for use with PuTTY or similar.
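
One way to read "a very simple priority which can be boosted for more slices" is a round-robin queue where a thread's priority is simply the number of consecutive time slices it receives per round. The Python sketch below models only that ordering, under that assumption; it does no real pre-emption or context saving, and run_schedule and the thread names are hypothetical.

```python
from collections import deque


def run_schedule(priorities, total_slices):
    """Return the order in which threads receive time slices.

    priorities maps a thread name to the number of consecutive slices
    it gets per round (1 = normal, higher = boosted).
    """
    if not priorities or any(n < 1 for n in priorities.values()):
        raise ValueError("need at least one thread, each with >= 1 slice")

    ready = deque(priorities.items())
    order = []
    while len(order) < total_slices:
        name, slices = ready.popleft()
        for _ in range(slices):
            if len(order) == total_slices:
                break
            order.append(name)  # one timer tick spent in this thread
        ready.append((name, slices))  # back of the queue for the next round
    return order


if __name__ == "__main__":
    # Thread "b" is boosted to two slices per round.
    print(run_schedule({"a": 1, "b": 2, "c": 1}, 8))
```

Boosting a thread then amounts to raising its slice count, which keeps the scheduler itself a plain rotation with no sorting.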

The other aspect of the mt demo is to show reentrancy in hub code, not just from 16 cogs, but perhaps from 16 threads.

I am using a DE0-Nano so I only have a two cog 32K no cordic version here, so I am not able to be too adventurous!

Hopefully this demo can then lead onto a Propeller 2 port of FreeRTOS, just so Parallax can say, "We really didn't need to do this, we were quite happy with sixteen processors running parallel threading natively! FreeRTOS was there looking lonely so we felt we ought to." I have looked at the device dependent sections some years ago for adding in x86 real mode FPU context save but I never got round to submitting it. I doubt that section of code will have changed much.

I think it could be useful for driving control panel displays where each instrument or small cluster has its own task, the outputs of which are handled either by another task or by a non-interruptible driver as necessary.

You have managed to implement mt on a Propeller 1; I now challenge you to write a cog emulator which resides in the cog. Why, you might ask? Well, so that you can interpret pasm bitcode. This would allow you to run SPIN as a task, meaning you may be able to have multiple SPIN interpreters running in a single cog, executing more code. Go on, you know you want to do it.

pjv · 2015-10-29 16:02

78rpm;

Just to be clear, the multi threading I have done with the P1 is co-operative, NOT pre-emptive as you are planning to do. The development is ongoing, adding Spin friendly features that cause "soft" interrupts in the cog codes.
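
For anyone following along, the co-operative model can be illustrated with Python generators: each task runs until it chooses to yield, and nothing interrupts it in between. This is a generic sketch of the idea only, not the P1 scheme described in this thread; blink and run_cooperative are made-up names.

```python
from collections import deque


def blink(name, times, log):
    """A trivial task: record a 'blink', then hand control back."""
    for n in range(times):
        log.append(f"{name}:{n}")
        yield  # voluntary task switch; the scheduler cannot force one


def run_cooperative(tasks):
    """Run tasks round-robin until every one has finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)  # resume the task until its next yield
        except StopIteration:
            continue  # task finished; do not requeue it
        ready.append(task)


if __name__ == "__main__":
    log = []
    run_cooperative([blink("led0", 2, log), blink("led1", 3, log)])
    print(log)
```

A pre-emptive design differs in that the switch is triggered from outside the task, for example by a timer, so the full register context has to be saved at an arbitrary point.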

I have not published much about this, just the occasional reference to what is possible when multi tasking rears its head. There is a REAL early version published years ago on the Parallax site where it was a winner in one of the contests.

I hope to release my completed work for non-commercial applications in the near future.

And no, I have no interest in making an emulator.... too many other things.

Cheers,

Peter (pjv)
