Did anyone read the new doc section on multitasking yet? Did it make complete sense?
Chip,
It made sense to me, which may or may not cause concern!
I like how you can play with the task register to control the amount of execution each task can get.
Now we'll see if I understand.
At least two ways to start multi-tasking:
1) Your code after COGINIT would start with instructions for task0 starting at $000; it would run through some code doing initialization/housekeeping until it was ready to start task1. When ready, it would execute a JMPTASK with the mask set for task1 and D set to the first instruction of task1's code, and then use SETTASK to give some portion of the execution slots to task1. Any combination of this works for up to 3 additional tasks.
2) As in your 4-task example, where task 0 ends up just doing the JMP, then the SETTASK, and then going about its business.
Wicked cool!!
I also like that the stack area is non-volatile across COGINITs. That seems ripe for adventure and exploitation (in a good sense)!
Yep! Looks clear enough despite the caveats. I am thinking of re-writing my SVGA generator in task form, perhaps as a VGA version (dot clock 25MHz), just as an exercise, i.e.
task00 - supervisor, vsync & idle thread
task01 - output active VGA
task02 - fetch / modulate VGA contents
task03 - porch/hsync/porch
The advantage I see is that you don't have to cycle-count to get the timing spot on (you can base it all off the system CNT, for instance).
As I said just an exercise, we will of course be using the video generator down the track.
The non-volatility of the stack makes LMM really easy:
load variables from stack into cogram
run part of code
save variables into stack
coginit this cog with next chunk
repeat
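The overlay scheme in the steps above can be sketched in Python. This is purely illustrative: the chunk functions and the dict standing in for the stack area are invented names, not any real P2 API; the point is only that variables survive each "COGINIT" while code is swapped out.

```python
# Hedged sketch of the LMM-via-stack overlay idea: run a large program
# as chunks swapped through cog RAM, with the non-volatile stack area
# carrying variables across COGINITs. All names are illustrative.

def run_overlays(chunks, stack):
    """Each chunk is a function taking and returning the variable set;
    calling the next chunk stands in for 'coginit this cog with next chunk'."""
    for chunk in chunks:
        variables = dict(stack)       # load variables from stack into cog RAM
        variables = chunk(variables)  # run this chunk of code
        stack.clear()                 # save variables back into the stack
        stack.update(variables)
    return stack

# two toy chunks sharing state across "reloads"
out = run_overlays([lambda v: {**v, "x": v["x"] + 1},
                    lambda v: {**v, "y": v["x"] * 2}],
                   {"x": 1})
print(out)   # -> {'x': 2, 'y': 4}
```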
When a cog loads, it is actually executing RDLONGC's. So, you could do the same in software and load at the same rate, without having to commit the whole cog memory and having the I/O's cancelled each time.
Since I've been waylaid and am just getting back into the P2 seat, I have copied the latest (I hope) documentation from Chip's post and put it into a Google document. Having to go back to the post, scroll-select the text, and muck about is a nuisance for a number of reasons, one of them being that all the formatting done previously is lost. Why can't we just have the Google document updated, so we keep the formatting and can introduce bookmarks and a table of contents, making this a live document?
Link to the editable version of the document
Link to the webpage version of the document, which is automatically updated when the master has been changed.
It just writes a new value into the TASK register which immediately affects which task is going to execute next.
Since I am assuming the new value isn't actually written until the last stage of the pipeline, does that mean that the other instructions in the pipeline (regardless of their associated task) will still be executed? Put another way, is the TASK register controlling the next instruction to load into the pipeline, rather than the next instruction to execute?
*sigh* I feel like I'm beating this to death, but could you add a bit more about how SETTASK affects the pipeline? In an earlier comment, you said:
Since I am assuming the new value isn't actually written until the last stage of the pipeline, does that mean that the other instructions in the pipeline (regardless of their associated task) will still be executed? Put another way, is the TASK register controlling the next instruction to load into the pipeline, rather than the next instruction to execute?
You're right. I need to do some pipeline explanation.
When I say an instruction executes, I mean it is in the last stage of the pipeline, where the action occurs. Prior stages, going backwards, read the operands, handle indirection and other things, and read the instruction:
0: read the instruction
1: handle indirection
2: read the operands
3: execute the instruction (compute result, affect Z, C, write result)
So, when SETTASK issues a new time slot pattern, there are already three instructions in the pipeline, so the 4th instruction after SETTASK will be from the task specified in the two LSB's of the SETTASK operand.
Anyone want to verify this? You could have the SETTASK's 2 LSB's give a time slot to a task which just sets pin 1 using 'SETP #1', then the instruction after the SETTASK could do a 'SETP #0'. See how many clock periods are between the two.
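As a sanity check on the timing described above, here is a small Python sketch. It is my reading of the mechanism, not something verified on hardware: I assume the TASK register is written in the execute stage, so fetches keep using the old slot pattern for three more clocks, making the 4th instruction after SETTASK the first one from the new pattern.

```python
# Minimal model (not cycle-accurate) of when SETTASK takes effect.
# Patterns are lists of 2-bit task numbers, rotated one per clock.

def fetch_tasks_after_settask(old_pattern, new_pattern, n):
    """Task number fetched on each of the n clocks after SETTASK is fetched.
    SETTASK executes (writes TASK) 3 clocks after its own fetch, so
    fetches 1..3 still rotate the old pattern; fetch 4 onward uses new."""
    tasks = []
    for i in range(1, n + 1):
        if i <= 3:                       # still in flight behind SETTASK
            tasks.append(old_pattern[i % len(old_pattern)])
        else:                            # TASK register now holds new pattern
            tasks.append(new_pattern[i % len(new_pattern)])
    return tasks

# Switching from an all-task-0 pattern to an all-task-1 pattern:
print(fetch_tasks_after_settask([0, 0, 0, 0], [1, 1, 1, 1], 6))
# -> [0, 0, 0, 1, 1, 1]  (the 4th instruction is the first from task 1)
```

This matches the SETP #1 / SETP #0 experiment suggested above: the pin-setting task should get its first slot four instruction times after SETTASK executes.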
I guess you own this topic so I'd like to request that you add a link to the p2load thread so people can find the loader when they need it. The thread itself is not active enough to stay close to the top of the posts.
Initially, $1F6 and $1F7 point only to themselves, so they are more or less regular RAM registers and do get loaded on cog start. When the hardware sees $1F6 or $1F7 for D or S, it substitutes the current pointer value for $1F6/$1F7. The only way you can address the shadow registers is by pointing INDA or INDB at them.
Could you clarify this a bit (so I can update the google doc)? When a cog is initialized, $1F6 and $1F7 are written to. But your comment indicates that the INDA and INDB pointers actually point to those addresses. What is the final state of INDA and INDB when the cog starts to run?
Also, it occurs to me that it would be possible to generate instructions like:
MOV ++INDA, INDA++
MOV INDA++, INDA--
etc.
Are these allowed or undefined?
Those crazy examples are all allowed. Just OR the 2-bit fields together to get the 2-bit post-effect.
At cog startup, INDA and INDB are configured as if these instructions had been executed:
FIXINDA $1F6,$1F6
FIXINDB $1F7,$1F7
So, reading or writing $1F6 or $1F7 has the intended effect. You just won't be able to have any conditional execution.
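Chip's description of the $1F6/$1F7 substitution can be modeled roughly like this. The Cog class and its methods are invented for illustration; only the addresses and the reset behavior come from the post.

```python
# Hedged sketch: when D or S equals $1F6/$1F7, the hardware substitutes
# the current INDA/INDB pointer value. The shadow registers at those
# addresses are then reachable only by pointing INDA/INDB back at them.

INDA_ADDR, INDB_ADDR = 0x1F6, 0x1F7

class Cog:
    def __init__(self):
        self.ram = [0] * 0x200
        # At cog start, INDA/INDB point at their own addresses
        # (as if FIXINDA $1F6,$1F6 / FIXINDB $1F7,$1F7 had run),
        # so $1F6/$1F7 behave like ordinary registers.
        self.inda, self.indb = INDA_ADDR, INDB_ADDR

    def resolve(self, addr):
        """Substitute the pointer when D or S is $1F6/$1F7."""
        if addr == INDA_ADDR:
            return self.inda
        if addr == INDB_ADDR:
            return self.indb
        return addr

    def read(self, addr):
        return self.ram[self.resolve(addr)]

    def write(self, addr, value):
        self.ram[self.resolve(addr)] = value

cog = Cog()
cog.write(0x1F6, 42)      # with the reset pointers, this hits $1F6 itself
print(cog.read(0x1F6))    # -> 42
cog.inda = 0x010          # repoint INDA (as a FIXINDA would)
cog.write(0x1F6, 7)       # now lands in register $010
print(cog.read(0x010))    # -> 7
```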
At cog startup, INDA and INDB are configured as if these instructions had been executed:
FIXINDA $1F6,$1F6
FIXINDB $1F7,$1F7
So this means that COGINIT only loads $1F6 instructions (effectively). I'm assuming that, internally, the cog load is performing 126 iterations of 4 RDLONGCs with the last two longs just being throw-away operations.
No. It loads $1F8 instructions. If you ever actually execute $1F6 or $1F7, it will get the data from those absolute registers. Only D and S have indirection. The instruction doesn't.
Now, provided we don't actually use any I/O, we can also put instructions into PINA, PINB, PINC and PIND (by doing MOV instructions, of course, since they are all 0's at launch) and get an extra 6 instructions in total
I experiment a bit with LMM on Prop2, and have a few questions:
1) does the single cog DE0-Nano version simulate the Hub timing for 8 cogs, or can the single cog access the hub on every cycle in this version?
2) what is the minimal number of cycles between modifying an instruction and executing it? My observation so far is that 2 instructions in between are enough.
Here is my first attempt at LMM; it executes at 1/5 the clock rate (12MHz for a 60MHz clock). Theoretically, every fourth RDLONGC reads a new quad and the timing then must be a multiple of 8 clocks, but I don't see this behavior.
1) Regardless of how few cogs an FPGA implementation has, it always cycles the hub as if there were eight cogs. So, the DE0-Nano board gives its single cog every 8th hub cycle, just as a single cog would get in a complete chip.
2) You are right about two instructions needing to be between an instruction modifier and the instruction getting modified. I was just writing some pipeline explanation about this:
PIPELINE
--------
Each cog has a 4-stage pipeline which all instructions progress through, in order to execute:
1st stage - Read instruction
2nd stage - Determine indirect/remapped D and S addresses, update INDA/INDB
3rd stage - Read D and S
4th stage - Execute instruction, writing D, Z/C/PC, and any other results
On every clock cycle, the instruction in each stage advances to the next stage, unless the instruction
in the 4th stage is stalling the pipeline because it's waiting for something (i.e. WRBYTE waits for
the hub).
To keep D and S data current within the pipeline, the resultant D from the 4th stage is passed back to
the 3rd stage to substitute for any obsoleted data being read from the cog register RAM. The same is
done for instruction data in the 1st stage, but there is still a two-stage gap between when a register
is modified and it can be executed, at the earliest:
MOVD :inst,top9 'modify instruction
NOP '1...
NOP '2... at least two instructions in-between
:inst ADD A,B 'modified instruction executes
Tasks that execute in at least every 3rd time slot don't need to observe this 2-instruction rule because
their instructions will always be sufficiently spread apart in the pipeline.
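A quick arithmetic model of the hazard above, assuming (as described) that the modifier writes in stage 4 and that forwarding lets a fetch on the write clock already see the new instruction data:

```python
# Sketch of the self-modification hazard in the 4-stage pipeline.
# An instruction writes its result in stage 4; an instruction is read
# in stage 1. With forwarding into the fetch stage, a fetch on the
# same clock as the write already sees the new value.

PIPELINE_DEPTH = 4

def gap_is_safe(gap):
    """gap = instructions between the modifier and the modified target.
    The modifier (fetched at clock 0) writes in stage 4, at clock 3.
    The target is fetched at clock gap + 1, which must be no earlier
    than the write clock for the new encoding to be seen."""
    fetch_clock = gap + 1
    write_clock = PIPELINE_DEPTH - 1
    return fetch_clock >= write_clock

print([(g, gap_is_safe(g)) for g in range(4)])
# -> [(0, False), (1, False), (2, True), (3, True)]
```

This reproduces the two-instruction rule, and also why tasks running in at most every 3rd time slot are exempt: their consecutive instructions are already fetched at least three clocks apart.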
Your LMM code looks sensible to me. I would think that every 20 clocks you should suffer a 4-clock delay to get up to a multiple of 8 clocks (24). When RDLONGC needs to do a RDQUAD, it will take 3-10 clocks. What kind of 4-loop period are you seeing?
I also expected a 24 cycle loop for 4 instructions. My LMM loop is 2 instructions:
notp #2
sub pc,#8
so I should get a toggle frequency at pin 2 of 2.5MHz (60MHz / 24), with a small asymmetry, because 3 instructions need 5 cycles each and one instruction needs 9 cycles. But my scope showed a higher frequency (more like 3MHz) and a symmetrical signal yesterday; I need to verify this again. At the moment I also have a second task running with a 25% time slot together with the LMM loop from the last post. With this second task I see some jitter in the LMM-generated frequency, but it still works. If I give the second task 50%, then I need to execute the JMPD one instruction later, which makes sense to me.
But after a while I also found the reason:
This 2-instruction LMM loop always stays in the quad cache, so there is no need to reload it, and rdlongc always takes only 1 cycle.
This changes if I make the LMM code longer, or if the addresses of the two-instruction loop cross a quad boundary.
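The figures in these posts can be put into a rough clock-count model. The 5 clocks per LMM instruction and the 8-clock hub period come from the posts; the round-up-to-hub-window rule is my assumption about how the stall behaves.

```python
# Back-of-envelope model: an LMM step costs 5 clocks when RDLONGC hits
# the quad cache, and a loop pass only stretches to the next hub window
# (every 8 clocks) when a new quad must actually be fetched from hub RAM.

HUB_PERIOD = 8

def loop_clocks(n_instructions, quads_fetched, clocks_per_inst=5):
    """Clocks per LMM loop pass: base cost, rounded up to a multiple of
    HUB_PERIOD if the pass has to fetch a quad from the hub."""
    base = n_instructions * clocks_per_inst
    if quads_fetched == 0:
        return base                      # loop lives in the quad cache
    return -(-base // HUB_PERIOD) * HUB_PERIOD   # ceiling to hub window

# Chip's estimate: 4 instructions, one quad reload per pass
print(loop_clocks(4, 1))   # -> 24
# Andy's 2-instruction loop inside one quad: no reload
print(loop_clocks(2, 0))   # -> 10
```

With the cached 2-instruction loop, a full toggle period is 2 passes of 10 clocks, i.e. 60MHz / 20 = 3MHz, matching the symmetrical ~3MHz scope reading rather than the expected 2.5MHz.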
Here's a little program that kicks off four tasks running the same code, but with different variable sets.
Register remapping is set up to remap 4 sets of 4 registers, according to the task executing. For tasks 0..3, hard addresses 0..3 remap to 0..3, 4..7, 8..11, or 12..15.
dat
org 'longs are like nop's, get skipped
pin long 0 'task 0 data
count long 1
delay long 0
extra long 0
long 1 'task 1 data
long 5
long 0
long 0
long 2 'task 2 data
long 13
long 0
long 0
long 3 'task 3 data
long 29
long 0
long 0
setmap #%1_010_010 'remap registers by task, 4 sets, 4 registers each
settask #%%3210 'enable all tasks
jmptask #loop,#%1111 'before any newly-started tasks get to execute stage, jump all tasks to loop
loop notp pin 'toggle task x pin
mov delay,count 'get task x delay
djnz delay,#$ 'count down delay
jmp #loop 'loop (count + 3 clocks)
Task 0 toggles pin 0 every 16 clocks.
Task 1 toggles pin 1 every 32 clocks.
Task 2 toggles pin 2 every 64 clocks.
Task 3 toggles pin 3 every 128 clocks.
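The four toggle rates follow from simple arithmetic: each task owns every 4th clock, and one loop pass is count + 3 instructions (notp, mov, jmp, plus the djnz executed count times). A quick Python check of the numbers:

```python
# Verifying the toggle periods in the four-task example above.
# Four tasks round-robin, so each task executes one instruction
# every 4 clocks; djnz on a delay of `count` runs count times.

SLOT_STRIDE = 4   # clocks between a task's consecutive instructions

def toggle_period(count):
    """Clocks between successive NOTPs for a task with the given count."""
    instructions_per_pass = count + 3   # notp + mov + jmp + count djnz passes
    return instructions_per_pass * SLOT_STRIDE

for task, count in enumerate([1, 5, 13, 29]):
    print(f"task {task}: toggles every {toggle_period(count)} clocks")
# -> 16, 32, 64, 128
```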
The P2 is certainly going to keep Koroneko busy
Thanks for the feedback, Guys.
Sounds like you've got an accurate handle on the multitasking.
I want to continue with the COSMACog progect using the P1, but also want to have a P2 version when the P2 hits the streets.
Depending on how things look around Jan/Feb I may spring for a DE2-115.
C.W.
Here is a link to it: http://forums.parallax.com/showthread.php?144384-p2load-A-Loader-for-the-Propeller-II
Thanks!
David
It might make more sense to implement the COSMAC Elf in an FPGA:
http://whats.all.this.brouhaha.com/category/computing/hardware/fpga/
I may do that as well someday for giggles, but for now the prop1 and then prop2 are fine.
The goal is to have a very low cost emulator with a low barrier to entry.
C.W.
Now provided we dont actually use any I/O, we can also put instructions into PINA, PINB, PINC & PIND (by doing mov instructions of course since they are all 0's at launch) and get an extra 6 instructions in total
That is enough for my zero footprint debugger.
Good to know that my findings on the FPGA version are also applicable on the real Prop2.
I'm going to document register remapping next.