Did anyone read the new doc section on multitasking yet? Did it make complete sense?
Chip,
It made sense to me, which may or may not cause concern!
I like how you can play with the task register to control the amount of execution each task can get.
Now we'll see if I understand.
At least two ways to start multi-tasking:
1) Your code after COGINIT would start with instructions for task0 starting at $000; it would run through some code doing initialization/housekeeping until it was ready to start task1. When ready, it would execute a JMPTASK with the mask set for task1 and D set to the first instruction of task1's code, and then use SETTASK to give some portion of the execution slots to task1. Any combination of this works for up to 3 additional tasks.
2) As in your 4-task example, where task 0 ends up just doing the JMP, then the SETTASK, and then going about its business.
Wicked cool!!
I also like that the stack area is non-volatile across COGINITs. That seems ripe for adventure and exploitation (in a good sense)!
Yep! Looks clear enough despite the caveats. I am thinking of re-writing my SVGA generator in task form, perhaps as a VGA version (dot clock 25MHz), just as an exercise, i.e.
task00 - supervisor, vsync & idle thread
task01 - output active VGA
task02 - fetch / modulate VGA contents
task03 - porch/hsync/porch
The advantage I see is that you don't have to cycle-count to get the timing spot on (you can base it all off the system CNT, for instance).
As I said just an exercise, we will of course be using the video generator down the track.
The non-volatility of the stack makes LMM really easy:
load variables from stack into cogram
run part of code
save variables into stack
coginit this cog with next chunk
repeat
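The overlay scheme in the steps above can be sketched in Python. This is purely illustrative: the chunk functions and the dict standing in for the stack area are invented names, not any real P2 API; the point is only that variables survive each "COGINIT" while code is swapped out.

```python
# Hedged sketch of the LMM-via-stack overlay idea: run a large program
# as chunks swapped through cog RAM, with the non-volatile stack area
# carrying variables across COGINITs. All names are illustrative.

def run_overlays(chunks, stack):
    """Each chunk is a function taking and returning the variable set;
    calling the next chunk stands in for 'coginit this cog with next chunk'."""
    for chunk in chunks:
        variables = dict(stack)       # load variables from stack into cog RAM
        variables = chunk(variables)  # run this chunk of code
        stack.clear()                 # save variables back into the stack
        stack.update(variables)
    return stack

# two toy chunks sharing state across "reloads"
out = run_overlays([lambda v: {**v, "x": v["x"] + 1},
                    lambda v: {**v, "y": v["x"] * 2}],
                   {"x": 1})
print(out)   # -> {'x': 2, 'y': 4}
```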
When a cog loads, it is actually executing RDLONGC's. So, you could do the same in software and load at the same rate, without having to commit the whole cog memory and having the I/O's cancelled each time.
Since I've been waylaid and am just getting back into the P2 seat, I have copied the latest (I hope) documentation from Chip's post and put it into a Google document. Having to go back to the post, scroll-select the text, and muck about is a nuisance for a number of reasons, one of them being that all the formatting done previously is lost. Why can't we just have the Google document updated, so we keep the formatting and can introduce bookmarks and a table of contents, making this a live document?
Link to the editable version of the document
Link to the webpage version of the document, which is automatically updated when the master has been changed.
It just writes a new value into the TASK register which immediately affects which task is going to execute next.
Since I am assuming the new value isn't actually written until the last stage of the pipeline, does that mean that the other instructions in the pipeline (regardless of their associated task) will still be executed? Put another way, is the TASK register controlling the next instruction to load into the pipeline, rather than the next instruction to execute?
*sigh* I feel like I'm beating this to death, but could you add a bit more about how SETTASK affects the pipeline? In an earlier comment, you said:
Since I am assuming the new value isn't actually written until the last stage of the pipeline, does that mean that the other instructions in the pipeline (regardless of their associated task) will still be executed? Put another way, is the TASK register controlling the next instruction to load into the pipeline, rather than the next instruction to execute?
You're right. I need to do some pipeline explanation.
When I say an instruction executes, I mean it is in the last stage of the pipeline, where the action occurs. Prior stages, going backwards, read the operands, handle indirection and other things, and read the instruction:
0: read the instruction
1: handle indirection
2: read the operands
3: execute the instruction (compute result, affect Z, C, write result)
So, when SETTASK issues a new time slot pattern, there are already three instructions in the pipeline, so the 4th instruction after SETTASK will be from the task specified in the two LSB's of the SETTASK operand.
Anyone want to verify this? You could have the SETTASK's 2 LSB's give a time slot to a task which just sets pin 1 using 'SETP #1', then the instruction after the SETTASK could do a 'SETP #0'. See how many clock periods are between the two.
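As a sanity check on the timing described above, here is a small Python sketch. It is my reading of the mechanism, not something verified on hardware: I assume the TASK register is written in the execute stage, so fetches keep using the old slot pattern for three more clocks, making the 4th instruction after SETTASK the first one from the new pattern.

```python
# Minimal model (not cycle-accurate) of when SETTASK takes effect.
# Patterns are lists of 2-bit task numbers, rotated one per clock.

def fetch_tasks_after_settask(old_pattern, new_pattern, n):
    """Task number fetched on each of the n clocks after SETTASK is fetched.
    SETTASK executes (writes TASK) 3 clocks after its own fetch, so
    fetches 1..3 still rotate the old pattern; fetch 4 onward uses new."""
    tasks = []
    for i in range(1, n + 1):
        if i <= 3:                       # still in flight behind SETTASK
            tasks.append(old_pattern[i % len(old_pattern)])
        else:                            # TASK register now holds new pattern
            tasks.append(new_pattern[i % len(new_pattern)])
    return tasks

# Switching from an all-task-0 pattern to an all-task-1 pattern:
print(fetch_tasks_after_settask([0, 0, 0, 0], [1, 1, 1, 1], 6))
# -> [0, 0, 0, 1, 1, 1]  (the 4th instruction is the first from task 1)
```

This matches the SETP #1 / SETP #0 experiment suggested above: the pin-setting task should get its first slot four instruction times after SETTASK executes.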
I guess you own this topic so I'd like to request that you add a link to the p2load thread so people can find the loader when they need it. The thread itself is not active enough to stay close to the top of the posts.
Initially, $1F6 and $1F7 point only to themselves, so they are more or less regular RAM registers and do get loaded on cog start. When the hardware sees $1F6 or $1F7 for D or S, it substitutes the current pointer value for $1F6/$1F7. The only way you can address the shadow registers is by pointing INDA or INDB at them.
Could you clarify this a bit (so I can update the google doc)? When a cog is initialized, $1F6 and $1F7 are written to. But your comment indicates that the INDA and INDB pointers actually point to those addresses. What is the final state of INDA and INDB when the cog starts to run?
Also, it occurs to me that it would be possible to generate instructions like:
MOV ++INDA, INDA++
MOV INDA++, INDA--
etc.
Are these allowed or undefined?
Those crazy examples are all allowed. Just OR the 2-bit fields together to get the 2-bit post-effect.
At cog startup, INDA and INDB are configured as if these instructions had been executed:
FIXINDA $1F6,$1F6
FIXINDB $1F7,$1F7
So, reading or writing $1F6 or $1F7 has the intended effect. You just won't be able to have any conditional execution.
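Chip's description of the $1F6/$1F7 substitution can be modeled roughly like this. The Cog class and its methods are invented for illustration; only the addresses and the reset behavior come from the post.

```python
# Hedged sketch: when D or S equals $1F6/$1F7, the hardware substitutes
# the current INDA/INDB pointer value. The shadow registers at those
# addresses are then reachable only by pointing INDA/INDB back at them.

INDA_ADDR, INDB_ADDR = 0x1F6, 0x1F7

class Cog:
    def __init__(self):
        self.ram = [0] * 0x200
        # At cog start, INDA/INDB point at their own addresses
        # (as if FIXINDA $1F6,$1F6 / FIXINDB $1F7,$1F7 had run),
        # so $1F6/$1F7 behave like ordinary registers.
        self.inda, self.indb = INDA_ADDR, INDB_ADDR

    def resolve(self, addr):
        """Substitute the pointer when D or S is $1F6/$1F7."""
        if addr == INDA_ADDR:
            return self.inda
        if addr == INDB_ADDR:
            return self.indb
        return addr

    def read(self, addr):
        return self.ram[self.resolve(addr)]

    def write(self, addr, value):
        self.ram[self.resolve(addr)] = value

cog = Cog()
cog.write(0x1F6, 42)      # with the reset pointers, this hits $1F6 itself
print(cog.read(0x1F6))    # -> 42
cog.inda = 0x010          # repoint INDA (as a FIXINDA would)
cog.write(0x1F6, 7)       # now lands in register $010
print(cog.read(0x010))    # -> 7
```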
At cog startup, INDA and INDB are configured as if these instructions had been executed:
FIXINDA $1F6,$1F6
FIXINDB $1F7,$1F7
So this means that COGINIT only loads $1F6 instructions (effectively). I'm assuming that, internally, the cog load is performing 126 iterations of 4 RDLONGCs with the last two longs just being throw-away operations.
No. It loads $1F8 instructions. If you ever actually execute $1F6 or $1F7, it will get the data from those absolute registers. Only D and S have indirection. The instruction doesn't.
Now, provided we don't actually use any I/O, we can also put instructions into PINA, PINB, PINC and PIND (by doing MOV instructions, of course, since they are all 0's at launch) and get an extra 6 instructions in total
I experiment a bit with LMM on Prop2, and have a few questions:
1) does the single cog DE0-Nano version simulate the Hub timing for 8 cogs, or can the single cog access the hub on every cycle in this version?
2) what is the minimal number of cycles between modifying an instruction and executing it? My observation so far is that 2 instructions in between are enough.
Here is my first attempt at LMM; it executes at 1/5 the clock rate (12MHz for a 60MHz clock). Theoretically, every fourth RDLONGC reads a new quad and the timing then must be a multiple of 8 clocks, but I don't see this behavior.
1) Regardless of how few cogs an FPGA implementation has, it always cycles the hub as if there were eight cogs. So, the DE0-Nano board gives its single cog every 8th hub cycle, just as a single cog would get in a complete chip.
2) You are right about two instructions needing to be between an instruction modifier and the instruction getting modified. I was just writing some pipeline explanation about this:
PIPELINE
--------
Each cog has a 4-stage pipeline which all instructions progress through, in order to execute:
1st stage - Read instruction
2nd stage - Determine indirect/remapped D and S addresses, update INDA/INDB
3rd stage - Read D and S
4th stage - Execute instruction, writing D, Z/C/PC, and any other results
On every clock cycle, the instruction in each stage advances to the next stage, unless the instruction
in the 4th stage is stalling the pipeline because it's waiting for something (i.e. WRBYTE waits for
the hub).
To keep D and S data current within the pipeline, the resultant D from the 4th stage is passed back to
the 3rd stage to substitute for any obsoleted data being read from the cog register RAM. The same is
done for instruction data in the 1st stage, but there is still a two-stage gap between when a register
is modified and it can be executed, at the earliest:
MOVD :inst,top9 'modify instruction
NOP '1...
NOP '2... at least two instructions in-between
:inst ADD A,B 'modified instruction executes
Tasks that execute in at least every 3rd time slot don't need to observe this 2-instruction rule because
their instructions will always be sufficiently spread apart in the pipeline.
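A quick arithmetic model of the hazard above, assuming (as described) that the modifier writes in stage 4 and that forwarding lets a fetch on the write clock already see the new instruction data:

```python
# Sketch of the self-modification hazard in the 4-stage pipeline.
# An instruction writes its result in stage 4; an instruction is read
# in stage 1. With forwarding into the fetch stage, a fetch on the
# same clock as the write already sees the new value.

PIPELINE_DEPTH = 4

def gap_is_safe(gap):
    """gap = instructions between the modifier and the modified target.
    The modifier (fetched at clock 0) writes in stage 4, at clock 3.
    The target is fetched at clock gap + 1, which must be no earlier
    than the write clock for the new encoding to be seen."""
    fetch_clock = gap + 1
    write_clock = PIPELINE_DEPTH - 1
    return fetch_clock >= write_clock

print([(g, gap_is_safe(g)) for g in range(4)])
# -> [(0, False), (1, False), (2, True), (3, True)]
```

This reproduces the two-instruction rule, and also why tasks running in at most every 3rd time slot are exempt: their consecutive instructions are already fetched at least three clocks apart.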
Your LMM code looks sensible to me. I would think that every 20 clocks you should suffer a 4-clock delay to get up to a multiple of 8 clocks (24). When RDLONGC needs to do a RDQUAD, it will take 3-10 clocks. What kind of 4-loop period are you seeing?
I also expected a 24 cycle loop for 4 instructions. My LMM loop is 2 instructions:
notp #2
sub pc,#8
so I should get a toggle frequency at pin 2 of 2.5MHz (60MHz / 24), with a small asymmetry, because 3 instructions need 5 cycles each and one instruction needs 9 cycles. But my scope showed a higher frequency (more like 3MHz) and a symmetrical signal yesterday; I need to verify this again. At the moment I also have a second task running with a 25% time slot together with the LMM loop from the last post. With this second task I see some jitter in the LMM-generated frequency, but it still works. If I give the second task 50%, then I need to execute the JMPD one instruction later, which makes sense to me.
But after a while I also found the reason:
This 2-instruction LMM loop always stays in the quad cache, so there is no need to reload it, and rdlongc always takes only 1 cycle.
This changes if I make the LMM code longer, or if the addresses of the two-instruction loop cross a quad boundary.
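The figures in these posts can be put into a rough clock-count model. The 5 clocks per LMM instruction and the 8-clock hub period come from the posts; the round-up-to-hub-window rule is my assumption about how the stall behaves.

```python
# Back-of-envelope model: an LMM step costs 5 clocks when RDLONGC hits
# the quad cache, and a loop pass only stretches to the next hub window
# (every 8 clocks) when a new quad must actually be fetched from hub RAM.

HUB_PERIOD = 8

def loop_clocks(n_instructions, quads_fetched, clocks_per_inst=5):
    """Clocks per LMM loop pass: base cost, rounded up to a multiple of
    HUB_PERIOD if the pass has to fetch a quad from the hub."""
    base = n_instructions * clocks_per_inst
    if quads_fetched == 0:
        return base                      # loop lives in the quad cache
    return -(-base // HUB_PERIOD) * HUB_PERIOD   # ceiling to hub window

# Chip's estimate: 4 instructions, one quad reload per pass
print(loop_clocks(4, 1))   # -> 24
# Andy's 2-instruction loop inside one quad: no reload
print(loop_clocks(2, 0))   # -> 10
```

With the cached 2-instruction loop, a full toggle period is 2 passes of 10 clocks, i.e. 60MHz / 20 = 3MHz, matching the symmetrical ~3MHz scope reading rather than the expected 2.5MHz.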
Here's a little program that kicks off four tasks running the same code, but with different variable sets.
Register remapping is set up to remap 4 sets of 4 registers, according to the task executing. For tasks 0..3, hard addresses 0..3 remap to 0..3, 4..7, 8..11, or 12..15.
dat
org 'longs are like nop's, get skipped
pin long 0 'task 0 data
count long 1
delay long 0
extra long 0
long 1 'task 1 data
long 5
long 0
long 0
long 2 'task 2 data
long 13
long 0
long 0
long 3 'task 3 data
long 29
long 0
long 0
setmap #%1_010_010 'remap registers by task, 4 sets, 4 registers each
settask #%%3210 'enable all tasks
jmptask #loop,#%1111 'before any newly-started tasks get to execute stage, jump all tasks to loop
loop notp pin 'toggle task x pin
mov delay,count 'get task x delay
djnz delay,#$ 'count down delay
jmp #loop 'loop (count + 3 clocks)
Task 0 toggles pin 0 every 16 clocks.
Task 1 toggles pin 1 every 32 clocks.
Task 2 toggles pin 2 every 64 clocks.
Task 3 toggles pin 3 every 128 clocks.
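The four toggle rates follow from simple arithmetic: each task owns every 4th clock, and one loop pass is count + 3 instructions (notp, mov, jmp, plus the djnz executed count times). A quick Python check of the numbers:

```python
# Verifying the toggle periods in the four-task example above.
# Four tasks round-robin, so each task executes one instruction
# every 4 clocks; djnz on a delay of `count` runs count times.

SLOT_STRIDE = 4   # clocks between a task's consecutive instructions

def toggle_period(count):
    """Clocks between successive NOTPs for a task with the given count."""
    instructions_per_pass = count + 3   # notp + mov + jmp + count djnz passes
    return instructions_per_pass * SLOT_STRIDE

for task, count in enumerate([1, 5, 13, 29]):
    print(f"task {task}: toggles every {toggle_period(count)} clocks")
# -> 16, 32, 64, 128
```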
The P2 is certainly going to keep Koroneko busy
Thanks for the feedback, Guys.
Sounds like you've got an accurate handle on the multitasking.
I want to continue with the COSMACog progect using the P1, but also want to have a P2 version when the P2 hits the streets.
Depending on how things look around Jan/Feb I may spring for a DE2-115.
C.W.
Here is a link to it: http://forums.parallax.com/showthread.php?144384-p2load-A-Loader-for-the-Propeller-II
Thanks!
David
It might make more sense to implement the COSMAC Elf in an FPGA:
http://whats.all.this.brouhaha.com/category/computing/hardware/fpga/
I may do that as well someday for giggles, but for now the prop1 and then prop2 are fine.
The goal is to have a very low cost emulator with a low barrier to entry.
C.W.
Now provided we dont actually use any I/O, we can also put instructions into PINA, PINB, PINC & PIND (by doing mov instructions of course since they are all 0's at launch) and get an extra 6 instructions in total
That is enough for my zero footprint debugger.
Good to know that my findings on the FPGA version are also applicable on the real Prop2.
I'm going to document register remapping next.