It's my lack of experience in actually programming DSP that caused me to come up with this overly-complex solution to what was a simple problem. If we revise the die, we'll improve this mechanism.
I suspect your experience is still better than most. Nevertheless, for anyone interested in DSP, I strongly recommend reading The Scientist and Engineer's Guide to Digital Signal Processing (http://www.dspguide.com, PDF version found at http://www.dspguide.com/pdfbook.htm). It's very accessible. It won't give code, but it should give enough understanding to write the code.
This is very cool stuff. So of the CORDIC unit, big multiplier, big divider, and big square-rooter, do these live in independent hardware, or do they share resources? Could an ambitious coder run all 4 simultaneously? The fast MUL/SCL/MAC instructions I'm assuming are independent of these?
Wow. We have very luxurious math in PASM now. So far, I've used them a few times. Haven't interleaved ops yet, but obviously it's an option. I find myself working shifts and adds, only to remember that we've got fast math now. Fun!
I suspect your experience is still better than most. Nevertheless, for anyone interested in DSP, I strongly recommend reading The Scientist and Engineer's Guide to Digital Signal Processing (http://www.dspguide.com, PDF version found at http://www.dspguide.com/pdfbook.htm). It's very accessible. It won't give code, but it should give enough understanding to write the code.
I didn't find it in one file - which would be around 15MB
That's true. Each chapter is a separate PDF. I used an online tool to merge them into one file. I'd put that up here, but I don't think it falls under "permissible use".
Seriously. I've never seen such luxury in all my years of embedded design.
At one point I wondered why only one cog would fit in a Cyclone IV 4C22. Now I know why. There's really an extraordinary amount of logic packed in the P2.
Did you think Chip has really been foolin' around for the last 5 years?
REGISTER REMAPPING
------------------
The SETMAP instruction is used to remap a 2^n-sized block of registers starting at $000, so
that direct accesses to those registers will be redirected to a range of identically-sized
blocks, which also build from $000. This feature allows a single program to run multiple
instances of itself by having unique sets of statically-addressable registers which switch
according to either INDA or the current task.
When using remapping, you must locate your program code above the last used block of
registers which the bottom-most block of registers will be remapped to. For example, if you
select 8 blocks of 16 registers, but are only using 6 of those blocks, your program code
must not start below register 96 (6*16), to avoid encroaching into the registers which are
going to be the recipients of remapping.
Here is the SETMAP instruction:
SETMAP D/#n     - Configure register remapping to %M_BBB_RRR
    %M = mode
        %0 = INDA selects the block
        %1 = task number selects the block
    %BBB = block count
        %000 = 1 block        remapping disabled for %000
        %001 = 2 blocks       remapping enabled for %001..%111
        %010 = 4 blocks
        %011 = 8 blocks
        %100 = 16 blocks
        %101 = 32 blocks
        %110 = 64 blocks
        %111 = 128 blocks
    %RRR = register count
        %000 = 1 register     remap $000
        %001 = 2 registers    remap $000..$001
        %010 = 4 registers    remap $000..$003
        %011 = 8 registers    remap $000..$007
        %100 = 16 registers   remap $000..$00F
        %101 = 32 registers   remap $000..$01F
        %110 = 64 registers   remap $000..$03F
        %111 = 128 registers  remap $000..$07F
The new mapping scheme will be in effect on the third instruction after SETMAP. After that,
changes to INDA or the task number will have an immediate effect on block selection. The
remapping mechanism only works with hard-coded D and S addresses, not via INDA and INDB
accesses.
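As a rough illustration of the field layout described above, here is a small Python sketch (not PASM, and not part of the chip documentation) that splits a SETMAP operand into its %M_BBB_RRR fields and works out the lowest safe code address from the "6 blocks of 16 registers" example. The function names decode_setmap and min_code_start are made up for this sketch.

```python
def decode_setmap(value):
    """Split a 7-bit SETMAP operand %M_BBB_RRR into its fields.

    Hypothetical helper for illustration; field meanings follow the
    SETMAP description in this post.
    """
    if not 0 <= value <= 0b1111111:
        raise ValueError("SETMAP operand must fit in 7 bits")
    mode = (value >> 6) & 1        # %M: 0 = INDA selects, 1 = task number selects
    bbb = (value >> 3) & 0b111     # %BBB: block count is 2**BBB
    rrr = value & 0b111            # %RRR: register count is 2**RRR
    return {
        "selector": "task" if mode else "INDA",
        "blocks": 1 << bbb,
        "regs": 1 << rrr,
        "enabled": bbb != 0,       # %000 blocks means remapping is disabled
    }


def min_code_start(blocks_used, regs_per_block):
    """Lowest register address at which code may start without
    overlapping the blocks that receive remapped accesses."""
    return blocks_used * regs_per_block


cfg = decode_setmap(0b0_011_100)            # 8 blocks of 16 registers, by INDA
print(cfg)
print(min_code_start(6, cfg["regs"]))       # 96, matching the 6*16 example
```

The point of keeping the two helpers separate is that the configured block count (8 here) and the number of blocks actually used (6) can differ, and only the latter limits where code may start.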
Below is an elaboration of all uniquely-useful remapping schemes:
                                S/D addresses
%M_BBB_RRR  blocks  regs   initial    -> remapped     block selector
-----------------------------------------------------------------------------
%x_000_xxx       1     x   <same>
%0_001_000       2     1   %000000000 -> %00000000P   P = INDA[0]
%0_001_001       2     2   %00000000X -> %0000000PX
%0_001_010       2     4   %0000000XX -> %000000PXX   (2 threads)
%0_001_011       2     8   %000000XXX -> %00000PXXX
%0_001_100       2    16   %00000XXXX -> %0000PXXXX
%0_001_101       2    32   %0000XXXXX -> %000PXXXXX
%0_001_110       2    64   %000XXXXXX -> %00PXXXXXX
%0_001_111       2   128   %00XXXXXXX -> %0PXXXXXXX
%0_010_000       4     1   %000000000 -> %0000000PP   PP = INDA[1..0]
%0_010_001       4     2   %00000000X -> %000000PPX
%0_010_010       4     4   %0000000XX -> %00000PPXX   (4 threads)
%0_010_011       4     8   %000000XXX -> %0000PPXXX
%0_010_100       4    16   %00000XXXX -> %000PPXXXX
%0_010_101       4    32   %0000XXXXX -> %00PPXXXXX
%0_010_110       4    64   %000XXXXXX -> %0PPXXXXXX
%0_010_111       4   128   %00XXXXXXX -> %PPXXXXXXX
%0_011_000       8     1   %000000000 -> %000000PPP   PPP = INDA[2..0]
%0_011_001       8     2   %00000000X -> %00000PPPX
%0_011_010       8     4   %0000000XX -> %0000PPPXX   (8 threads)
%0_011_011       8     8   %000000XXX -> %000PPPXXX
%0_011_100       8    16   %00000XXXX -> %00PPPXXXX
%0_011_101       8    32   %0000XXXXX -> %0PPPXXXXX
%0_011_110       8    64   %000XXXXXX -> %PPPXXXXXX
%0_100_000      16     1   %000000000 -> %00000PPPP   PPPP = INDA[3..0]
%0_100_001      16     2   %00000000X -> %0000PPPPX
%0_100_010      16     4   %0000000XX -> %000PPPPXX   (16 threads)
%0_100_011      16     8   %000000XXX -> %00PPPPXXX
%0_100_100      16    16   %00000XXXX -> %0PPPPXXXX
%0_100_101      16    32   %0000XXXXX -> %PPPPXXXXX
%0_101_000      32     1   %000000000 -> %0000PPPPP   PPPPP = INDA[4..0]
%0_101_001      32     2   %00000000X -> %000PPPPPX
%0_101_010      32     4   %0000000XX -> %00PPPPPXX   (32 threads)
%0_101_011      32     8   %000000XXX -> %0PPPPPXXX
%0_101_100      32    16   %00000XXXX -> %PPPPPXXXX
%0_110_000      64     1   %000000000 -> %000PPPPPP   PPPPPP = INDA[5..0]
%0_110_001      64     2   %00000000X -> %00PPPPPPX
%0_110_010      64     4   %0000000XX -> %0PPPPPPXX   (64 threads)
%0_110_011      64     8   %000000XXX -> %PPPPPPXXX
%0_111_000     128     1   %000000000 -> %00PPPPPPP   PPPPPPP = INDA[6..0]
%0_111_001     128     2   %00000000X -> %0PPPPPPPX
%0_111_010     128     4   %0000000XX -> %PPPPPPPXX   (128 threads)
%1_001_000       2     1   %000000000 -> %00000000T   T = bit 0 of the task number
%1_001_001       2     2   %00000000X -> %0000000TX
%1_001_010       2     4   %0000000XX -> %000000TXX   (2 tasks)
%1_001_011       2     8   %000000XXX -> %00000TXXX
%1_001_100       2    16   %00000XXXX -> %0000TXXXX
%1_001_101       2    32   %0000XXXXX -> %000TXXXXX
%1_001_110       2    64   %000XXXXXX -> %00TXXXXXX
%1_001_111       2   128   %00XXXXXXX -> %0TXXXXXXX
%1_010_000       4     1   %000000000 -> %0000000TT   TT = task number
%1_010_001       4     2   %00000000X -> %000000TTX
%1_010_010       4     4   %0000000XX -> %00000TTXX   (4 tasks)
%1_010_011       4     8   %000000XXX -> %0000TTXXX
%1_010_100       4    16   %00000XXXX -> %000TTXXXX
%1_010_101       4    32   %0000XXXXX -> %00TTXXXXX
%1_010_110       4    64   %000XXXXXX -> %0TTXXXXXX
%1_010_111       4   128   %00XXXXXXX -> %TTXXXXXXX
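The table rows all follow one pattern: the low RRR bits of the address pass through, and the block selector bits are placed directly above them. A minimal Python model of that pattern might look like the following. It is only a reading of the table, not a description of the actual hardware; remap_address is a made-up name, and the caller is assumed to pass in whatever INDA or the task number currently holds as the selector.

```python
def remap_address(addr, setmap, selector):
    """Model which register a hard-coded D/S address reaches under SETMAP.

    addr     -- 9-bit register address written in the instruction
    setmap   -- %M_BBB_RRR operand (the mode bit is not needed here; the
                caller supplies INDA or the task number as 'selector')
    selector -- current INDA value or task number
    """
    bbb = (setmap >> 3) & 0b111
    rrr = setmap & 0b111
    if bbb == 0:
        return addr                 # %x_000_xxx: remapping disabled
    if bbb + rrr > 9:
        # The table lists no scheme wider than 9 address bits, so this
        # sketch refuses to guess what such a setting would do.
        raise ValueError("scheme not listed in the table")
    if addr >= (1 << rrr):
        return addr                 # outside the remapped window at $000
    block = selector & ((1 << bbb) - 1)
    return (block << rrr) | addr


# 4 blocks of 4 registers selected by INDA[1..0], as in %0_010_010:
print(hex(remap_address(0x001, 0b0_010_010, 0x012)))   # 0x9
print(hex(remap_address(0x010, 0b0_010_010, 0x012)))   # 0x10 (not remapped)
```

With INDA pointing at $012, the low two bits select block 2, so an access to $001 lands on $009, while $010 is outside the 4-register window and is left alone.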
Here is an example program which uses remapping with multi-threading:
DAT             org
period          long    2-1             '$000, thread 0 (20 longs initially execute as NOPs)
time            long    0               '$001, thread 0
pin_x           long    0               '$002, thread 0
pin_y           long    1               '$003, thread 0
                long    4-1             '$000, thread 1
                long    0               '$001, thread 1
                long    2               '$002, thread 1
                long    3               '$003, thread 1
                long    8-1             '$000, thread 2
                long    0               '$001, thread 2
                long    4               '$002, thread 2
                long    5               '$003, thread 2
                long    16-1            '$000, thread 3
                long    0               '$001, thread 3
                long    6               '$002, thread 3
                long    7               '$003, thread 3
pc              long    loop[4]         '$010..$013, all threads start at loop
                setmap  #%0_010_010     'remap 4 blocks of 4 regs by INDA[1..0]
                fixinda #pc+3,#pc       'set INDA to cycle through blocks and threads
                nop                     'allow SETMAP 3 clocks to take effect
loop            tasksw                  'switch to next thread
                incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 threads executing same code with unique variables)
Here is an example program which uses remapping with multi-tasking:
DAT             org
period          long    2-1             '$000, task 0 (16 longs initially execute as NOPs)
time            long    0               '$001, task 0
pin_x           long    0               '$002, task 0
pin_y           long    1               '$003, task 0
                long    4-1             '$000, task 1
                long    0               '$001, task 1
                long    2               '$002, task 1
                long    3               '$003, task 1
                long    8-1             '$000, task 2
                long    0               '$001, task 2
                long    4               '$002, task 2
                long    5               '$003, task 2
                long    16-1            '$000, task 3
                long    0               '$001, task 3
                long    6               '$002, task 3
                long    7               '$003, task 3
                setmap  #%1_010_010     'remap 4 blocks of 4 regs by task
                settask #%11_10_01_00   'set all 4 tasks in motion
                jmptask #loop,#%1111    'herd tasks to loop
loop            incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 tasks executing same code with unique registers)
Wow, I didn't realize the P2 had both threading and multi-tasking.
Threading appears to be cooperative multi-tasking, yielding control of the COG when the loop is finished, whereas the multi-tasking appears to be more like temporal multi-threading.
TASKSW only yields control of the main COG after a single section of code runs, executing only one PC at a time.
SETTASK allows for up to 4 PCs to be executing simultaneously, but at different pipeline stages, so each PC moves forward in lockstep with another.
TASKSW is useful for applications where you have either very time sensitive, or blocking code that you want to run, where other tasks don't have hard realtime demands.
SETTASK is useful for applications where you need hard realtime in multiple threads at once, but at the expense of only using non-blocking, non-flushing instructions.
I just added some details to the latest docs in post #316, in case anyone already grabbed them.
The attached document states the following for TASKSW: "Instructions trailing TASKSWD are in the next thread". However, this would seem to contradict the way that the other xxxD instructions seem to work (i.e. trailing instructions that are already in the pipeline are associated with the code that's *before* the jump, not after). If TASKSW is conceptually different this way (the documentation is correct), I suggest emphasizing that in the document.
The reason is because TASKSWD is (I think) 'JMPRETD INDA,++INDA WZ, WC' and when INDA gets incremented, the next instruction has the remapped registers already pointing to the next thread's register block and the flags have been saved and updated, as well. So, the thread context has switched and those trailing instructions are in the next thread.
I'll make sure this is documented better. Thanks for pointing this out.
The threading example makes my brain hurt, which might explain why it looks "wrong" to me. When that code runs, do you actually end up with an initial four switches that basically do nothing but fix up the PC array? Would this also work:
DAT             org
period          long    2-1             '$000, thread 0 (20 longs initially execute as NOPs)
time            long    0               '$001, thread 0
pin_x           long    0               '$002, thread 0
pin_y           long    1               '$003, thread 0
                long    4-1             '$000, thread 1
                long    0               '$001, thread 1
                long    2               '$002, thread 1
                long    3               '$003, thread 1
                long    8-1             '$000, thread 2
                long    0               '$001, thread 2
                long    4               '$002, thread 2
                long    5               '$003, thread 2
                long    16-1            '$000, thread 3
                long    0               '$001, thread 3
                long    6               '$002, thread 3
                long    7               '$003, thread 3
pc              long    thread[4]       '$010..$013, all threads start at thread
                setmap  #%0_010_010     'remap 4 blocks of 4 regs by INDA[1..0]
                fixinda #pc+3,#pc       'set INDA to cycle through blocks and threads
                nop                     'allow SETMAP 3 clocks to take effect
loop            tasksw                  'switch to next thread
thread          incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 threads executing same code with unique variables)
My reasoning here is that the pc array will contain the addresses for the thread label (not loop), and TASKSW (rather, JMPRET) is going to load that address from the next array element while storing PC+1 (with PC value being the address of the TASKSW instruction) in the current array element (which is always the same address as the thread label).
Or do I have this all wrong?
That code will do the same thing as my example. It will just execute the loop's body code before the first TASKSW for each thread.
My example would have been a lot richer if I showed a more complex program (instead of a loop) which made conditional branches with TASKSW's placed throughout. That way, the PC array wouldn't always contain the same values, but possibly a different return address for each thread, most of the time.
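To make the PC-array behaviour discussed above easier to follow, here is a toy Python model of TASKSW treated as 'JMPRET INDA,++INDA', which is how it was tentatively described earlier in the thread. Everything here is an assumption for illustration: the ThreadSwitcher class is made up, INDA is assumed to start at the bottom of the FIXINDA range and wrap from pc+3 back to pc, and the addresses ($010 for pc, $017 for the TASKSW, $018 for the first body instruction) are simply counted off the example listing.

```python
class ThreadSwitcher:
    """Toy model of TASKSW as 'JMPRET INDA,++INDA' over a PC array."""

    def __init__(self, pc_base, entry, count):
        self.pc_base = pc_base
        self.count = count
        self.pcs = [entry] * count   # 'pc long entry[4]'
        self.inda = pc_base          # assumed starting point after FIXINDA

    def tasksw(self, tasksw_addr):
        """Save the return address for the current thread, advance INDA
        (with wrap), and return the address the next thread resumes at."""
        slot = self.inda - self.pc_base
        self.pcs[slot] = tasksw_addr + 1        # store PC+1 in current element
        slot = (slot + 1) % self.count          # ++INDA, wrapping in the range
        self.inda = self.pc_base + slot
        return self.pcs[slot]                   # jump target from next element


def first_resumes(entry, tasksw_addr=0x17, count=4):
    """Resume addresses produced by the first 'count' executions of TASKSW."""
    sw = ThreadSwitcher(pc_base=0x10, entry=entry, count=count)
    return [sw.tasksw(tasksw_addr) for _ in range(count)]


# Original example: pc array initialised to 'loop' (the TASKSW itself).
print([hex(a) for a in first_resumes(entry=0x17)])  # ['0x17', '0x17', '0x17', '0x18']
# Variant: pc array initialised to 'thread' (first body instruction).
print([hex(a) for a in first_resumes(entry=0x18)])  # ['0x18', '0x18', '0x18', '0x18']
```

Under these assumptions the original listing does spend its first few switches landing back on the TASKSW while the PC array fills in with return addresses, whereas the variant lets each thread run the loop body straight away. After that the two behave the same.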
INTER-COG EXCHANGE
------------------
The fourth I/O port of each cog (PIND/DIRD) is implemented as a 32-bit inter-cog data
exchange, instead of 32 external I/O pins.
Each cog outputs 32 bits of data via PIND, with the actual output being the logical-AND of
the PIND and DIRD registers. Each cog can select which of the other cogs' PIND outputs are
going to be gated into its own PIND inputs, on a per-byte basis.
The only control over the inter-cog exchange is each cog's PIND input filter.
The SETXCH instruction is used to set the PIND input filter:
SETXCH D/#n     - Set PIND input filter to %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA
    %DDDDDDDD = filter for PIND input bits 31..24
        %xxxxxxx1 = cog 0's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxxxxx1x = cog 1's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxxxx1xx = cog 2's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxxx1xxx = cog 3's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxx1xxxx = cog 4's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xx1xxxxx = cog 5's PIND[31..24] output will be OR'd into PIND[31..24] input
        %x1xxxxxx = cog 6's PIND[31..24] output will be OR'd into PIND[31..24] input
        %1xxxxxxx = cog 7's PIND[31..24] output will be OR'd into PIND[31..24] input
    %CCCCCCCC = filter for PIND input bits 23..16
        %xxxxxxx1 = cog 0's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxxxxx1x = cog 1's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxxxx1xx = cog 2's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxxx1xxx = cog 3's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxx1xxxx = cog 4's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xx1xxxxx = cog 5's PIND[23..16] output will be OR'd into PIND[23..16] input
        %x1xxxxxx = cog 6's PIND[23..16] output will be OR'd into PIND[23..16] input
        %1xxxxxxx = cog 7's PIND[23..16] output will be OR'd into PIND[23..16] input
    %BBBBBBBB = filter for PIND input bits 15..8
        %xxxxxxx1 = cog 0's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxxxxx1x = cog 1's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxxxx1xx = cog 2's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxxx1xxx = cog 3's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxx1xxxx = cog 4's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xx1xxxxx = cog 5's PIND[15..8] output will be OR'd into PIND[15..8] input
        %x1xxxxxx = cog 6's PIND[15..8] output will be OR'd into PIND[15..8] input
        %1xxxxxxx = cog 7's PIND[15..8] output will be OR'd into PIND[15..8] input
    %AAAAAAAA = filter for PIND input bits 7..0
        %xxxxxxx1 = cog 0's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxxxxx1x = cog 1's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxxxx1xx = cog 2's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxxx1xxx = cog 3's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxx1xxxx = cog 4's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xx1xxxxx = cog 5's PIND[7..0] output will be OR'd into PIND[7..0] input
        %x1xxxxxx = cog 6's PIND[7..0] output will be OR'd into PIND[7..0] input
        %1xxxxxxx = cog 7's PIND[7..0] output will be OR'd into PIND[7..0] input
To input only cog 0's 32-bit output, you would use the filter value $01_01_01_01. To input
the logical-OR of cog 0's and cog 1's 32-bit outputs, you would use $03_03_03_03. In most
programming cases, it may be desirable to just see one other cog's full 32-bit output in
your PIND input, but many other arrangements are possible. For example, by using 8-bit or
16-bit fields with SETF/MOVF to transfer data piecewise from several cogs, a final cog can
read the aggregate 32-bit result.
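The filter arithmetic can be sketched in a few lines of Python. This is only a software model of the description above, with a made-up function name (pind_input); it assumes each entry in cog_outputs is already that cog's gated output (PIND AND DIRD) and ignores the clock latency mentioned elsewhere in the documentation.

```python
def pind_input(filter_value, cog_outputs):
    """Model a cog's PIND input for a given SETXCH filter.

    filter_value -- 32-bit %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA
    cog_outputs  -- list of 8 values, one 32-bit gated output per cog
    """
    if len(cog_outputs) != 8:
        raise ValueError("expected one output per cog (8 cogs)")
    result = 0
    for byte in range(4):                         # byte 0 = bits 7..0
        enables = (filter_value >> (8 * byte)) & 0xFF
        lane = 0xFF << (8 * byte)
        for cog in range(8):
            if enables & (1 << cog):              # bit n enables cog n
                result |= cog_outputs[cog] & lane
    return result


outputs = [0x11223344, 0x000000FF, 0, 0, 0, 0, 0, 0]
print(hex(pind_input(0x01_01_01_01, outputs)))    # 0x11223344 (cog 0 only)
print(hex(pind_input(0x03_03_03_03, outputs)))    # 0x112233ff (cog 0 OR cog 1)
print(hex(pind_input(0x02_01_01_01, outputs)))    # 0x223344 (top byte from cog 1)
```

The last call shows a mixed arrangement: the top byte lane listens to cog 1 while the lower three lanes listen to cog 0.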
After SETXCH, PIND can be read for newly-filtered data on the third clock:
        SETXCH  #$00000001      'change filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from new filter
Writes to a PIND are readable from a PIND on the third clock, as well.
The PIND port does not connect to CTRA, CTRB, XFR, or SER, but it does support the
following pin instructions, as if it were a regular I/O port:
    GETP/GETNP                                      - pin reads
    OFFP/NOTP/CLRP/SETP/SETPC/SETPNC/SETPZ/SETPNZ   - pin writes
    JP/JPD/JNP/JNPD                                 - pin branches
INTER-COG EXCHANGE
------------------
The fourth I/O port of each cog (PIND/DIRD) is implemented as a 32-bit inter-cog data
exchange, instead of 32 external I/O pins.
Each cog outputs 32 bits of data via PIND, with the actual output being the logical-AND of
the PIND and DIRD registers. Each cog can select which of the other cogs' PIND outputs are
going to be gated into its own PIND inputs, on a per-byte basis.
The only control over the inter-cog exchange is each cog's PIND input filter.
The SETXCH instruction is used to set the PIND input filter:
SETXCH  D/#n  - Set PIND input filter to %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA

    %DDDDDDDD = filter for PIND input bits 31..24
        %xxxxxxx1 = cog 0's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxxxxx1x = cog 1's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxxxx1xx = cog 2's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxxx1xxx = cog 3's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xxx1xxxx = cog 4's PIND[31..24] output will be OR'd into PIND[31..24] input
        %xx1xxxxx = cog 5's PIND[31..24] output will be OR'd into PIND[31..24] input
        %x1xxxxxx = cog 6's PIND[31..24] output will be OR'd into PIND[31..24] input
        %1xxxxxxx = cog 7's PIND[31..24] output will be OR'd into PIND[31..24] input

    %CCCCCCCC = filter for PIND input bits 23..16
        %xxxxxxx1 = cog 0's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxxxxx1x = cog 1's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxxxx1xx = cog 2's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxxx1xxx = cog 3's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xxx1xxxx = cog 4's PIND[23..16] output will be OR'd into PIND[23..16] input
        %xx1xxxxx = cog 5's PIND[23..16] output will be OR'd into PIND[23..16] input
        %x1xxxxxx = cog 6's PIND[23..16] output will be OR'd into PIND[23..16] input
        %1xxxxxxx = cog 7's PIND[23..16] output will be OR'd into PIND[23..16] input

    %BBBBBBBB = filter for PIND input bits 15..8
        %xxxxxxx1 = cog 0's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxxxxx1x = cog 1's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxxxx1xx = cog 2's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxxx1xxx = cog 3's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xxx1xxxx = cog 4's PIND[15..8] output will be OR'd into PIND[15..8] input
        %xx1xxxxx = cog 5's PIND[15..8] output will be OR'd into PIND[15..8] input
        %x1xxxxxx = cog 6's PIND[15..8] output will be OR'd into PIND[15..8] input
        %1xxxxxxx = cog 7's PIND[15..8] output will be OR'd into PIND[15..8] input

    %AAAAAAAA = filter for PIND input bits 7..0
        %xxxxxxx1 = cog 0's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxxxxx1x = cog 1's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxxxx1xx = cog 2's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxxx1xxx = cog 3's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xxx1xxxx = cog 4's PIND[7..0] output will be OR'd into PIND[7..0] input
        %xx1xxxxx = cog 5's PIND[7..0] output will be OR'd into PIND[7..0] input
        %x1xxxxxx = cog 6's PIND[7..0] output will be OR'd into PIND[7..0] input
        %1xxxxxxx = cog 7's PIND[7..0] output will be OR'd into PIND[7..0] input
To input only cog 0's 32-bit output, you would use the filter value $01_01_01_01. To input
the logical-OR of cog 0's and cog 1's 32-bit outputs, you would use $03_03_03_03. In most
programming cases, it may be desirable to just see one other cog's full 32-bit output in
your PIND input, but many other arrangements are possible. For example, by using 8-bit or
16-bit fields with SETF/MOVF to transfer data piecewise from several cogs, a final cog can
read the aggregate 32-bit result.
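To make the filter layout concrete, here is a small Python model of how the four filter bytes could select and OR together the cogs' outputs. It is only a sketch of the behavior described above, not anything that runs on the chip; the function name pind_input and the sample output values are made up for illustration.

```python
NUM_COGS = 8  # one enable bit per cog in each filter byte


def pind_input(cog_outputs, filter_value):
    """Model the 32-bit PIND input a cog would see.

    cog_outputs  -- list of eight 32-bit PIND output values, index = cog number
    filter_value -- 32-bit SETXCH filter, %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA
    """
    if len(cog_outputs) != NUM_COGS:
        raise ValueError("expected one output value per cog")
    result = 0
    for lane in range(4):                       # lane 0 = bits 7..0 (%AAAAAAAA)
        enables = (filter_value >> (8 * lane)) & 0xFF
        lane_mask = 0xFF << (8 * lane)
        for cog in range(NUM_COGS):
            if enables & (1 << cog):            # this cog is enabled for this byte lane
                result |= cog_outputs[cog] & lane_mask
    return result


# Hypothetical output values for cog 0 and cog 1; the other cogs drive 0.
outputs = [0x11223344, 0x0F0F0F0F, 0, 0, 0, 0, 0, 0]
print(hex(pind_input(outputs, 0x01010101)))     # cog 0 only: 0x11223344
print(hex(pind_input(outputs, 0x03030303)))     # cog 0 OR cog 1: 0x1f2f3f4f
```

A mixed filter such as $02_02_01_01 would, in this model, take the upper 16 bits from cog 1 and the lower 16 bits from cog 0, which is the kind of piecewise arrangement the text mentions.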
After SETXCH, PIND can be read for newly-filtered data on the third clock:
        SETXCH  #$00000001      'change filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from new filter
Writes to PIND are likewise readable from PIND on the third clock.
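The three-clock delay can be pictured as a short delay line between the SETXCH write and the point where PIND is sampled. The Python sketch below models only that timing, taken from the listing above (two reads under the old filter, then the new one); the class name PindFilterModel and the two-stage depth are assumptions made for illustration, not a description of the actual silicon.

```python
from collections import deque


class PindFilterModel:
    """Toy timing model: a new filter takes effect on the third read after SETXCH."""

    def __init__(self, initial_filter=0, latency=2):
        # 'latency' stages still carry the previous filter value (assumed depth).
        self._pipe = deque([initial_filter] * latency)
        self._requested = initial_filter

    def setxch(self, value):
        """Request a new filter; it is not visible to reads right away."""
        self._requested = value

    def clock(self):
        """Advance one clock and return the filter in effect for a read now."""
        self._pipe.append(self._requested)
        return self._pipe.popleft()


model = PindFilterModel(initial_filter=0x01010101)
model.setxch(0x02020202)
for read in range(1, 4):
    # Reads 1 and 2 still use the old filter, read 3 uses the new one.
    print(read, hex(model.clock()))
```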
The PIND port does not connect to CTRA, CTRB, XFR, or SER, but it does support the
following pin instructions, as if it were a regular I/O port:
GETP/GETNP - pin reads
OFFP/NOTP/CLRP/SETP/SETPC/SETPNC/SETPZ/SETPNZ - pin writes
JP/JPD/JNP/JNPD - pin branches
SNDSER & RCVSER are Chip's new inter-prop communications instructions. Last I heard was:
- 32 bit buffered input and output (1 long buffer I think)
- three lines needed: TXD, RXD, CLK
- can block on read/write, or poll for completion
- may need shared crystal between the props
- 1 bit per clock cycle (original plan was PLL'd higher)
Comments
That's right. They can all run at the same time.
The PDFs are a little tedious to find, but if you start here:
http://www.dspguide.com/CH1.PDF
then it is CH1 .. CH34.PDF
I didn't find it in one file - which would be around 15MB
That's true. Each chapter is a separate PDF. I used an online tool to merge them into one file. I'd put that up here, but I don't think it falls under "permissible use".
I'll work on that today. I should have it done by this evening.
Seriously. I've never seen such luxury in all my years of embedded design.
At one point I wondered why only one cog would fit in a Cyclone IV 4C22. Now I know why. There's really an extraordinary amount of logic packed in the P2.
Did you think Chip has really been foolin' around for the last 5 years?
You said you would post info on COG-to-COG communication through the internal port, and on remapping of COG registers.
How is it going?
I'm still working on the register remapping. After that I'll cover the Port D cog-to-cog communication.
Thanks
Prop2_Docs.txt
Here's the new section:
Threading appears to be cooperative multi-tasking, yielding control of the COG when the loop is finished, whereas the multi-tasking appears to be more like temporal multi-threading.
TASKSW only yields control of the main COG after a single section of code runs, executing only one PC at a time.
SETTASK allows for up to 4 PCs to be executing simultaneously, but at different pipeline stages, so each PC moves forward in lockstep with another.
TASKSW is useful when you have code that is either very time-sensitive or blocking, and the other tasks don't have hard real-time demands.
SETTASK is useful when you need hard real-time behavior in multiple threads at once, at the expense of being limited to non-blocking, non-flushing instructions.
The attached document states the following for TASKSW: "Instructions trailing TASKSWD are in the next thread". However, this would seem to contradict the way that the other xxxD instructions seem to work (i.e. trailing instructions that are already in the pipeline are associated with the code that's *before* the jump, not after). If TASKSW is conceptually different this way (the documentation is correct), I suggest emphasizing that in the document.
The reason is that TASKSWD is (I think) 'JMPRETD INDA,++INDA WZ, WC', and when INDA gets incremented, the next instruction already has the remapped registers pointing to the next thread's register block, and the flags have been saved and updated as well. So, the thread context has switched and those trailing instructions are in the next thread.
I'll make sure this is documented better. Thanks for pointing this out.
My reasoning here is that the PC array will contain the addresses of the thread labels (not the loop labels), and TASKSW (or rather, JMPRET) is going to load the address from the next array element while storing PC+1 (where PC is the address of the TASKSW instruction) in the current array element (which always holds the same address as the thread label).
Or do I have this all wrong?
That code will do the same thing as my example. It will just execute the loop's body code before the first TASKSW for each thread.
My example would have been a lot richer if I showed a more complex program (instead of a loop) which made conditional branches with TASKSW's placed throughout. That way, the PC array wouldn't always contain the same values, but possibly a different return address for each thread, most of the time.
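For anyone trying to picture the PC array, here is a rough Python model of the mechanism as discussed in this thread: on TASKSW the return address (PC+1) is stored in the current thread's slot and execution resumes from the next thread's slot. It is a sketch only. The tiny instruction set (WORK, JMP, TASKSW), the run function, and the sample program are all invented for illustration, and the model ignores register remapping and flag saving.

```python
def run(program, entry_points, steps):
    """Execute 'steps' instructions of a toy program with cooperative TASKSW.

    program      -- list of (opcode, argument) tuples
    entry_points -- initial PC for each thread (the PC array)
    Returns (trace, pcs): the work done as (thread, label) pairs and the final PC array.
    """
    pcs = list(entry_points)        # one saved PC per thread
    current = 0                     # index of the running thread
    pc = pcs[current]
    trace = []
    for _ in range(steps):
        op, arg = program[pc]
        if op == "TASKSW":
            pcs[current] = pc + 1                   # save return address in own slot
            current = (current + 1) % len(pcs)      # advance to the next thread
            pc = pcs[current]                       # resume where that thread left off
        elif op == "JMP":
            pc = arg
        else:                                       # "WORK": stand-in for real code
            trace.append((current, arg))
            pc += 1
    return trace, pcs


# Two threads, each a loop with one TASKSW in its body (hypothetical program).
PROGRAM = [
    ("WORK", "a"),      # 0: thread 0 entry
    ("TASKSW", None),   # 1
    ("JMP", 0),         # 2
    ("WORK", "b"),      # 3: thread 1 entry
    ("TASKSW", None),   # 4
    ("JMP", 3),         # 5
]

trace, pcs = run(PROGRAM, entry_points=[0, 3], steps=9)
print(trace)    # [(0, 'a'), (1, 'b'), (0, 'a'), (1, 'b')]
print(pcs)      # [2, 5]
```

With a single TASKSW per loop the PC array settles on the same return addresses every time, which matches the point made above; a program with TASKSWs scattered through conditional branches would leave different return addresses in the array from one switch to the next.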
Nice - Thanks
Edit:
File deleted.
Prop2_Docs.txt
Here's the new part:
BIG Thanks.
Sapieha just messaged me that you added the PIND docs... I am now digesting it.
SNDSER & RCVSER are Chip's new inter-prop communications instructions. Last I heard was:
- 32 bit buffered input and output (1 long buffer I think)
- three lines needed: TXD, RXD, CLK
- can block on read/write, or poll for completion
- may need shared crystal between the props
- 1 bit per clock cycle (original plan was PLL'd higher)
Many of us are waiting for more info :-)
The synthesis was too muddy to make an independent clock work.