Yes Chip, direct-to-hub would be nice, but that is impossible, as is clear from the round-robin nature of the COGs without strict priority servicing. The main reason for being interested in hub access in the first place is that software bit-fetch/bit-bang is too slow, and one needs to do segmentation and reassembly somewhere when using multiple COGs. Adding a COG DMA engine surely would speed things up ... no?
COG DMA should have no such hub constraints, though, and should be able to sample pretty darn fast, right? Looking at the counter modules, they can be incremented at clock speed, so I assume that with a Verilog/VHDL hardware state-machine assist and some arbitration if necessary, words can be DMA read or written in less than 4 clock cycles.
Of course, having an NRZ/NRZI SERDES solves many serial communications issues that even DMA can't solve :)
"Yes Chip, direct-to-hub would be nice, but that is impossible, as is clear from the round-robin nature of the COGs without strict priority servicing. The main reason for being interested in hub access in the first place is that software bit-fetch/bit-bang is too slow, and one needs to do segmentation and reassembly somewhere when using multiple COGs. Adding a COG DMA engine surely would speed things up ... no?"
It would, but it would be really useful if it performed modulation/demodulation after/before reading/writing cog RAM, as you state at the end here.
"COG DMA should have no such hub constraints, though, and should be able to sample pretty darn fast, right? Looking at the counter modules, they can be incremented at clock speed, so I assume that with a Verilog/VHDL hardware state-machine assist and some arbitration if necessary, words can be DMA read or written in less than 4 clock cycles."
It could actually hit every clock, so 32 bits could be moved at a 160MHz rate!
"Of course, having an NRZ/NRZI SERDES solves many serial communications issues that even DMA can't solve :)"
I've noticed the jitter to be within a nanosecond, which is 1/8 of a pixel at a 125MHz pixel rate - not noticeable on a CRT, but some LCDs (which all use internal PLLs to sync) exhibit occasional pixel-boundary jitter. A monitor with a good PLL system always looks rock-solid, though. Some monitors are better than others in this regard.
I don't mean jitter, I'm referring to a fixed timing offset between plls...
But, maybe I did something wrong... Now that I think about it again, the cursor wouldn't look so good if the PLLs weren't pretty well sync'd.
That four-port cell represents quite a commitment, silicon-wise! How does it compare, in physical area, to the Prop I cell that has a larger feature size?
Does that look right? What's missing, of course, is context-switching. As a concession to software multi-threaders, would it be possible to include two sets of condition bits that could be swapped in one instruction? I think most of us would be satisfied with two. More than two threads, emulated in this fashion, would bog down. This kind of context-swapping could come in handy elsewhere, too, when you want to call a subroutine, say, that sets condition bits, while preserving those bits in the calling program.
Thanks,
Phil
Wow! I never thought of doing it that way. It's quite a brain-bender. I think it would work. It needs to be diagrammed. Branches within the single-instruction-at-a-time threads wouldn't work, though, right? About the condition bits, I'm having a mental block on how to save/restore these.
I'm getting the feeling that very fine granularity is paramount to your multi-threading. There is always going to be the possibility that if each thread had to do a RDxxxx/WRxxxx at the 'same' time, and let's say there were four threads, the last guy would get delayed by 24+ clocks (3*8). That's no worse than what JMPRETD could provide in the same circumstance. Plus, with JMPRETD, you can get bursts of deterministic timing in short runs of code before you must JMPRETD to the next guy.
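The "24+ clocks (3*8)" figure above can be reproduced with a very small model. This is only a sketch of the arithmetic: it assumes each hub access costs a fixed 8-clock window (the 8 comes from the post), that the threads' hub accesses are serviced strictly one after another, and the function name is made up for illustration.

```python
def min_hub_wait_clocks(n_threads: int, hub_window_clocks: int = 8) -> int:
    """Lower bound on the extra delay seen by the last of n_threads that all
    request a hub access (RDxxxx/WRxxxx) at the 'same' time.

    Assumption: every earlier thread's access occupies one full hub window,
    so the last thread waits behind (n_threads - 1) windows.
    """
    if n_threads < 1:
        raise ValueError("need at least one thread")
    return (n_threads - 1) * hub_window_clocks


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, "threads ->", min_hub_wait_clocks(n), "clocks (at least)")
```

The result is a floor, not an exact figure, which matches the "+" in "24+": any misalignment with the hub window only adds to it.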
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 8/29/2008 7:10:28 PM GMT
I don't mean jitter, I'm referring to a fixed timing offset between plls...
I was thinking about multiple cogs starting up their internal CTRs and PLLs at the same time.
For syncing CTRA and CTRB PLLs within a single cog, you'd have to preload the second-to-be-configured CTR's PHS register with a value that jump-starts it to match the first CTR at the time it is to be configured.
I take back what I said... Upon further reflection, the cursor in XGA, SXGA mode wouldn't look as good as it does if the plls weren't sync'd very well. I must have done something wrong when I tried it for my 6-bit vga driver...
A branch in the emulated code would just be something like MOV PC0,#target_addr, similar to the LMM. This works, since the increment has already taken place by then and won't muck it up.
Wow! I never thought of doing it that way. It's quite a brain-bender. I think it would work. It needs to be diagrammed. Branches within the single-instruction-at-a-time threads wouldn't work, though, right? About the condition bits, I'm having a mental block on how to save/restore these.
What about making the condition bits accessible as a register? Then software can save/restore them if needed.
Phil,
"That four-port cell represents quite a commitment, silicon-wise! How does it compare, in physical area, to the Prop I cell that has a larger feature size?"
I looked at the database for the Prop I, and there isn't a four-port cell... at least not that I saw. The 6T memory is used for both the MEM_RAM and MEM_COG.
The 6T cell for the Prop I measures 6.85um x 4.8um, for a total area of 32.88um^2.
The 6T cell for the Prop II measures 2.73um x 1.95um, for a total area of 5.3235um^2 ... about 1/6th of the silicon.
The four-port cell in the Prop II measures 2.885um x 7.770um, for a total area of 22.4165um^2 ... still less silicon than the 6T in the Prop I.
Attached is a side by side comparison in the same scale view of the 3R1W and 6T memory cells.
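For anyone who wants to check the area figures, here is a small sketch that recomputes them from the quoted cell dimensions. The dimensions are taken from the post; the dictionary keys and function name are just labels chosen for this example.

```python
# Cell dimensions in micrometres (width, height), as quoted in the post.
CELLS_UM = {
    "Prop I 6T": (6.85, 4.8),
    "Prop II 6T": (2.73, 1.95),
    "Prop II 3R1W": (2.885, 7.770),
}


def cell_areas_um2(cells=CELLS_UM):
    """Return a dict of cell name -> area in square micrometres."""
    return {name: w * h for name, (w, h) in cells.items()}


if __name__ == "__main__":
    areas = cell_areas_um2()
    base = areas["Prop I 6T"]
    for name, area in areas.items():
        # Ratio is relative to the Prop I 6T cell.
        print(f"{name:13s} {area:8.4f} um^2  ({area / base:.3f} x Prop I 6T)")
```

By this arithmetic the Prop II 6T cell comes out at roughly 0.16 of the Prop I 6T area (the "1/6th" above), and the four-port cell at roughly 0.68 of it.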
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
Hey, why is there so much focus on executing LMM code at a certain address, when you could just copy the instruction into a register that you're about to execute? You've got six instructions available between RDLONGs, so could you not perform the housekeeping within that period?
          '(you've got three more instructions before another RDLONG)
You could use indirection, too, to make this thing loop.
Maybe the indirect addressing should be augmented so that addresses within some range get converted to another range through bit substitution. This way, each LMM thread could address the same locations, but be accessing separate windows. For example, you could set it up so that if either S or D registers were ranging from $000..$00F, you would mux a variable value into S/D bits 7..4. That way, thread0 could address $000..$00F, thread1 could address $010..$01F, thread2 could address $020..$02F, etc. - all while being coded to $000..$00F. This way, you could run multiple instances of the same LMM code. This would be something VERY easy and small to add to the cog.
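The window idea above can be sketched as a plain address-mapping function. This is only an illustration of the bit substitution as described, not a statement of how the hardware would do it: the function name, the 4-bit window selector, and the 9-bit register address range are assumptions made for the example.

```python
WINDOW_SIZE = 0x010      # addresses $000..$00F are the remapped range
COG_ADDR_MASK = 0x1FF    # assumed 9-bit cog register address


def remap_register(addr: int, window: int) -> int:
    """Map a coded S/D register address to the address actually accessed.

    Addresses in $000..$00F get the 4-bit `window` value muxed into
    bits 7..4; every other address passes through unchanged.
    """
    if not 0 <= addr <= COG_ADDR_MASK:
        raise ValueError("address outside cog register range")
    if not 0 <= window <= 0xF:
        raise ValueError("window selector must fit in 4 bits")
    if addr < WINDOW_SIZE:
        return (window << 4) | addr
    return addr


if __name__ == "__main__":
    for thread in range(3):
        lo = remap_register(0x000, thread)
        hi = remap_register(0x00F, thread)
        print(f"thread{thread}: ${lo:03X}..${hi:03X}")
```

With the selector set to the thread number, the same LMM code coded against $000..$00F lands in $000..$00F, $010..$01F, $020..$02F, and so on.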
Phil, I'm not sure you want to combine the save/restore into a switch. If you are writing a small multi-threading kernel, you need to save the old thread's context and restore the next thread's; keeping the two separate allows any number of threads and makes thread contexts easier to manage. If you make it a switch, the thread contexts keep moving: each context moves into the slot of the next running thread's old context, that one moves to the slot after it, and so on. This works if all you want is a pure round-robin switcher, but as soon as you add some decision logic, finding a thread's context means another level of indirection, which loses the switch's efficiency.
BTW, given that the non-hub and non-0001xx opcodes are used up in the Prop I, where did you manage to squeeze in the JMPRETD family?
-Phil
It went where ONES was slated to go: %000111. We are using %000110 for everything else!
Sorry I missed your point about the swap-flags instruction. That's a much better concept than read- and write-flags instructions:
SWAPZC   D   'swap Z and C with bits 1 and 0 in D
I am going to add this, for sure. It's very elegant. How about augmenting its functionality so that it rotates the bits down by two, acting as a FIFO? That way, N threads can be chained with one instruction. Otherwise, SWAPZC is just a ping-pong mechanism.
Ah ha! It can be coded like this:
SWAPZC   D,[n]   'swap Z and C with FIFO bits in D; n specifies the number of bit-pairs.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 8/29/2008 7:53:04 PM GMT
Chip Gracey said...
Hey, why is there so much focus on executing LMM code at a certain address, when you could just copy the instruction into a register that you're about to execute?
I was thinking more about emulating multi-threaded code that's already resident in the cog. But, yes, your technique would work great for LMM code!
Can [n] be a register, or just a constant? I want a thread id (assume ids are even numbers) in a register. Then, for up to 16 threads, I can save/restore context based on the thread id register, and double the thread id for the jmpret offset. And I will be able to switch between arbitrary threads just by setting the thread id and then restoring the context.
Timmoore said...
Can [n] be a register, or just a constant? I want a thread id (assume ids are even numbers) in a register. Then, for up to 16 threads, I can save/restore context based on the thread id register, and double the thread id for the jmpret offset. And I will be able to switch between arbitrary threads just by setting the thread id and then restoring the context.
It could use either a constant (0..15) or a register (4 lsb's). This takes some thinking about...
get A flags
exe A inst
save A flags
(save A, get B, rotate) %ddccbbaa -> %AAddccbb
get B flags
exe B inst
save B flags
(save B, get C, rotate) %AAddccbb -> %BBAAddcc
get C flags
exe C inst
save C flags
(save C, get D, rotate) %BBAAddcc -> %CCBBAAdd
get D flags
exe D inst
save D flags
(save D, get A, rotate) %CCBBAAdd -> %DDCCBBAA
So, the SWAPZC instruction must read flags from bits[1..0], shift right by 2 bits, and save flags to bits[n*2+1..n*2].
When a thread starts or dies, you change n at that time to open a new bit set, or clip off an old one. This is almost nothing to implement.
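To make the rotation trace easier to follow, here is a rough software model of the proposed FIFO behaviour. It is a sketch under assumptions, not a description of the real instruction: the function name is invented, `n_pairs` is taken to be the count of bit-pairs in use (the post's n may be numbered differently), and the step order (shift, store the running thread's flags in the top pair, then load the new bottom pair) was chosen simply because it reproduces the trace above.

```python
def swapzc_fifo(d: int, zc: int, n_pairs: int):
    """Model one FIFO-style flag swap.

    d       -- register holding n_pairs two-bit Z/C pairs
    zc      -- the running thread's current Z/C bits (0..3)
    n_pairs -- number of bit-pairs in use (assumed 1..16)

    Returns (new_d, new_zc): the rotated register and the flags
    for the thread that runs next.
    """
    if not 1 <= n_pairs <= 16:
        raise ValueError("n_pairs must be 1..16")
    mask = (1 << (2 * n_pairs)) - 1
    d &= mask
    # Drop the stale bottom pair, then park the running thread's flags on top.
    d = (d >> 2) | ((zc & 0b11) << (2 * (n_pairs - 1)))
    # The next thread's flags are now in bits 1..0.
    return d, d & 0b11


if __name__ == "__main__":
    d, zc = 0b11_10_01_00, 0b00   # pairs dd=11 cc=10 bb=01 aa=00, thread A running
    for name in "ABCD":
        d, zc = swapzc_fifo(d, zc, 4)
        print(f"after {name}: d=%{d:08b}  next flags=%{zc:02b}")
```

If no thread changes its flags, four swaps bring the register back to its starting value, which is the round trip the trace shows.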
@Chip, if it in fact doesn't take too much real estate to do and doesn't take things away from other stuff, I'd say go for it. The TackXXX should be sufficient for some people's needs. Why Tack though? Why not Thread?
ImageCraft said...
@Chip, if it in fact doesn't take too much real estate to do and doesn't take things away from other stuff, I'd say go for it. The TackXXX should be sufficient for some people's needs. Why Tack though? Why not Thread?
Tack spells better with +new and +end, plus TACKNEW and TACKEND are under 8 characters (a tab stop).
Maybe I'm missing something, but what exactly does [<x>] mean?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I am 1011, so be surprised!
Advertisement sponsored by dfletch:
Come and join us on the Propeller IRC channel for fast and easy help!
Channel: #propeller
Server: irc.freenode.net or freenode.net
If you don't want to bother installing an IRC client, use Mibbit. www.mibbit.com
Comments
COG DMA should have no such hub constraints, though, and should be able to sample pretty darn fast, right? Looking at the counter modules, they can be incremented at clock speed, so I assume that with a Verilog/VHDL hardware state-machine assist and some arbitration if necessary, words can be DMA read or written in less than 4 clock cycles.
Of course, having an NRZ/NRZI SERDES solves many serial communications issues that even DMA can't solve :)
Good luck with your project.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
If I understand the JMPD correctly, the following could be used to emulate two threads, assuming they don't include JMP or WAIT instructions:
Does that look right? What's missing, of course, is context-switching. As a concession to software multi-threaders, would it be possible to include two sets of condition bits that could be swapped in one instruction? I think most of us would be satisfied with two. More than two threads, emulated in this fashion, would bog down. This kind of context-swapping could come in handy elsewhere, too, when you want to call a subroutine, say, that sets condition bits, while preserving those bits in the calling program.
Thanks,
Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
Chip Gracey
Parallax, Inc.
Phil,
The '3R1W_BIT_CELL.png' functions as a "...four-port memory..." that performs "... three reads and one write".
Compare it to a standard 6T memory cell, which we also use: '6T_BIT_CELL.png'.
Our 6T memory cell is comparable to TSMC's 130nm process if it were scaled up to 180nm.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
Post Edited (Beau Schwabe (Parallax)) : 8/29/2008 7:38:16 PM GMT
But, maybe I did something wrong... Now that I think about it again, the cursor wouldn't look so good if the PLLs weren't pretty well sync'd.
Post Edited (Rayman) : 8/29/2008 7:07:43 PM GMT
That four-port cell represents quite a commitment, silicon-wise! How does it compare, in physical area, to the Prop I cell that has a larger feature size?
-Phil
For syncing CTRA and CTRB PLLs within a single cog, you'd have to preload the second-to-be-configured CTR's PHS register with a value that jump-starts it to match the first CTR at the time it is to be configured.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
A branch in the emulated code would just be something like MOV PC0,#target_addr, similar to the LMM. This works, since the increment has already taken place by then and won't muck it up.
-Phil
loop      rdlong  inst0,pc0  'get thread0 instruction
          getzc   zc0        'get thread0 flags
inst0     nop                'LMM instruction gets put here
          setzc   zc0        'set thread0 flags
          '(you've got three more instructions before another RDLONG)
You could use indirection, too, to make this thing loop.
Maybe the indirect addressing should be augmented so that addresses within some range get converted to another range through bit substitution. This way, each LMM thread could address the same locations, but be accessing separate windows. For example, you could set it up so that if either S or D registers were ranging from $000..$00F, you would mux a variable value into S/D bits 7..4. That way, thread0 could address $000..$00F, thread1 could address $010..$01F, thread2 could address $020..$02F, etc. - all while being coded to $000..$00F. This way, you could run multiple instances of the same LMM code. This would be something VERY easy and small to add to the cog.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
The condition bits are already accessible, in a way. They can be saved in two instructions and restored in one (assuming save is initialized to zero):
I was hoping for a way just to switch between two sets in one instruction.
-Phil
I'm no LMM programmer, but Chip's 'find' seems to be a big deal for LMM, and no-one appears to have mentioned it! (Just wanted to make sure it didn't get missed )
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.com
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again
BTW: I type as I'm thinking, so please don't take any offence at my writing style
Nope, didn't miss it! 'Just focused elsewhere for the moment and not multitasking very well.
In fact, for one or two threads, you could also do RDLONG inst0,PTRA[ 1++] to auto-increment the PC in PTRA.
-Phil
Suppose a COG had fast serial IN/OUT.
Is there room in your design to run LMM directly from serial IN?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha