i was wondering if one p2 cog at max speed can outrun 8 p1 cogs....the p1 is 20mhz...times 8...160 mhz....but the beginning of this thread says that it can get to almost 400 mhz....or am i just wrong
could i conceivably run all my present code in a multi-threaded cog?
Another question directed to others - What would be the best way to interface the smart pin DAC and ADC to the maximum usable MHz frequency of the P2 chip?
ADCs have new modes, and I've not seen a full shakeout on those yet. They are more modest in samples per second, and bits.
DACs can operate from streamers, & they are video-speed in performance.
I've not seen a Direct Digital Synthesis design go past on P2 yet, but it should be quite good there: good DACs, good bandwidth and good maths.
Inbuilt DACs would support DDS to MHz regions, but with more limited bit-precision.
Adding low cost audio DACs (or codecs) should allow 20+ bits of DAC precision, on I2S links for high grade audio DDS. Maybe CORDIC can do sine on the fly for that?
LCSC have 16-bit dual DACs like the TM8211 at $0.0811 each for 150+, and the higher-spec 16~24-bit audio CS4344-CZZR in TSSOP10 at $0.3227 each for 100+.
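For anyone who hasn't met DDS before, the core is just a wrapping phase accumulator feeding a sine lookup. Here is a rough host-side sketch in Python, not P2 code. The 32-bit accumulator width, the names `tuning_word` and `dds_samples`, and `math.sin` standing in for a CORDIC rotation are all my assumptions for illustration.

```python
import math

ACC_BITS = 32                      # assumed accumulator width
ACC_MASK = (1 << ACC_BITS) - 1

def tuning_word(f_out, f_sample):
    """Phase increment per sample for the requested output frequency."""
    return round(f_out * (1 << ACC_BITS) / f_sample) & ACC_MASK

def dds_samples(f_out, f_sample, count, amplitude=1.0):
    """Generate `count` sine samples from a wrapping phase accumulator."""
    ftw = tuning_word(f_out, f_sample)
    phase = 0
    samples = []
    for _ in range(count):
        angle = 2.0 * math.pi * phase / (1 << ACC_BITS)
        samples.append(amplitude * math.sin(angle))  # math.sin stands in for a CORDIC lookup
        phase = (phase + ftw) & ACC_MASK             # wrap at 2**32, like a 32-bit register
    return samples
```

Output frequency resolution is f_sample / 2**32 per tuning-word step, which is why DDS is attractive even when the DAC itself has limited bit precision.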
The other thing is that P2 has a lot of new instructions that save time when doing stuff so "effective" MIPS is even higher, when compared to P1
The big ones are the multiplication ones. P1 needs 48+(?) instructions (=192 cycles) for a 16 bit multiplication (unless you unroll it, "wasting" a bunch of cog RAM). P2 needs just one 2-cycle instruction.
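To make the cost concrete, this is the shift-and-add loop a chip without a hardware multiplier has to grind through, sketched in Python rather than PASM. The function name `mul16_shift_add` is made up for the example. A few instructions per pass over 16 passes would account for a figure like 48+, though that is my reading and not something stated above.

```python
def mul16_shift_add(a, b):
    """Unsigned 16 x 16 -> 32-bit multiply, examining one bit of b per pass."""
    a &= 0xFFFF
    b &= 0xFFFF
    product = 0
    for _ in range(16):        # one pass per multiplier bit
        if b & 1:              # low bit set: add the shifted multiplicand
            product += a
        a <<= 1                # multiplicand moves up one bit position
        b >>= 1                # next multiplier bit moves into bit 0
    return product
```

On P2 that whole loop collapses into the single multiply instruction mentioned above.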
Memory would still be a constraint. While execution from HUBRAM is supported, execution speed there depends heavily on code structure.
To balance that there are the new instructions and execution concepts like SKIP, SKIPF, and EXECF that allow major reductions in code footprint for routines with common elements, and savings in execution time beyond simply conditional execution.
The event and interrupt mechanism that has been introduced has been given careful thought to make it useful within the Propeller mindset.
There are many other differences to consider too, but in broad terms, the bigger benefit you get with the p2 is fitting drivers into a single cog that would have required 3 or 4 cogs on p1. If nothing else you save on the co-ordination efforts required to get the cogs working together.
@rogloh and @garryj have demonstrated this nicely with their display and USB drivers (respectively).
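As a rough illustration of the SKIP idea, here is a small Python model, not P2 code. It assumes the pattern is consumed from bit 0 upward, one bit per following instruction, with a 1 bit meaning "skip this one". The routine and instruction names are hypothetical.

```python
def run_with_skip(instructions, pattern):
    """Return the instructions that execute under a SKIP-style bit pattern.

    Bit 0 of `pattern` applies to the first instruction after the SKIP,
    bit 1 to the next, and so on; a 1 bit means that instruction is skipped.
    """
    executed = []
    for index, name in enumerate(instructions):
        if (pattern >> index) & 1 == 0:
            executed.append(name)
    return executed

# Hypothetical shared body: two variants reuse the same five instructions.
body = ["load_a", "load_b", "add", "sub", "store"]
print(run_with_skip(body, 0b01010))   # runs load_a, add, store
print(run_with_skip(body, 0b00110))   # runs load_a, sub, store
```

One copy of the body serves several routines, which is where the footprint saving comes from.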
No. For starters you'd run out of memory if you're running the code in COG (or if it uses a lot of COG memory registers) and tried to fit 8 copies in! If the code is in HUB memory then it'll probably fit. For typical C code running in HUB I find the P2 runs at around 1.5x to 2x the speed of P1 at the same clock frequency. You can clock the P2 quite a bit higher (probably 3x is feasible), giving a speedup of 4.5x to 6x over P1.
For certain specialized purposes the P2 will be even faster than that (e.g. using smart pins, or with code that fits entirely in a COG). Conversely there may be some cases where the P2 won't be much faster than P1, although those will be extremely rare I think.
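Spelling out that arithmetic against the original question, as a quick Python sketch. The function names are made up, and "eight busy P1 cogs equals 8x one P1 cog" is a simplifying assumption that ignores memory and hub contention entirely.

```python
def speedup_range(per_clock_gain=(1.5, 2.0), clock_ratio=3.0):
    """Combine the per-clock gain with the clock ratio quoted above."""
    low, high = per_clock_gain
    return low * clock_ratio, high * clock_ratio

def replaces_p1_cogs(p1_cogs, per_clock_gain=(1.5, 2.0), clock_ratio=3.0):
    """True only if even the low estimate covers `p1_cogs` fully loaded P1 cogs."""
    low, _ = speedup_range(per_clock_gain, clock_ratio)
    return low >= p1_cogs

print(speedup_range())        # (4.5, 6.0)
print(replaces_p1_cogs(8))    # False: 4.5x to 6x falls short of 8 busy cogs
print(replaces_p1_cogs(4))    # True
```

So on raw throughput for hub-resident C code, one P2 cog lands somewhere between four and six P1 cogs by these estimates.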
Roger is proving how much can fit in a cog with that kitchen sink of video drivers!
LOL, depending on the smartpins colliding with parallel streamer outputs and how that works now in rev B, I think I might be able to squeeze in support for some CLK/DE parallel bus LCDs and those legacy EGA/CGA TTL ones too at some point. That'd be nice. More coverage.
'
' Set PLL
'
dat             org
                hubset  ##%1_000000_0000001111_1111_01_00       'alter
                waitx   ##20_000_000/100
                hubset  ##%1_000000_0000001111_1111_01_11       'alter
'
' Launch n+1 cogs
'
.loop           coginit n,#@pgm         'launch cogs 7..0
                djnf    n,#.loop        'last iteration relaunches cog 0
n               long    7               'set to 0, 1, 3, or 7
'
' Program that runs in each cog
'
                org
pgm             cogid   x
                add     x,#56
.loop           drvnot  x
                jmp     #.loop
x               res     1
I don't know how fast the new silicon can run because it keeps up with the PLL as it maxes out around 390MHz at room temperature. I hit it with freeze spray and the frequency climbed to 435MHz! I couldn't get it any colder than that.
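For anyone trying to read the two HUBSET values in that test program, here is a small Python decoder. The field layout (E at bit 24, six divider bits, ten multiplier bits, four post-divider bits, then CC and SS) is my understanding of the P2 clock-mode word and should be checked against the current docs. The 20 MHz input is also an assumption, as is the function name.

```python
def decode_clock_mode(mode, xi_hz=20_000_000):
    """Decode a HUBSET clock-mode value, assuming %E_DDDDDD_MMMMMMMMMM_PPPP_CC_SS."""
    pll_on = (mode >> 24) & 1
    divider = ((mode >> 18) & 0x3F) + 1        # XI divider, 1..64
    multiplier = ((mode >> 8) & 0x3FF) + 1     # VCO multiplier, 1..1024
    pppp = (mode >> 4) & 0xF
    post_div = 1 if pppp == 15 else (pppp + 1) * 2
    vco_hz = xi_hz * multiplier // divider
    return {
        "pll_on": bool(pll_on),
        "vco_hz": vco_hz,
        "pll_hz": vco_hz // post_div,
        "source": mode & 0b11,                 # assumed: 0 = RCFAST, 3 = PLL
    }

first = decode_clock_mode(0b1_000000_0000001111_1111_01_00)
second = decode_clock_mode(0b1_000000_0000001111_1111_01_11)
print(first["pll_hz"], first["source"], second["source"])
```

Under those assumptions the values as posted work out to 320 MHz, with the first HUBSET starting the PLL while still on the internal oscillator and the second switching over to it. The 'alter comments suggest the multiplier field is what gets edited to push the frequency higher.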
Cold spray should be about -51C. You could find a little metal container and use thermally conductive epoxy to bond it to the device, then get some dry ice from the supermarket and put chips of dry ice in acetone in it. That should reach -78.5C.
Then take the rest of the dry ice, and powder it in a food processor. Take your favorite ice cream recipe and put it in a stand mixer on high with the whisk. Sprinkle the powdered dry ice into the mix a heaping tablespoon at a time until the ice cream sets up. Then let it rest in the freezer an hour. It makes the smoothest ice cream you have ever seen.
(You have to do SOMETHING with the extra dry ice, right?)
i only have one main loop in spin....the rest is in pasm on prop 1....i cant "push" encoder reads to the main loop...but there are smart pins on prop2
the next fastest loop is the circle interpolator....it is at 50hz to 80 hz...cant get to 100 on p1...funny...is fast enough but big....i have maybe 4 instructions of play room
after that a mover that compares machine state to desired machine state....this thing also checks e stops because it can stop all movement....its fast already it only "compares"
except for a three axis pid....running nine threads......but its short! multiply heavy tho.
add to that p2 can multiply!
u will be able to run a whole 3-axis cnc from a single cog using lmm.......its already in an eeprom
Chip,
Another discrepancy with the docs. In the COGINIT section:
"In each case of COGINIT, the last SETQ value is written into the target cog's PTRA register."
That's worded as if PTRA is always filled with the value from the Q register, irrespective of the presence of a prefixed (immediately preceding) SETQ instruction. Testing has proven that PTRA is not filled unless the SETQ is placed as a prefix.
Evanh, thanks for noticing this. I'll get this straightened out this morning.
I see that the P2 Eval RevB board is now on sale for 20% off! It's time for all of you who were on the fence about P2 development to buy in. Only $120!
I found a bug today in the silicon. Not a showstopper, but something to be aware of...
KNOWN BUGS (new section in Google Doc)
Intervening ALTx/AUGS/AUGD instructions between SETQ/SETQ2 and RDLONG/WRLONG/WMLONG-PTRx instructions will cancel the special-case block-size PTRx deltas. The anticipated number of longs will transfer, but PTRx will only be modified according to normal PTRx behavior:
                setq    #16-1           'ready to load 16 longs
                altd    start_reg       'alter start reg (ALTD cancels block-size PTRx deltas)
                rdlong  0,ptra++        'ptra will only be incremented by 4, not 16*4, as anticipated!!!
If I had realized this potential problem, a simple signal-name substitution in the Verilog code would have fixed it.
I've got my reasons. It's part of the Spin2 interpreter's inline PASM feature. You can load code into $000..$167 and execute it. That code sequence is for loading PASM code of some length, starting at some register, executing it, and then resuming bytecode execution from where the PASM binary left off. I needed PTRA to stay current.
Here is how this interpreter code looks now:
'
'
' a: In-line PASM
' b: REGEXEC(hubadr)
' c: REGLOAD(hubadr)
' d: CALL(anyadr)
'
inline_pasm             setq    #16-1                   'a         load local variables from hub into buff
                        rdlong  buff,dbase              'a
                        bith    v,#31                   'a         set flag to later restore local variables to hub
                        mov     ptrb,pb                 'a         get bytecode ptr into ptrb
                        skip    ##%11100100000111       'a         x2 begin inline_pasm skip pattern
regexec_                skip    ##%1111000000           '| b       x2 begin REGEXEC skip pattern
regload_                mov     ptrb,x                  '| b c     get hubadr into ptrb
                        rdword  w,ptrb++                'a b c     read start register
                        rdword  y,ptrb++                'a b c     read length of pasm code, minus 1
                        setq    y                       'a b c     read in code
                        altd    w                       'a b c
                        rdlong  0,ptrb++                'a b c     altd causes ptrb++ to inc by 1*4, not by (y+1)*4
        _ret_           popa    x                       '| | c     REGLOAD done, pop stack
                        shl     y,#2                    'a |       update bytecode ptr for inline_pasm
                        add     y,ptrb                  'a |
call_pasm               mov     w,x                     '| |   d   get CALL address
                        popa    x                       '| b   d   pop stack
                        mov     y,pb                    '| b   d   save bytecode ptr
                        mov     z,ptra                  'a b   d   save ptra
                        call    w                       'a b   d   call pasm code (can use pa/pb/ptra/ptrb/stack)
                        testb   v,#31           wc      'a b   d   if inline_pasm, restore local variables to hub
        if_c            setq    #16-1                   'a b   d
        if_c            wrlong  buff,dbase              'a b   d
                        mov     ptra,z                  'a b   d   restore ptra
        _ret_           mov     pb,y                    'a b   d   restore bytecode ptr
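The pointer fix-up in the inline_pasm path (shl y,#2 followed by add y,ptrb) can be checked with a little arithmetic. This is a Python model of my reading of the code above, not something from the docs. The function names are made up, and it only models the pointer from just before the block RDLONG.

```python
def ptr_after_block_read(ptr, longs, alt_intervenes):
    """Model PTRx++ after a SETQ block RDLONG of `longs` longs.

    With an ALTx between SETQ and RDLONG, the known bug leaves the pointer
    stepping by a single long (4 bytes) instead of the whole block.
    """
    step = 4 if alt_intervenes else 4 * longs
    return ptr + step

def inline_pasm_resume_ptr(ptrb, y):
    """Mirror the interpreter's fix-up: y holds length-1; shl y,#2 then add y,ptrb."""
    ptrb_after = ptr_after_block_read(ptrb, y + 1, alt_intervenes=True)
    return (y << 2) + ptrb_after

# The fix-up lands where an unaffected block read would have left the pointer.
print(hex(inline_pasm_resume_ptr(0x1000, 15)))             # 0x1040
print(hex(ptr_after_block_read(0x1000, 16, False)))        # 0x1040
```

In other words, y*4 plus the single-long increment makes up the full (y+1)*4 that the block read would otherwise have applied.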
ALTx and SETQ work in quite different ways. ALTx actively modifies the following instruction in the pipeline, no matter what instruction that might be. SETQ is much more benign, it just fills the hidden Q register and, presumably, sets a flag to say it has done so. It is then up to subsequent instructions to make use of what Q holds.
That Q flag is the messy part. For most op-codes, the default behaviour will have an auto-reset of the flag; with some of them making use of Q at the same time. But certain instructions like AUGx/ALTx will leave the flag set so that Q stays primed. Not too dissimilar to the interrupt blocking mechanism.
PS: And then there is MUXQ which uses Q irrespective of the state of the flag.
PPS: SETQ2 will be filling the same Q register but setting a different flag that only RDLONG/WRLONG act on.
That's one term I'd not known. The only one similar I sort of knew was just processor architecture. But of course that has quite a broad coverage.
Anyway, I went and looked it up and of course Wikipedia has an entry. And maybe not surprisingly, it's somewhat pointedly written. The final sentence is this:
"Unfortunately, the terminology around such programming models tends to focus on the details of the hardware that inspired the execution model, and in that insular world the mistaken belief is formed that a programming model is only for the case when an execution model is closely matched to hardware features."
Comments
P2 can easily do 250 MHz and only needs 2 clocks per instruction, so 125 MIPS...
Kind regards, Samuel Lourenço
Wait until we get into signal processing using the CORDIC functions. Then, performance will be >50 fold, compared to the P1.
https://www.parallax.com/product/64000-es
I accidentally clicked a cogwheel icon that unstuck it for me. Is there a way to restick it?
Try refreshing the browser... I just clicked a few buttons that might do the trick.
If not, maybe find the topic and click the cog again and see what options you have. Maybe it can be re-stuck that way?
Mike
Doesn't seem like you can use both...
Docs are explicit on this for altx. Doesn't say this but implies it for setq...
If so, I would appreciate a pointer to it.