New P2 Silicon

mrchillin · 2019-11-19 01:24

I was wondering if one P2 cog at max speed can outrun 8 P1 cogs... the P1 is 20 MHz... times 8... 160 MHz... but the beginning of this thread says that it can get to almost 400 MHz... or am I just wrong?

Could I conceivably run all my present code in a multi-threaded cog?

Rayman · 2019-11-19 01:31

So, P1 is 80 MHz usually, but takes 4 clocks per instruction, so 20 MIPS.

P2 can easily do 250 MHz and only needs 2 clocks per instruction, so 125 MIPS...

The other thing is that P2 has a lot of new instructions that save time when doing stuff so "effective" MIPS is even higher, when compared to P1
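Putting numbers on that, as a rough sketch only: the function name is made up for illustration, and the figures are peak rates that ignore hub waits, branches and multi-cycle instructions.

```python
def mips(clock_mhz, clocks_per_instruction):
    """Peak instruction rate of one cog, in millions of instructions per second."""
    return clock_mhz / clocks_per_instruction

p1_cog = mips(80, 4)     # one P1 cog at 80 MHz
p2_cog = mips(250, 2)    # one P2 cog at 250 MHz
p1_chip = 8 * p1_cog     # all eight P1 cogs running flat out

print(p1_cog, p2_cog, p1_chip)   # 20.0 125.0 160.0
print(p2_cog / p1_cog)           # 6.25
```

So on raw instruction count alone, one P2 cog at 250 MHz is about 6x a P1 cog, which is still a little short of eight P1 cogs combined.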

jmg · 2019-11-19 01:35

PropGuy2 wrote: »

Another question directed to others - What would be the best way to interface the smart pin DAC and ADC to the maximum usable MHz frequency of the P2 chip?

ADCs have new modes, and I've not seen a full shakeout on those yet. They are more modest in samples per second, and in bits.
DACs can operate from streamers, and they are video-speed in performance.

I've not seen a Direct Digital Synthesis design go past on P2 yet, but it should be quite good there:

Good DACs, good BW and good maths...
Inbuilt DACs would support DDS into the MHz region, but with more limited bit precision.

Adding low-cost audio DACs (or codecs) should allow 20+ bits of DAC precision, on I2S links, for high-grade audio DDS. Maybe CORDIC can do sine on the fly for that?
LCSC have 16-bit dual DACs like the TM8211 at $0.0811 (150+), and higher-spec 16~24-bit audio parts like the CS4344-CZZR in TSSOP10 at $0.3227 (100+).
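For readers who haven't met DDS: the core is a phase accumulator that advances by a fixed tuning word every sample clock, with the top of the phase mapped to a sine value. The sketch below is a plain Python model, not P2 code. The 32-bit accumulator, the 8-bit output width and all the names are assumptions made for illustration, and `math.sin` merely stands in for whatever the hardware would use.

```python
import math

ACC_BITS = 32                      # assumed accumulator width
ACC_MASK = (1 << ACC_BITS) - 1

def tuning_word(f_out_hz, f_clk_hz):
    """Phase increment per sample clock for the requested output frequency."""
    return round(f_out_hz * (1 << ACC_BITS) / f_clk_hz)

def dds_samples(f_out_hz, f_clk_hz, count, dac_bits=8):
    """Generate `count` unsigned DAC codes of a sine wave (dac_bits is assumed)."""
    step = tuning_word(f_out_hz, f_clk_hz)
    full_scale = (1 << dac_bits) - 1
    phase = 0
    codes = []
    for _ in range(count):
        angle = 2 * math.pi * phase / (1 << ACC_BITS)
        codes.append(round((math.sin(angle) + 1) / 2 * full_scale))
        phase = (phase + step) & ACC_MASK   # wrap like a hardware register
    return codes

step = tuning_word(1_000_000, 250_000_000)      # 1 MHz out, 250 MHz sample clock
one_cycle = dds_samples(1_000_000, 250_000_000, 250)
```

Frequency resolution comes from the accumulator width, while amplitude precision is set by the DAC, which is why an external audio DAC helps for high-grade audio but not for frequency agility.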

Wuerfel_21 · 2019-11-19 01:39

Rayman wrote: »

The other thing is that P2 has a lot of new instructions that save time when doing stuff so "effective" MIPS is even higher, when compared to P1

The big ones are the multiplication ones. P1 needs 48+(?) instructions (=192 cycles) for a 16 bit multiplication (unless you unroll it, "wasting" a bunch of cog RAM). P2 needs just one 2-cycle instruction.
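For anyone who hasn't written a software multiply, here is a rough Python model of the shift-and-add loop a P1 has to run (it is not PASM, and the function name is only for illustration). It counts loop passes rather than instructions, since the instruction count per pass depends on how the loop is coded.

```python
def mul16_shift_add(a, b):
    """Unsigned 16x16 -> 32-bit multiply, handling one multiplier bit per pass."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
    product = 0
    passes = 0
    for _ in range(16):
        if b & 1:          # low multiplier bit set: add the shifted multiplicand
            product += a
        a <<= 1            # multiplicand moves up one bit position
        b >>= 1            # expose the next multiplier bit
        passes += 1
    return product, passes

result, passes = mul16_shift_add(1234, 5678)
print(result, passes)      # 7006652 16
```

At roughly three instructions per pass, sixteen passes land near the 48 instructions quoted above, which on P2 is replaced by a single MUL.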

AJL · 2019-11-19 01:44

mrchillin wrote: »

I was wondering if one P2 cog at max speed can outrun 8 P1 cogs... the P1 is 20 MHz... times 8... 160 MHz... but the beginning of this thread says that it can get to almost 400 MHz... or am I just wrong?

Could I conceivably run all my present code in a multi-threaded cog?

Memory would still be a constraint. While execution from HUBRAM is supported, execution speed there depends heavily on code structure.

To balance that there are the new instructions and execution concepts like SKIP, SKIPF, and EXECF that allow major reductions in code footprint for routines with common elements, and savings in execution time beyond simply conditional execution.

The event and interrupt mechanism that has been introduced has been given careful thought to make it useful within the Propeller mindset.

There are many other differences to consider too, but in broad terms, the bigger benefit you get with the p2 is fitting drivers into a single cog that would have required 3 or 4 cogs on p1. If nothing else you save on the co-ordination efforts required to get the cogs working together.

@rogloh and @garryj have demonstrated this nicely with their display and USB drivers (respectively).

ersmith · 2019-11-19 02:23

mrchillin wrote: »

I was wondering if one P2 cog at max speed can outrun 8 P1 cogs... the P1 is 20 MHz... times 8... 160 MHz... but the beginning of this thread says that it can get to almost 400 MHz... or am I just wrong?

Could I conceivably run all my present code in a multi-threaded cog?

No. For starters, you'd run out of memory if you're running the code in COG (or if it uses a lot of COG memory registers) and tried to fit 8 copies in! If the code is in HUB memory then it'll probably fit. For typical C code running in HUB I find the P2 runs at around 1.5x to 2x the speed of P1 at the same clock frequency. You can clock the P2 quite a bit higher (probably 3x is feasible), giving an overall speedup of 4.5x to 6x over P1.

For certain specialized purposes the P2 will be even faster than that (e.g. using smart pins, or with code that fits entirely in a COG). Conversely there may be some cases where the P2 won't be much faster than P1, although those will be extremely rare I think.
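The 4.5x to 6x estimate is just the two factors multiplied together. As a quick check (variable names are illustrative, and the comparison against eight cogs assumes the P1 workload was spread perfectly across all of them):

```python
same_clock_speedup = (1.5, 2.0)   # P2 vs P1 for typical C code in HUB, same clock
clock_ratio = 3.0                 # how much faster the P2 can plausibly be clocked

overall = tuple(s * clock_ratio for s in same_clock_speedup)
print(overall)                    # (4.5, 6.0)

p1_cogs_to_replace = 8
print(overall[1] >= p1_cogs_to_replace)   # False
```

Even at the optimistic end, one P2 cog running HUB code falls short of eight fully loaded P1 cogs, before memory limits are considered.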

evanh · 2019-11-19 03:17

Roger is proving how much can fit in a cog with that kitchen sink of video drivers!

rogloh · 2019-11-19 04:08

evanh wrote: »

Roger is proving how much can fit in a cog with that kitchen sink of video drivers!

LOL, depending on the smartpins colliding with parallel streamer outputs and how that works now in rev B, I think I might be able to squeeze in support for some CLK/DE parallel bus LCDs and those legacy EGA/CGA TTL ones too at some point. That'd be nice. More coverage.

MIchael_Michalski · 2019-11-19 07:31

cgracey wrote: »
Here is a link to the documentation for the new silicon:

https://docs.google.com/document/d/1gn6oaT5Ib7CytvlZHacmrSbVBJsD9t_-kmvjd7nUR6o/edit?usp=sharing

Here is a link to the instruction sheet for the new silicon:

https://docs.google.com/spreadsheets/d/1_vJk-Ad569UMwgXTKTdfJkHYHpc1rZwxB-DcIiAZNdk/edit?usp=sharing

Here are some current measurements on the old vs. new silicon:
MHz+cogs vs I@1.8V    P2 v1 (mA)   P2 v2 (mA)   v2/v1
-----------------------------------------------------
20MHz PLL, 1 cog              66           29     44%
20MHz PLL, 2 cogs             69           32     46%
20MHz PLL, 4 cogs             76           37     49%
20MHz PLL, 8 cogs             89           49     55%

40MHz PLL, 1 cog             129           55     43%
40MHz PLL, 2 cogs            136           61     45%
40MHz PLL, 4 cogs            148           72     49%
40MHz PLL, 8 cogs            175           94     54%

80MHz PLL, 1 cog             253          106     42%
80MHz PLL, 2 cogs            266          118     44%
80MHz PLL, 4 cogs            290          141     49%
80MHz PLL, 8 cogs            344          186     54%

160MHz PLL, 1 cog            497          208     42%
160MHz PLL, 2 cogs           521          231     44%
160MHz PLL, 4 cogs           570          275     48%
160MHz PLL, 8 cogs           672          365     54%

320MHz PLL, 1 cog            962          407     42%
320MHz PLL, 2 cogs          1010          455     45%
320MHz PLL, 4 cogs          1104          541     49%
320MHz PLL, 8 cogs          1295          718     55%

The new silicon takes about half the power.

Here's the code that was running in the P2's:
'
' Set PLL
'
dat		org

		hubset	##%1_000000_0000001111_1111_01_00	'alter
		waitx	##20_000_000/100
		hubset	##%1_000000_0000001111_1111_01_11	'alter
'
' Launch n+1 cogs
'
.loop		coginit	n,#@pgm		'launch cogs 7..0
		djnf	n,#.loop	'last iteration relaunches cog 0

n		long	7		'set to 0, 1, 3, or 7
'
' Program that runs in each cog
'
		org

pgm		cogid	x
		add	x,#56

.loop		drvnot	x
		jmp	#.loop


x		res	1

I don't know how fast the new silicon can run because it keeps up with the PLL, which maxes out around 390MHz at room temperature. I hit it with freeze spray and the frequency climbed to 435MHz! I couldn't get it any colder than that.
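The HUBSET constant in the quoted code can be unpacked field by field. The sketch below assumes the clock-mode layout given in the P2 documentation (%E_DDDDDD_MMMMMMMMMM_PPPP_CC_SS, which is how the underscores in the listing are grouped) and a 20 MHz crystal input; both are assumptions of this example, and the function and key names are invented for it.

```python
def decode_hubset(value, xi_hz=20_000_000):
    """Split a P2 clock-mode value into its fields (layout assumed from the P2 docs)."""
    ss = value & 0b11                 # clock source select: %11 = PLL, %00 = RCFAST
    cc = (value >> 2) & 0b11          # crystal/input mode
    pppp = (value >> 4) & 0b1111      # post-divider: %1111 means VCO / 1
    m = (value >> 8) & 0x3FF          # VCO multiplier, minus 1
    d = (value >> 18) & 0x3F          # input divider, minus 1
    pll_on = (value >> 24) & 1
    vco_hz = xi_hz * (m + 1) // (d + 1)
    pll_hz = vco_hz if pppp == 0b1111 else vco_hz // ((pppp + 1) * 2)
    return {"pll_on": pll_on, "div": d + 1, "mul": m + 1,
            "pppp": pppp, "cc": cc, "ss": ss, "pll_hz": pll_hz}

start_pll = decode_hubset(0b1_000000_0000001111_1111_01_00)   # first HUBSET
use_pll = decode_hubset(0b1_000000_0000001111_1111_01_11)     # second HUBSET
print(use_pll["mul"], use_pll["pll_hz"])                      # 16 320000000
```

Read this way, the constant as posted is the 320MHz setup: the first HUBSET starts the PLL while still running from RCFAST, the WAITX gives it time to settle, and the second HUBSET switches the clock source over to the PLL.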

Cold spray should be about -51C. You could find a little metal container and use thermally conductive epoxy to bond it to the device, then get some dry ice from the supermarket and put chips of dry ice in acetone in it. That should reach -78.5C.

Then take the rest of the dry ice and powder it in a food processor. Take your favorite ice cream recipe and put it in a stand mixer on high with the whisk. Sprinkle the powdered dry ice into the mix a heaping tablespoon at a time until the ice cream sets up. Then let it rest in the freezer for an hour. It makes the smoothest ice cream you have ever seen.
(You have to do SOMETHING with the extra dry ice, right?)

mrchillin · 2019-11-19 17:19

I only have one main loop in Spin... the rest is in PASM on the Prop 1... I can't "push" encoder reads to the main loop... but there are smart pins on the Prop 2.

The next fastest loop is the circle interpolator... it runs at 50 Hz to 80 Hz... can't get to 100 on P1... funny... it's fast enough but big... I have maybe 4 instructions of play room.

After that comes a mover that compares machine state to desired machine state... this thing also checks e-stops because it can stop all movement... it's fast already since it only "compares".

Except for a three-axis PID... running nine threads... but it's short! Multiply-heavy though.

Add to that, P2 can multiply!

You will be able to run a whole 3-axis CNC from a single cog using LMM... it's already in an EEPROM.

cgracey · 2019-11-20 04:15

This thread needs to be sticky.

VonSzarvas · 2019-11-20 07:58

cgracey wrote: »

This thread needs to be sticky.

Stuck

evanh · 2019-11-21 10:52

Chip,
Another discrepancy with the docs. In the COGINIT section:

In each case of COGINIT, the last SETQ value is written into the target cog's PTRA register.

That's worded as if PTRA is always filled with the value from the Q register, irrespective of the presence of a prefixed (immediately preceding) SETQ instruction. Testing has proven that PTRA is not filled unless the SETQ is placed as a prefix.

cgracey · 2019-11-21 13:43

evanh wrote: »

Chip,
Another discrepancy with the docs. In the COGINIT section:

In each case of COGINIT, the last SETQ value is written into the target cog's PTRA register.

That's worded as if PTRA is always filled with the value from the Q register, irrespective of the presence of a prefixed (immediately preceding) SETQ instruction. Testing has proven that PTRA is not filled unless the SETQ is placed as a prefix.

Evanh, thanks for noticing this. I'll get this straightened out this morning.

samuell · 2019-11-22 00:25

Rayman wrote: »

So, P1 is 80 MHz usually, but takes 4 clocks per instruction, so 20 MIPS.

P2 can easily do 250 MHz and only needs 2 clocks per instruction, so 125 MIPS...

The other thing is that P2 has a lot of new instructions that save time when doing stuff so "effective" MIPS is even higher, when compared to P1

In fact, I could say that the improvement in speed is almost tenfold at 160MHz, when compared to the P1.

Kind regards, Samuel Lourenço

cgracey · 2019-11-22 01:25

samuell wrote: »

Rayman wrote: »

So, P1 is 80 MHz usually, but takes 4 clocks per instruction, so 20 MIPS.

P2 can easily do 250 MHz and only needs 2 clocks per instruction, so 125 MIPS...

The other thing is that P2 has a lot of new instructions that save time when doing stuff so "effective" MIPS is even higher, when compared to P1

In fact, I could say that the improvement in speed is almost tenfold at 160MHz, when compared to the P1.

Kind regards, Samuel Lourenço

Wait until we get into signal processing using the CORDIC functions. Then, performance will be >50 fold, compared to the P1.

Tubular · 2019-11-22 01:54

Or spin2 that benefits from lean instructions and skips, on top of the hardware speedups

David Betz · 2019-11-26 13:47

I see that the P2 Eval RevB board is now on sale for 20% off! It's time for all of you who were on the fence about P2 development to buy in. Only $120!

https://www.parallax.com/product/64000-es

Martin Hodge · 2019-11-26 18:16

VonSzarvas wrote: »

cgracey wrote: »

This thread needs to be sticky.

Stuck

I accidentally clicked a cogwheel icon that unstuck it for me. Is there a way to restick it?

VonSzarvas · 2019-11-26 18:21

Martin Hodge wrote: »

I accidentally clicked a cogwheel icon that unstuck it for me. Is there a way to restick it?

Try refreshing the browser... I just clicked a few buttons that might do the trick.

If not, maybe find the topic, click the cog again and see what options you have. Maybe it can be re-stuck that way?

evanh · 2019-11-27 01:09

The sticky only applies to the "Propeller 2" forum. It appears as unstuck when viewing "Recent Discussions".

cgracey · 2019-12-13 22:06

I found a bug today in the silicon. Not a showstopper, but something to be aware of...

KNOWN BUGS (new section in Google Doc)

Intervening ALTx/AUGS/AUGD instructions between SETQ/SETQ2 and RDLONG/WRLONG/WMLONG-PTRx instructions will cancel the special-case block-size PTRx deltas. The anticipated number of longs will transfer, but PTRx will only be modified according to normal PTRx behavior:

	setq	#16-1		'ready to load 16 longs
	altd	start_reg	'alter start reg (ALTD cancels block-size PTRx deltas)
	rdlong	0,ptra++	'ptra will only be incremented by 4, not 16*4, as anticipated!!!

If I had realized this potential problem, a simple signal-name substitution in the Verilog code would have fixed it.
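To make the size of the discrepancy concrete, here is a small Python model of the behavior described above. It only mirrors the description in this post (block transfer still happens, but the pointer moves by a single long) and is not derived from the Verilog; the function names are made up.

```python
LONG_BYTES = 4

def ptr_delta_bytes(setq_value, alt_or_aug_between):
    """How far PTRx moves after SETQ + RDLONG/WRLONG/WMLONG with ptrx++."""
    longs = setq_value + 1                # SETQ holds the block length minus 1
    if alt_or_aug_between:
        return LONG_BYTES                 # bug: normal single-long increment only
    return longs * LONG_BYTES             # anticipated block-size increment

def software_fixup(setq_value):
    """Bytes to add by hand after the buggy case to land where expected."""
    return ptr_delta_bytes(setq_value, False) - ptr_delta_bytes(setq_value, True)

print(ptr_delta_bytes(15, False))   # 64
print(ptr_delta_bytes(15, True))    # 4
print(software_fixup(15))           # 60
```

The shortfall is always (block length minus 1) longs, so one shift and one add are enough to correct the pointer in software.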

msrobots · 2019-12-13 22:19

why would you want to do that anyway?

Mike

cgracey · 2019-12-13 22:26

msrobots wrote: »

why would you want to do that anyway?

Mike

I've got my reasons. It's part of the Spin2 interpreter's inline PASM feature. You can load code into $000..$167 and execute it. That code sequence is for loading PASM code of some length, starting at some register, executing it, and then resuming bytecode execution from where the PASM binary left off. I needed PTRA to stay current.

Here is how this interpreter code looks now:

'
'
' a: In-line PASM
' b: REGEXEC(hubadr)
' c: REGLOAD(hubadr)
' d: CALL(anyadr)
'
inline_pasm	setq	#16-1			'a		load local variables from hub into buff
		rdlong	buff,dbase		'a
		bith	v,#31			'a		set flag to later restore local variables to hub

		mov	ptrb,pb			'a		get bytecode ptr into ptrb
		skip	##%11100100000111	'a	x2	begin inline_pasm skip pattern

regexec_	skip	##%1111000000		'| b	x2	begin REGEXEC skip pattern
regload_	mov	ptrb,x			'| b c		get hubadr into ptrb

		rdword	w,ptrb++		'a b c		read start register
		rdword	y,ptrb++		'a b c		read length of pasm code, minus 1

		setq	y			'a b c		read in code
		altd	w			'a b c
		rdlong	0,ptrb++		'a b c		altd causes ptrb++ to inc by 1*4, not by (y+1)*4

	_ret_	popa	x			'| | c		REGLOAD done, pop stack

		shl	y,#2			'a |		update bytecode ptr for inline_pasm
		add	y,ptrb			'a |

call_pasm	mov	w,x			'| |   d	get CALL address
		popa	x			'| b   d	pop stack

		mov	y,pb			'| b   d	save bytecode ptr
		mov	z,ptra			'a b   d	save ptra

		call	w			'a b   d	call pasm code (can use pa/pb/ptra/ptrb/stack)

		testb	v,#31		wc	'a b   d	if inline_pasm, restore local variables to hub
	if_c	setq	#16-1			'a b   d
	if_c	wrlong	buff,dbase		'a b   d

		mov	ptra,z			'a b   d	restore ptra
	_ret_	mov	pb,y			'a b   d	restore bytecode ptr
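The a/b/c/d columns in the listing can be cross-checked against the two SKIP patterns. The Python sketch below assumes that bit 0 of a pattern governs the first instruction slot after the SKIP, that a 1 bit means "skip", and that the ## immediate on regexec_ occupies a slot of its own (the "x2" note in the comments); the slot lists are transcribed from the listing and the names are only for illustration.

```python
# Slots shared by both paths, in program order, up to the CALL.
TAIL = [
    "rdword w,ptrb++", "rdword y,ptrb++", "setq y", "altd w",
    "rdlong 0,ptrb++", "_ret_ popa x", "shl y,#2", "add y,ptrb",
    "mov w,x", "popa x", "mov y,pb", "mov z,ptra", "call w",
]
# After inline_pasm's SKIP come regexec_'s prefix and SKIP, then regload_.
AFTER_INLINE = ["(AUGS prefix)", "skip (regexec_)", "mov ptrb,x"] + TAIL
# After regexec_'s SKIP comes regload_ directly.
AFTER_REGEXEC = ["mov ptrb,x"] + TAIL

def executed(pattern, slots):
    """Return the slots whose skip bit is 0 (bit 0 = first slot after SKIP)."""
    return [s for i, s in enumerate(slots) if not (pattern >> i) & 1]

inline_path = executed(0b11100100000111, AFTER_INLINE)
regexec_path = executed(0b1111000000, AFTER_REGEXEC)

print(inline_path)
print(regexec_path)
```

Under those assumptions the in-line path runs the two RDWORDs, the SETQ/ALTD/RDLONG load, the SHL/ADD pointer update, and then continues at `mov z,ptra`, while the REGEXEC path runs the load and rejoins at `popa x`, which lines up with the "a" and "b" marks in the comments.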

Rayman · 2019-12-13 22:59

Since you didn't publish anything yet, I'd call it a feature and not a bug...

evanh · 2019-12-14 00:01

Hehe, I don't think the PTRx behaviour could ever be called a feature.

Rayman · 2019-12-14 01:03

Is it really a bug?... Don't the ALTx and SETQ instructions only operate on the next instruction?
Doesn't seem like you can use both...

The docs are explicit on this for ALTx. They don't say it for SETQ, but they imply it...

evanh · 2019-12-14 01:23

ALTx and SETQ work in quite different ways. ALTx actively modifies the following instruction in the pipeline, no matter what instruction that might be. SETQ is much more benign, it just fills the hidden Q register and, presumably, sets a flag to say it has done so. It is then up to subsequent instructions to make use of what Q holds.

That Q flag is the messy part. For most op-codes, the default behaviour will have an auto-reset of the flag; with some of them making use of Q at the same time. But certain instructions like AUGx/ALTx will leave the flag set so that Q stays primed. Not too dissimilar to the interrupt blocking mechanism.

PS: And then there is MUXQ which uses Q irrespective of the state of the flag.
PPS: SETQ2 will be filling the same Q register but setting a different flag that only RDLONG/WRLONG act on.
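A toy Python model of that description, for what it's worth. It encodes only the guesses made in this post (a hidden Q register plus a "primed" flag that ALTx/AUGx leave alone and other instructions clear), not verified silicon behavior, and it leaves out SETQ2's separate flag. The class and method names are invented.

```python
class QModel:
    """Hypothetical model: hidden Q register plus a 'primed' flag."""
    KEEPS_FLAG = {"ALTD", "ALTS", "ALTR", "AUGS", "AUGD"}

    def __init__(self):
        self.q = 0
        self.primed = False

    def step(self, op, value=0):
        """Run one instruction; return whether it saw Q primed (None for SETQ/ALTx/AUGx)."""
        if op == "SETQ":
            self.q = value
            self.primed = True
            return None
        if op in self.KEEPS_FLAG:
            return None                  # flag survives, Q stays primed
        saw_primed = self.primed
        self.primed = False              # auto-reset for everything else
        return saw_primed

    def muxq(self, d, s):
        """MUXQ uses Q regardless of the flag: D = (D & !Q) | (S & Q)."""
        return ((d & ~self.q) | (s & self.q)) & 0xFFFFFFFF

cog = QModel()
cog.step("SETQ", 15)
cog.step("ALTD")
print(cog.step("RDLONG"))    # True: the block read still sees SETQ

cog.step("SETQ", 15)
cog.step("MOV")
print(cog.step("RDLONG"))    # False: the intervening MOV cleared the flag
```

In this model the Q value itself is never cleared, which is why `muxq` keeps working after the flag has been reset.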

potatohead · 2019-12-14 03:36

Did we ever make a programmer's model of a COG, similar to the ones published for many other CPUs?

If so, I would appreciate a pointer to it.

evanh · 2019-12-14 04:11

That's one term I'd not known. The only one similar I sort of knew was just processor architecture. But of course that has quite a broad coverage.

Anyway, I went and looked it up and of course Wikipedia has an entry. And maybe not surprisingly, it's somewhat pointedly written. The final sentence is this:

"Unfortunately, the terminology around such programming models tends to focus on the details of the hardware that inspired the execution model, and in that insular world the mistaken belief is formed that a programming model is only for the case when an execution model is closely matched to hardware features."
