Optimized PFTH for the P2

mindrobots · 2015-03-11 19:39

bmentink wrote: »

Great! I did not realize those instructions were so flexible .... so the stacks would be in Hub in that case ....

Cheers, B.

X/Y stack space was in a special area of COG memory referred to as AUXRAM. It could be used for stack/data or a Color Look Up Table (CLUT) used for video.

mindrobots · 2015-03-11 19:42

bmentink wrote: »
So the code:
push1                   wrlong  parm1, stackptr
                        add     stackptr, #4
push_ret                ret
Could be written inline as: wrlong parm1,ptra++
... and the code:
pop1                    sub     stackptr, #4
                        rdlong  parm1, stackptr         wz
pop_ret
pop2_ret                ret
Could be written inline as: rdlong parm1,--ptra

That will speed things up (assuming ptra has been set up to point to the data stack, of course).

PS: Does PASM support aliases? Then we could define PUSHD and POPD as aliases for the above ...

Use PUSHX/POPX and PUSHY/POPY instructions. Look at my PFTH examples, they use the stack space in AUXRAM.

bmentink · 2015-03-11 19:47

ozpropdev wrote: »

oops!
A quick look back at the old notes and the indirect stuff used the CCCC bits in the opcode not the PTRx stuff.
The indirect feature was nice too, but I don't think that made the cut in the new P2.

You're kidding, right? So all the stuff that makes a great Forth computer is culled ....... bugger!

bmentink · 2015-03-11 19:48

mindrobots wrote: »

Use PUSHX/POPX and PUSHY/POPY instructions. Look at my PFTH examples, they use the stack space in AUXRAM.

Read the rest of this thread ..... that has been culled .....

mindrobots · 2015-03-11 19:50

bmentink wrote: »

You're kidding, right? So all the stuff that makes a great Forth computer is culled ....... bugger!

Grab an FPGA and a copy of the February 2014 P2Hot emulation and you can play with PFTH and the P2 like I did.

bmentink · 2015-03-11 19:52

mindrobots wrote: »

Grab an FPGA and a copy of the February 2014 P2Hot emulation and you can play with PFTH and the P2 like I did.

You are not getting the point! I am looking at what the current thinking is for the P2 instruction set, not the P2-HOT .... lots of things have been culled.
If we can't make a decent Forth machine out of the P2 because he left out some essential PUSH/POP style commands, I want to push back to Chip ....... before it is too late.

mindrobots · 2015-03-11 19:54

It's too late. You are at least 8 months too late to make any pitch on the latest round of P2 changes.

bmentink · 2015-03-11 19:57

Well, if the current instruction set is as per http://forums.parallax.com/showthrea...=1#post1260747 then we are too late to have a good Forth machine .....

Looks like I will have to delve into Verilog and make my own ....:-(

mindrobots · 2015-03-11 19:59

The P2 should be a more than decent Forth machine running a version of Tachyon tailored for its final architecture. Tachyon screams on the P1 and should really scream on the P2.

bmentink · 2015-03-11 20:01

mindrobots wrote: »

The P2 should be a more than decent Forth machine running a version of Tachyon tailored for its final architecture. Tachyon screams on the P1 and should really scream on the P2.

Yes, but this thread is about PFTH! Not Tachyon ....

I had a look at Tachyon. I hate the syntax, the code is too hard to follow, it's unreadable.
I prefer PFTH, it's easy to understand AND quick (and with a bit of tweaking much faster) .... a good example of Forth and PASM to teach the young ones ....

ozpropdev · 2015-03-11 20:06

Don't panic Bernie!
These instructions are in the last proposed P2 instruction set.

PUSHA   D/#     (alias for WRLONG D/#,PTRA++)
PUSHB   D/#     (alias for WRLONG D/#,PTRB++)
POPA            (alias for RDLONG D,--PTRA)
POPB            (alias for RDLONG D,--PTRB)

If these are included then the auto inc/dec feature is still a goer!
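
For anyone following along, the auto inc/dec behaviour of those aliases can be sketched as a small Python model. This is only a behavioural illustration, not PASM: the `HubStack` class, its base address, and the method names are made up for the example, and it assumes PTRA steps by one long (4 bytes) per access, as the push1/pop1 routines quoted earlier do with stackptr.

```python
class HubStack:
    """Toy model of a stack of longs in hub RAM, addressed through PTRA.

    Hypothetical helper for illustration; not part of PFTH or any P2 tool.
    """

    def __init__(self, base=0x1000):
        self.mem = {}      # hub address -> 32-bit value
        self.ptra = base   # PTRA points at the next free long

    def pusha(self, value):
        # Models WRLONG value,PTRA++ : write at PTRA, then advance one long.
        self.mem[self.ptra] = value & 0xFFFFFFFF
        self.ptra += 4

    def popa(self):
        # Models RDLONG D,--PTRA : step back one long, then read.
        self.ptra -= 4
        return self.mem[self.ptra]


stack = HubStack()
for item in (1, 2, 3):
    stack.pusha(item)

top = stack.popa()     # 3, the last value pushed
second = stack.popa()  # 2
```

The point of the model is that post-increment on write and pre-decrement on read are exact mirrors, so a single instruction each way gives a stack that grows upward, with no separate add/sub on the pointer.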

bmentink · 2015-03-11 23:27

Cool, can you give me a link to them, maybe I am looking at the not-so-latest ...

LoopyByteloose · 2015-03-11 23:39

bmentink wrote: »

Yes, but this thread is about PFTH! Not Tachyon ....

I had a look at Tachyon. I hate the syntax, the code is too hard to follow, it's unreadable.
I prefer PFTH, it's easy to understand AND quick (and with a bit of tweaking much faster) .... a good example of Forth and PASM to teach the young ones ....

I am looking forward to Forth on the Propeller 2, especially if it is both educational and functional. So I still prefer PFTH, while I appreciate Tachyon's pushing the boundaries. You just can't have one Forth do everything on the Propeller. If you stretch in one direction, you ignore another. Don't forget PropForth. I am sure it will try to keep up.

ozpropdev · 2015-03-12 01:41

@bmentink
I believe these were the most recent opcodes presented on the new P2.
See here

bmentink · 2015-03-12 12:10

ozpropdev wrote: »

@bmentink
I believe these were the most recent opcodes presented on the new P2.
See here

@ozpropdev Still don't get where you got the following from.

PUSHA   D/#     (alias for WRLONG D/#,PTRA++)
PUSHB   D/#     (alias for WRLONG D/#,PTRB++)
POPA            (alias for RDLONG D,--PTRA)
POPB            (alias for RDLONG D,--PTRB)

The link you gave for the latest instruction set was the link I already quoted; it does not have the above.
It does have a comment "Aliases for WRLONG/RDLONG: PUSHA/PUSHB/POPA/POPB" at the bottom of the quote.

Confused! Did you write the above quote, or did Chip? If it was Chip, you must have got it from another source, not the above link ...

ozpropdev · 2015-03-12 16:37

@Bernie
Hmm... I see the problem
The PUSH/POP stuff I posted is from an instruction set dated 9 April 2014. I also have a set dated 14 April 2014.
I can't seem to find either set on the forum now.
Anyhow the 16 April 2014 set refers to PUSHA/POPA/PUSHB/POPB as aliases for WRLONG etc.
In order to be a true PUSH/POP it MUST be auto increment/decrement.
Based on the further evidence of CALLA/RETA one can assume that the infrastructure exists for auto inc/dec.
Sorry for the confusion.
BTW. The link in your post is broken.
Cheers
Brian

bmentink · 2015-03-12 22:46

@mindrobots I see you have already had a go at this in this thread: http://forums.parallax.com/showthread.php/154489-Quick-Gains-converting-P1-code-to-P2-code-pfth-example
Maybe that was what you were looking for, when you were looking for your code that had all the speed improvements in (Return Stack, Data Stack, Serial) ?

In that thread you state:

I need these instructions:

pushx/y ptra
popx/y ptra
and something like jmp *ptra where I can jump to the address in the address pointed to by ptra (I don't think that exists or I don't know how to code it) I'm still looking because that is where the big speed gain is.

I wonder if the following instruction can be used as an indirect jump:

----		1111101 11 1 CCCC 0 nnnnnnnnnnnnnnnnn		CALLA	#abs			(call to 17-bit absolute address using PTRA)
----		1111101 11 1 CCCC 1 nnnnnnnnnnnnnnnnn		CALLA	@rel			(call to 17-bit relative address using PTRA)

Maybe we could make the CALL a JMP by modifying the SP to remove the stacked return address .... I seem to remember using this trick in the past ....

The only problem is there is no CALLA @D, so we would have to use the @rel form somehow ...

There is also this command:

ZCWS		1011111 ZC I CCCC DDDDDDDDD SSSSSSSSS		JMPSW	D,S/@			(jump to S/@, store return address in D, WZ/WC to save/load flags)

evanh · 2015-03-13 03:49

ozpropdev wrote: »

@Bernie
Hmm... I see the problem
The PUSH/POP stuff I posted is from a instruction set dated 9 April 2014. I also have a set dated 14 April 2014.
I can't seem to find either set on the forum now.
Anyhow the 16 April 2014 set refers to PUSHA/POPA/PUSHB/POPB as aliases for WRLONG etc.

Here's the 9 April 2014 link - http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1258426&viewfull=1#post1258426

mindrobots · 2015-03-13 04:36

@Bernie, Thanks! That is the thread I was looking for with the final consolidated version of PFTH-P2. Shortly after this, the P2 went to jelly, and except for a fun spare-time exercise playing on a P2 that will never be, I haven't done anything further or thought any more about it. Once we get a more tangible P2, I'm sure I'll start up the hunt again.

bmentink · 2015-03-13 13:02

I have found this old link (Sept 2014, but the most recent I can find) that describes the P2 --> https://www.parallax.com/news/2014-09-19/propeller-2-schedule-update-longer-we-work-simpler-our-new-multicore-design-will
I did not know that each COG can execute code in hub as well as its own RAM ... interesting ..

Note, this spec was published AFTER the dates on the published opcodes ... so maybe they have been updated again. I wish Chip would publish his latest offerings ....

Cheers,
Bernie

mindrobots · 2015-03-13 13:09

Chip's busy. We don't want to hear anything from Chip until there is a post that says, "Try this on your FPGAs!"

Execute code from HUBRAM - last I recall, this is still on the feature list for the P2. I think everyone liked that feature! It gives you a MUCH larger code space if you can actually execute code that lives outside the 2KB COG RAM. It opens up a big, flat address space to play with!

evanh · 2015-03-13 14:31

bmentink wrote: »

I did not know that each COG can execute code in hub as well as its own RAM ... interesting ..

Cool eh, we've been calling it HubExec for short. Its existence has a history, and its many details caused Chip a lot of stress during the Prop2-HOT development cycle, but it was also deemed important to achieve.

I don't think anyone even wanted to ask if he was including it in the Prop2-COLD design but he's been adamant it's going to happen. I'd hazard a guess that Chip is re-implementing HubExec in the new design right now.

bmentink · 2015-03-13 18:41

@Dave Hein
Hi Dave, in the following definition of pop1, what is the purpose of the setting of the zero flag? I don't see any instances where you use pop1, that use the zero flag.
Can it simply be removed? It would then allow more "pop" stack operations to be in-lined without penalty ...

pop1                    popx    parm1
                        cmp     parm1, #0 wz
pop_ret
pop2_ret                ret

Also I notice in the following fragment instances where two operations could be one ..

semicolonfunc           mov     temp1, #0
                        wrlong  temp1, a_state

I think you can do a simple wrlong #0, a_state, as the instruction supports it (was this not possible on the P1?)

Cheers,
Bernie

Dave Hein · 2015-03-13 19:05

_jzfunc uses the zero flag after calling pop1. "wrlong #0, a_state" doesn't work on the P1. I didn't know about that on the P2. That's great if it works.

bmentink · 2015-03-13 21:34

Dave Hein wrote: »

_jzfunc uses the zero flag after calling pop1. "wrlong #0, a_state" doesn't work on the P1. I didn't know about that on the P2. That's great if it works.

If _jzfunc is the only place, then we can just set the flag there rather than burden every pop1 .. as mindrobots already did in the attached file.
He just did not delete the compare from pop1, which can now happen ... then pop1 can be in-lined in more places ..

If the following is correct, using #constant should work ..

--LS		1110001 0L I CCCC DDDDDDDDD SSSSSSSSS		WRLONG	D/#,S/PTRA/PTRB		(waits for hub)
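
To make the pop1 argument concrete, here is a rough Python model of the two arrangements. It is a sketch only: `KernelModel`, `jz_old`, `jz_new`, and the `pc` bookkeeping are hypothetical names for illustration and do not come from the pfth source. It assumes _jzfunc branches when the popped value is zero and otherwise falls through to the next word.

```python
class KernelModel:
    """Toy model of a Forth kernel's data stack, zero flag, and program counter."""

    def __init__(self, items):
        self.stack = list(items)
        self.z = False   # models the Z flag
        self.pc = 0      # models the interpreter's position

    def pop1_flagged(self):
        # Current pop1: pop, then "cmp parm1, #0 wz" on every call.
        parm1 = self.stack.pop()
        self.z = (parm1 == 0)
        return parm1

    def pop1_plain(self):
        # Proposed pop1: just the pop, so it is cheap to in-line.
        return self.stack.pop()

    def jz_old(self, target):
        # The branch relies on the flag that pop1 set as a side effect.
        self.pop1_flagged()
        self.pc = target if self.z else self.pc + 1

    def jz_new(self, target):
        # The branch does its own compare, so pop1 no longer has to.
        parm1 = self.pop1_plain()
        self.z = (parm1 == 0)
        self.pc = target if self.z else self.pc + 1
```

Both versions of the branch end up at the same place for the same input; the only difference is that the compare is paid once in the branch instead of on every pop.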

Cheers,
Bernie

Dave Hein · 2015-03-14 08:44

Yes, that works. It will be interesting to see how fast pfth runs on the P2. One thing that concerns me is the use of hub reads and jumps in the pfth kernel. Jumps cause a pipeline flush, which wastes cycles. Hub reads will cause hub stalls. The fastest Forth implementation for P2 may be one that uses the hubexec mode instead of doing all the primitive operations in a kernel. This will require more code memory, but it may be a lot faster.

bmentink · 2015-03-14 16:40

Dave Hein wrote: »

Yes, that works. It will be interesting to see how fast pfth runs on the P2. One thing that concerns me is the use of hub reads and jumps in the pfth kernel. Jumps cause a pipeline flush, which wastes cycles. Hub reads will cause hub stalls. The fastest Forth implementation for P2 may be one that uses the hubexec mode instead of doing all the primitive operations in a kernel. This will require more code memory, but it may be a lot faster.

Hmmm .. if we could execute most of the code AND have the stacks and some variables in COG, then hub reads will be minimal, I would like to see that approach benchmarked against running entirely in hubexec.

It "should" still be faster, as the hub is clocked at 1/2 the speed of the cogs .. the overflow of variables and/or code could be hubexec ..... maybe we could get the best of both worlds.
We could have a split Forth system: run in COG if possible, remainder in hubexec. I believe the COGs have 4 KB of RAM now?

Cheers,
Bernie

evanh · 2015-03-14 16:43

bmentink wrote: »

It "should" still be faster, as the hub is clocked at 1/2 the speed of the cogs ..

Prop2 Hub is the same speed as its Cogs - 16 clocks and 16 operations per rotation. Albeit an "operation" here has many facets, from the CORDIC to parallel accesses from all Cogs.
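
As a back-of-envelope illustration of what a 16-clock rotation means for hub stalls, here is a tiny Python sketch. It assumes the simplest possible scheme, a plain round-robin where cog N gets its slot whenever the clock count modulo 16 equals N; the real arbitration may well differ, and `hub_wait` is a made-up helper, not anything from Parallax.

```python
HUB_SLOTS = 16  # one slot per cog per rotation (assumed plain round-robin)


def hub_wait(cog_id, clock):
    """Clocks a cog must wait from `clock` until its next hub slot."""
    return (cog_id - clock) % HUB_SLOTS


# Worst case: the cog asks one clock after its slot has passed.
worst = max(hub_wait(3, c) for c in range(HUB_SLOTS))

# Average over every possible arrival time within one rotation.
average = sum(hub_wait(3, c) for c in range(HUB_SLOTS)) / HUB_SLOTS
```

Under that assumption a randomly timed hub access waits anywhere from 0 to 15 clocks, about 7.5 on average, which is the kind of stall a kernel full of hub reads would keep running into.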

ozpropdev · 2015-03-14 18:11

bmentink wrote: »

I believe the COGs have 4 KB of RAM now?

The cogs still only have 2K (512 Longs) of cog ram.
The old "HOT" P2 had in addition to cog ram a "CLUT/AUX" ram block of 1K (256 Longs) which has been removed from the new P2.

bmentink · 2015-03-14 18:36

evanh wrote: »

Prop2 Hub is the same speed as its Cogs - 16 clocks and 16 operations per rotation. Albeit an "operation" here has many facets, from the CORDIC to parallel accesses from all Cogs.

I thought I read somewhere on the forums that the COGs run at 100 MIPS and the HUB at 50 MIPS ... is that not correct?

Which brings me to a question that has to be asked: (.. and please do not see this as a criticism, just need knowledge, I am more than happy with 100MIPS)

Why is it that, in the day of modern ARM processors, we can get up to 2 to 3 MIPS/MHz, but with the Prop only 0.5 MIPS/MHz? I thought the addition of a pipeline helped with that, but I know very little about this ... someone with more knowledge please help .. is it something to do with the silicon technology used?

Cheers,
B.
