Optimized PFTH for the P2

bmentink · 2015-03-10 13:11

@Dave Hein .... Request ... Pretty please.

Can you please provide an optimized version of pfth for the P2 that uses the new hardware, i.e. multiply/divide, the new branching instructions and whatever else is new and shiny ...
That would be awesome. Also, a better way to include PASM blocks (objects) without having to cut and paste hex instructions would be great ... I know that is a much bigger ask.

I know the instructions are not locked down yet, but I am sure you could start on it and tweak it later ..
I would use it as my language of choice for the P2 ..... especially if it was optimized ...

Cheers,
Bernie

Dave Hein · 2015-03-10 13:24

It's been so long I can't even remember where I left off with pfth on the P2. The last P2 FPGA image that I worked with was the October 2013 version. I never did anything with the early 2014 versions since we never got a document showing the instruction set for that version. I have hope that we will see a new FPGA image sometime in the next 3 months, so I would prefer waiting for that.

mindrobots · 2015-03-10 13:33

Bernie,

I have a version (I think I posted it) that uses the serin/serout instructions, the stack space and I think mul/div. It's either someplace in the locked P2 thread where the images are or the PFTH P2 thread (if there is one). I was doing comparisons as to how much COG space was recovered. It was pretty cool how much those helped.

I'll try and look later tonight. If all else fails, I'm sure it is on my PC someplace.

bmentink · 2015-03-10 13:37

Thanks Dave, makes sense to wait for that image.

@mindrobots If you could dig that out and post it to this thread, that would be great.

mindrobots · 2015-03-10 13:41

bmentink wrote: »

Thanks Dave, makes sense to wait for that image.

@mindrobots If you could dig that out and post it to this thread, that would be great.

This thread has three different versions of PFTH. I think they all work.

It might be in the main thread where I detailed the savings from version to version. You'd want to look around in the end of Feb 2014 to early March 2014 timeframe.

I think the SERIN/SEROUT and stack instructions really helped with PFTH but what do I know.

I'll probably put the February P2-HOT back on my DE2 just to play with PFTH more. But, again, what do I know!

bmentink · 2015-03-10 15:46

@mindrobots >But, again, what do I know!
You know heaps it seems!

So someone just needs to merge all three versions into one and then onto the mult/div optimizing ...
Wish I could help, but no hardware ..

Cheers,
Bernie

mindrobots · 2015-03-10 16:04

I have to check. I thought my versions were progressive. I do have one someplace with both stacks moved to hardware, SERIN/OUT and maybe MUL. It is probably on my Windows laptop which is having a nervous breakdown and getting an SSD transplant on Thursday.

I stopped playing when P2-Hot went off the rails.

bmentink · 2015-03-10 16:05

So as "next", or innerloop in this case, is critical to make as fast as possible, can we do anything better with the new instructions to reduce it to a couple of instructions?
Currently it is:

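innerloop               rdword  parm, pc                wz      ' fetch the next execution token, Z set if it is zero
        if_z            jmp     #exitfunc                       ' zero token: leave via exitfunc
                        add     pc, #2                          ' step pc to the following token
                        rdword  temp1, parm                     ' read the cog address stored at the dictionary entry
                        jmp     temp1                           ' jump to the routine that executes the word

pc                      long    @xboot_1+Q                      ' Program Counter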

PS: How do I do code blocks in this forum, can't see any options with the "advanced" editor

Bernie

mindrobots · 2015-03-10 16:09

Haha! Not until we know what the new P2 looks like!!

I didn't get that far with P2-Hot, so maybe. :0) time for you to start studying P2 PASM! :0)

bmentink · 2015-03-10 16:52

mindrobots wrote: »

Haha! Not until we know what the new P2 looks like!!

I didn't get that far with P2-Hot, so maybe. :0) time for you to start studying P2 PASM! :0)

Yep, will do .... let's hope the current documentation doesn't change that much then.
It should be possible to use just two instructions for the inner loop. After all, I did that in eForth (indirect threaded) on multiple uP's that supported indirect jumps, e.g. the Hitachi H8 back in the day ...

B.

PS: Maybe the documentation at: https://docs.google.com/document/pub?id=1-9OtOVRkTxojO-cAFMwpcZEUorkCU_uwi26EQZMV8uM#id.pwsgkg8pbdsq
is not that current. I don't see any reference to popx/pushx or popy/pushy. Have these been added later, or are they just renamed popa/popb?

mindrobots · 2015-03-10 18:40

I would count on a mostly different instruction set rather than, "not change that much".

Dave Hein · 2015-03-10 19:40

pfth uses an execution token that points to a word in the dictionary. The second read in the loop reads a cog address that points to a routine used to execute the word. This makes it easier to debug by mapping to words in the dictionary. I've also looked at doing a single read that gets the cog address immediately, which would be faster, but it is harder to debug.
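The trade-off between the two-read and one-read dispatch can be sketched with a rough Python model. This is not PASM and not pfth's real memory layout: hub RAM is modelled as a dict keyed by byte address, "cog addresses" are just indices into a list, and the names `run_indirect`, `run_direct`, `handlers` and all addresses are made up for illustration.

```python
# Toy model of the inner interpreter in two flavours.
# Assumptions (not from pfth): hub memory is a dict of 16-bit cells keyed
# by byte address, and a "cog address" is an index into `handlers`.

def run_indirect(hub, handlers, pc):
    """Two hub reads per word: thread -> execution token -> cog address."""
    trace, reads = [], 0
    while True:
        xt = hub[pc]                  # rdword parm, pc   wz
        reads += 1
        if xt == 0:                   # if_z jmp #exitfunc
            return trace, reads
        pc += 2                       # add pc, #2
        cog_addr = hub[xt]            # rdword temp1, parm
        reads += 1
        # The handler still knows which dictionary entry it came from.
        trace.append((handlers[cog_addr], xt))

def run_direct(hub, handlers, pc):
    """One hub read per word: the thread holds the cog address itself."""
    trace, reads = [], 0
    while True:
        cog_addr = hub[pc]
        reads += 1
        if cog_addr == 0:
            return trace, reads
        pc += 2
        # No dictionary address is available here, hence harder to debug.
        trace.append((handlers[cog_addr], None))

handlers = [None, "dup", "add"]       # index 0 is reserved as the terminator

# Dictionary entries at 0x100 and 0x104 hold cog addresses 1 and 2.
hub_indirect = {0x100: 1, 0x104: 2,
                0x200: 0x100, 0x202: 0x104, 0x204: 0x100, 0x206: 0}
hub_direct = {0x200: 1, 0x202: 2, 0x204: 1, 0x206: 0}

indirect_trace, indirect_reads = run_indirect(hub_indirect, handlers, 0x200)
direct_trace, direct_reads = run_direct(hub_direct, handlers, 0x200)
```

In this model the same three-word thread runs the same routines either way, but the direct form does roughly half the hub reads and loses the dictionary address that makes tracing easy.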

bmentink · 2015-03-10 19:49

Can pasm have conditional assembly? If so a DEBUG flag that will select between the debug version of the "innerloop" function, and the optimized one ..

Bernie

bmentink · 2015-03-10 20:03

mindrobots wrote: »

I would count on a mostly different instruction set rather than, "not change that much".

Yikes, I hope the POP/PUSH/STACK remains ....... and MUL/DIV ...all that nice hardware ..

If everything remains, this could be a seriously fast Forth beast ...

B.

ozpropdev · 2015-03-10 20:45

This is from the instruction set released 14 April 2014 (16 cog 512k hub ram)

Aliases for WRLONG/RDLONG: 	PUSHA/PUSHB/POPA/POPB

It appears PTRX,PTRY have been removed along with the CLUT/AUX ram.
The index registers PTRA/PTRB are still there so Hub stacks are still in the mix.

bmentink · 2015-03-10 22:44

ozpropdev wrote: »
This is from the instruction set released 14 April 2014 (16 cog 512k hub ram)
Aliases for WRLONG/RDLONG: 	PUSHA/PUSHB/POPA/POPB
It appears PTRX,PTRY have been removed along with the CLUT/AUX ram.
The index registers PTRA/PTRB are still there so Hub stacks are still in the mix.

Thanks Cobber!

PS: Only 16-bit MUL/DIV, that is disappointing. I suppose cordic math has gone as well?

ozpropdev · 2015-03-10 23:06

IIRC the cordic engine was rumoured to become a shared resource on the hub.

evanh · 2015-03-11 01:57

From http://forums.parallax.com/showthread.php/158978-Prop2-Development-Update

There are four things that need developing in order to complete the chip:

1) Hub execution (Verilog, challenging)
2) Pipelined CORDIC in the hub (Verilog, pretty straightforward)
3) Smart pins (Verilog, not hard, but very open-ended)
4) ROM code (Prop2 assembly language, not hard)

My guess is that the CORDIC engine will be designed to take 16 cycles for a result, and that all Cogs can simultaneously have one calculation each running in the pipe.

bmentink · 2015-03-11 11:29

Well at least the CORDIC engine is still there, but to have only a 16bit MUL(S) and not even a DIV is very disappointing, the divide function is what takes the longest and should be in hardware, even more important than the MUL.
If anyone is chatting to Chip, please talk him into some decent maths .... I know he is making compromises, but the basic stuff???
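For a sense of why divide is the expensive one without hardware support: a plain software divide is usually a shift-and-subtract loop with one iteration per result bit. The sketch below is a Python model of that general technique, not code from pfth or any P2 document, and `udiv32` is a hypothetical name.

```python
def udiv32(dividend, divisor):
    """Restoring shift-and-subtract division on unsigned 32-bit values.

    Returns (quotient, remainder, iterations). The loop always runs 32
    times, which is the cost a hardware divider would hide.
    """
    dividend &= 0xFFFFFFFF
    divisor &= 0xFFFFFFFF
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    quotient = 0
    remainder = 0
    iterations = 0
    for bit in range(31, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        iterations += 1
    return quotient, remainder, iterations
```

Each iteration would be several instructions in assembly (shift, compare, conditional subtract), so a 32-bit divide costs far more than a multiply built from a few 16x16 partial products.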

jmg · 2015-03-11 12:32

I thought the MUL was in the COG / opcodes and the CORDIC and other higher maths ops were shared in the HUB, memory mapped? (but I think with separate data paths, so no one ever knows there are other mathops being processed)
That seems the right way to manage it.

evanh · 2015-03-11 14:27

There is a MUL still in the Cogs, but only 16x16, as bmentink has noted. See http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1260747&viewfull=1#post1260747

There used to be two separate multipliers per Cog, a slower inline 32x32 and a faster 24x24 in the MAC/CORDIC unit.
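If only a 16x16 multiplier is available, a 32-bit multiply can still be assembled from partial products. The following Python sketch shows the arithmetic only. It assumes the hardware multiply is unsigned 16x16 giving a 32-bit result, which the thread does not confirm, and `mul16`, `mul32_low` and `mul32_full` are hypothetical names.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul16(a, b):
    """Stand-in for an assumed unsigned 16x16 -> 32-bit hardware multiply."""
    return (a & MASK16) * (b & MASK16)

def split(x):
    """Return the (low, high) 16-bit halves of a 32-bit value."""
    return x & MASK16, (x >> 16) & MASK16

def mul32_low(a, b):
    """Low 32 bits of a 32x32 product, using three 16x16 multiplies.

    The high*high partial product only affects bits 32 and up, so a
    single-cell Forth `*` would not need it.
    """
    a_lo, a_hi = split(a)
    b_lo, b_hi = split(b)
    cross = (mul16(a_lo, b_hi) + mul16(a_hi, b_lo)) & MASK16
    return (mul16(a_lo, b_lo) + (cross << 16)) & MASK32

def mul32_full(a, b):
    """Full 64-bit unsigned product, using four 16x16 multiplies."""
    a_lo, a_hi = split(a)
    b_lo, b_hi = split(b)
    cross = mul16(a_lo, b_hi) + mul16(a_hi, b_lo)
    return (mul16(a_hi, b_hi) << 32) + (cross << 16) + mul16(a_lo, b_lo)
```

So a missing 32x32 multiplier costs a handful of extra instructions rather than a bit-by-bit loop.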

bmentink · 2015-03-11 15:38

jmg wrote: »

I thought the MUL was in the COG / opcodes and the CORDIC and other higher maths ops were shared in the HUB, memory mapped? (but I think with separate data paths, so no one ever knows there are other mathops being processed)
That seems the right way to manage it.

Yes it IS the right way to go. My complaint, if I am not being clear, is the lack of 32bit MUL and DIV functions in hardware. I don't care if they are in the cog or hub, as long as they are done in hardware. Are you saying the "higher maths" functions include 32bit MUL/DIV?

Also, for Forth we **need** the PUSHX/POPX style instructions for both stacks. To make Forth efficient, we shouldn't have to muck about with incrementing/decrementing pointers ..

B.

PS: I see there are PUSH and POP instructions that work on a 4-level stack, shame that is not larger, maybe then that could be used for the data and return stacks. Simplest would be to put back the POPX(Y)/PUSHX(Y) instructions ....

ozpropdev · 2015-03-11 17:12

bmentink wrote: »

Also, for Forth we **need** the PUSHX/POPX style instructions for both stacks. To make Forth efficient, we shouldn't have to muck about with incrementing/decrementing pointers ..

If the PTRA/PTRB pointers follow the same implementation of the last P2 they will have auto increment/decrement functionality.
Putting PUSHX/POPX style stack functions back in adds a lot of silicon to the new leaner design. (16 cogs x 256 longs of aux ram).
Don't get me wrong though, I really liked the PUSHX,POPX stuff you speak of.

jmg · 2015-03-11 17:19

bmentink wrote: »

Are you saying the "higher maths" functions include 32bit MUL/DIV?
....

I think that was the intention - just what eventually arrives, we will have to wait and see.
The advantage of shifting them out of the COG, is the high silicon cost is not x16, plus they are not often needed on a 1 cycle basis.

bmentink wrote: »

Also, for Forth we **need** the PUSHX/POPX style instructions for both stacks. To make Forth efficient, we shouldn't have to muck about with incrementing/decrementing pointers ..

Then there is the question of choice of where such Pointers should point ? COG RAM ? HUB Ram ?, or even an Adjacent COGS Ram ? (allowing an unused COG to donate that useful RAM )

bmentink · 2015-03-11 17:44

ozpropdev wrote: »

If the PTRA/PTRB pointers follow the same implementation of the last P2 they will have auto increment/decrement functionality.
Putting PUSHX/POPX style stack functions back in adds a lot of silicon to the new leaner design. (16 cogs x 256 longs of aux ram).
Don't get me wrong though, I really liked the PUSHX,POPX stuff you speak of.

Thanks. So with the PTRA/PTRB pointers, how are they set to auto increment (PUSH) and auto decrement (POP)? Won't that have to be done dynamically (read: with extra instructions), instead of the one PUSH or POP?
Or, is there something fancy done with the CALLA and RETA instructions ... don't get it ....

PS: Don't need 256Longs per cog for the Forth stacks ... that is overkill. (4..8 for the Data stack, 32 maybe for the Return stack)

Cheers,
Bernie

bmentink · 2015-03-11 17:48

jmg wrote: »

Then there is the question of choice of where such Pointers should point ? COG RAM ? HUB Ram ?, or even an Adjacent COGS Ram ? (allowing an unused COG to donate that useful RAM )

Well, for PFTH (this topic) each stack is in each cog I believe (please correct me if that is wrong), in which case the pointer needs to be on each cog.
Adjacent cog ram would be slow, the access would have to be through the hub, right?

mindrobots · 2015-03-11 17:56

In PFTH on the P2HOT, there is AUXRAM which ptrx/ptry point to. They do auto increment with push/pop, I forgot the details. You need to look at the P2 instructions. I don't recall if AUXRAM made the cut.

ozpropdev · 2015-03-11 18:39

Bernie
In the old P2 the auto increment was done simply in the same instructions like this

    rdlong mydata,ptra++
or
    wrlong stuff,ptrb--

The feature used some of the condition bits of the opcode to implement the inc/dec stuff.
A very nice feature

Cheers
Brian

P.S. Or the condition bits were used for INDA/INDB?, my memory is a bit foggy today

Here's a snippet from the old P2 docs.

Examples:

0000000 00 1 1111 DDDDDDDDD 000000000     RDBYTE  D,PTRA         'read byte at PTRA into D
1100110 11 1 1111 000001010 111000001     WRWORD  #10,PTRB++     'write word value 10 at PTRB,        PTRB += 1*2
0000100 00 1 1111 DDDDDDDDD 011111111     RDLONG  D,PTRA--       'read long at PTRA into D,           PTRA -= 1*4
1111111 00 1 1111 110000001 010000101     RDWIDE  ++PTRB         'read wide at PTRB+32 into WIDEs,    PTRB += 1*32
1100110 00 1 1111 DDDDDDDDD 010111111     WRBYTE  D,--PTRA       'write lower byte in D at PTRA-1,    PTRA -= 1*1

1100110 10 1 1111 DDDDDDDDD 100000111     WRWORD  D,PTRB[7]      'write lower word in D to PTRB+7*2
0000101 00 1 1111 DDDDDDDDD 011011111     RDLONGC D,PTRA++[31]   'read cached long at PTRA into D,    PTRA += 31*4
1100111 11 1 1111 000000000 111111101     WRWIDE  PTRB--[3]      'write WIDEs at PTRB,                PTRB -= 3*32
1100110 00 1 1111 DDDDDDDDD 010000110     WRBYTE  D,++PTRA[6]    'write lower byte in D to PTRA+6*1,  PTRA += 6*1
0000010 00 1 1111 DDDDDDDDD 110110110     RDWORD  D,--PTRB[10]   'read word at PTRB-10*2 into D,      PTRB -= 10*2
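The addressing behaviour in those doc lines can be summarised with a small Python model. It follows only the comments in the snippet above (which address is used, and how the pointer changes) and ignores the opcode encoding. `ptr_access` and its `mode` strings are invented for illustration.

```python
# Bytes moved per element, taken from the scale factors in the comments.
SCALE = {"byte": 1, "word": 2, "long": 4, "wide": 32}

def ptr_access(ptr, mode, size, index=1):
    """Return (address_used, new_ptr) for one PTRx expression.

    mode: "ptr"    -> PTRx        (no update)
          "ptr[i]" -> PTRx[i]     (offset only, no update)
          "ptr++"  -> PTRx++[i]   (use, then add)
          "ptr--"  -> PTRx--[i]   (use, then subtract)
          "++ptr"  -> ++PTRx[i]   (add, then use)
          "--ptr"  -> --PTRx[i]   (subtract, then use)
    """
    step = index * SCALE[size]
    if mode == "ptr":
        return ptr, ptr
    if mode == "ptr[i]":
        return ptr + step, ptr
    if mode == "ptr++":
        return ptr, ptr + step
    if mode == "ptr--":
        return ptr, ptr - step
    if mode == "++ptr":
        return ptr + step, ptr + step
    if mode == "--ptr":
        return ptr - step, ptr - step
    raise ValueError(f"unknown mode: {mode}")

# A few of the doc examples, with an arbitrary starting pointer.
base = 0x1000
examples = {
    "WRWORD #10,PTRB++":    ptr_access(base, "ptr++", "word"),
    "WRBYTE D,--PTRA":      ptr_access(base, "--ptr", "byte"),
    "WRWORD D,PTRB[7]":     ptr_access(base, "ptr[i]", "word", 7),
    "RDLONGC D,PTRA++[31]": ptr_access(base, "ptr++", "long", 31),
    "RDWORD D,--PTRB[10]":  ptr_access(base, "--ptr", "word", 10),
}
```

Address wrap-around and any limits on the index field are not modelled.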

bmentink · 2015-03-11 19:17

ozpropdev wrote: »

The feature used some of the condition bits of the opcode to implement the inc/dec stuff.

Great! I did not realize those instructions were so flexible .... so the stacks would be in Hub in that case ....

Cheers, B.

ozpropdev · 2015-03-11 19:23

oops!
A quick look back at the old notes and the indirect stuff used the CCCC bits in the opcode not the PTRx stuff.
The indirect feature was nice too, but I don't think that made the cut in the new P2.

bmentink · 2015-03-11 19:37

So the code:

push1                   wrlong  parm1, stackptr
                        add     stackptr, #4
push_ret                ret

Could be written inline as: wrlong parm1,ptra++
... and the code:

pop1                    sub     stackptr, #4
                        rdlong  parm1, stackptr         wz
pop_ret
pop2_ret                ret

Could be written inline as: rdlong parm1,--ptra wz

That will speed things up. (assumes ptra has been setup to point to the Data stack of course)

PS: Does PASM support aliases? Then we could put PUSHD and POPD as aliases for the above ...

The POP2 and POP1 functions could be similar ... as well as all the other places the data stack is used.
PTRB would then be used for the return stack, and again the instruction would be inlined ...
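As a sanity check that a post-increment write paired with a pre-decrement read really behaves as a stack, here is a small Python model. It assumes the old P2 semantics quoted earlier in the thread (`ptra++` uses the address then adds 4, `--ptra` subtracts 4 then uses the address). The class name `HubStack` and the base addresses are hypothetical.

```python
class HubStack:
    """Upward-growing stack of longs in a modelled hub RAM."""

    def __init__(self, hub, base):
        self.hub = hub        # dict standing in for hub RAM
        self.base = base
        self.ptr = base       # plays the role of PTRA or PTRB

    def push(self, value):
        # Models: wrlong value, ptra++
        self.hub[self.ptr] = value & 0xFFFFFFFF
        self.ptr += 4

    def pop(self):
        # Models: rdlong value, --ptra  wz
        # Returns (value, z), where z mirrors the Z flag from WZ.
        # Real hardware would not detect underflow; this model raises
        # KeyError if nothing was ever written at that address.
        self.ptr -= 4
        value = self.hub[self.ptr]
        return value, value == 0

    def depth(self):
        return (self.ptr - self.base) // 4

hub = {}
data_stack = HubStack(hub, 0x7000)     # would use PTRA
return_stack = HubStack(hub, 0x7800)   # would use PTRB

data_stack.push(1)
data_stack.push(2)
data_stack.push(0)
return_stack.push(0x1234)
popped = [data_stack.pop(), data_stack.pop(), data_stack.pop()]
```

The values come back in reverse order and the pointer returns to its base, so each of the `push1` and `pop1` subroutines collapses to a single inline hub instruction under these assumptions.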
