Optimized PFTH for the P2
bmentink
Posts: 107
@Dave Hein .... Request ... Pretty please.
Can you please provide an optimized version of pfth for the P2 that uses the new hardware, i.e. multiply/divide, the new branching instructions, and whatever else is new and shiny ...
That would be awesome. Also, a better way to include PASM blocks (objects) without having to cut and paste hex instructions would be great ... I know that is a much bigger ask.
I know the instructions are not locked down yet, but I am sure you could start on it and tweak it later ..
I would use it as my language of choice for the P2 ..... especially if it was optimized ...
Cheers,
Bernie
Comments
I have a version (I think I posted it) that uses the serin/serout instructions, the stack space and, I think, mul/div. It's either someplace in the locked P2 thread where the images are or the PFTH P2 thread (if there is one). I was doing comparisons as to how much COG space was recovered. It was pretty cool how much those helped.
I'll try and look later tonight. If all else fails, I'm sure it is on my PC someplace.
@mindrobots If you could dig that out and post it to this thread, that would be great.
This thread has three different versions of PFTH. I think they all work.
It might be in the main thread where I detailed the savings from version to version. You'd want to look around in the end of Feb 2014 to early March 2014 timeframe.
I think the SERIN/SEROUT and stack instructions really helped with PFTH but what do I know.
I'll probably put the February P2-HOT back on my DE2 just to play with PFTH more. But, again, what do I know!
You know heaps it seems!
So someone just needs to merge all three versions into one and then onto the mult/div optimizing ...
Wish I could help, but no hardware ..
Cheers,
Bernie
I stopped playing when P2-Hot went off the rails.
Currently it is:
innerloop               rdword  parm, pc wz
                if_z    jmp     #exitfunc
                        add     pc, #2
                        rdword  temp1, parm
                        jmp     temp1
pc                      long    @xboot_1+Q      ' Program Counter
PS: How do I do code blocks in this forum, can't see any options with the "advanced" editor
Bernie
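For anyone following along without hardware, here is a rough Python model of what that five-instruction inner loop does. It is only a sketch: the `hub` dictionary, the handler numbers, and the two example words (`FIVE`, `PLUS`) are made up for illustration and are not taken from pfth. In the real code each handler jumps back to `innerloop`; here it simply returns.

```python
# Hypothetical model of the inner loop above; addresses and words are invented.
def do_const(hub, parm, stack):
    # Body word follows the code field: push it as a constant.
    stack.append(hub[parm + 2])

def do_plus(hub, parm, stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

handlers = {1: do_const, 2: do_plus}   # stands in for cog code addresses

hub = {
    # thread being executed: FIVE FIVE PLUS, terminated by a zero word
    0x100: 0x200, 0x102: 0x200, 0x104: 0x210, 0x106: 0,
    # word FIVE: code field selects handler 1, body holds the value 5
    0x200: 1, 0x202: 5,
    # word PLUS: code field selects handler 2
    0x210: 2,
}

def inner_loop(hub, handlers, pc, stack):
    while True:
        parm = hub[pc]                       # rdword parm, pc wz
        if parm == 0:                        # if_z jmp #exitfunc
            return pc
        pc += 2                              # add pc, #2
        temp1 = hub[parm]                    # rdword temp1, parm
        handlers[temp1](hub, parm, stack)    # jmp temp1

stack = []
inner_loop(hub, handlers, 0x100, stack)
print(stack)
```

The two hub reads per word (one for the token, one for its code field) are the part a hardware indirect jump would shorten.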
I didn't get that far with P2-Hot, so maybe. :0) time for you to start studying P2 PASM! :0)
Yep, will do .... let's hope the current documentation doesn't change that much then.
It should be possible to use just two instructions for the inner loop, after all, did that in eForth (indirect threaded) on multiple uP's that supported indirect jumps, i.e Hitachi H8 back in the day ...
B.
PS: Maybe the documentation at: https://docs.google.com/document/pub?id=1-9OtOVRkTxojO-cAFMwpcZEUorkCU_uwi26EQZMV8uM#id.pwsgkg8pbdsq
is not that current. I don't see any reference to popx/pushx or popy/pushy, have these been added later? .. or are they just renamed popa/b
Bernie
If everything remains, this could be a seriously fast Forth beast ...
B.
The index registers PTRA/PTRB are still there so Hub stacks are still in the mix.
Thanks Cobber!
PS: Only 16-bit MUL/DIV, that is disappointing. I suppose CORDIC math has gone as well?
My guess is that the CORDIC engine will be designed to take 16 cycles for a result, and all Cogs can simultaneously have one calculation each running in the pipe.
If anyone is chatting to Chip, please talk him into some decent maths .... I know he is making compromises, but the basic stuff???
That seems the right way to manage it.
There used to be two separate multipliers per Cog, a slower inline 32x32 and a faster 24x24 in the MAC/CORDIC unit.
Yes, it IS the right way to go. My complaint, if I am not being clear, is the lack of 32-bit MUL and DIV functions in hardware. I don't care if they are in the cog or hub, as long as they are done in hardware. Are you saying the "higher maths" functions include 32-bit MUL/DIV?
Also, for Forth we **need** the PUSHX/POPX style instructions for both stacks. To make Forth efficient, we shouldn't have to muck about with incrementing/decrementing pointers ..
B.
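To put a number on what a 16-bit-only multiplier costs, here is a small Python sketch of how the low 32 bits of a 32x32 product (what a Forth `*` needs) can be assembled from 16x16 multiplies. The `mul16` helper is only a stand-in for an unsigned 16x16 hardware multiply; the names are assumptions for illustration, not P2 instructions.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul16(a, b):
    # Stand-in for an unsigned 16x16 -> 32-bit hardware multiply.
    return (a & MASK16) * (b & MASK16)

def mul32_low(a, b):
    # Split each operand into 16-bit halves.
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    # Only the low 16 bits of the cross terms survive the shift by 16;
    # a_hi * b_hi lands entirely above bit 31 and is dropped.
    cross = (mul16(a_lo, b_hi) + mul16(a_hi, b_lo)) & MASK16
    return (mul16(a_lo, b_lo) + (cross << 16)) & MASK32

print(hex(mul32_low(0x12345678, 0x9ABCDEF0)))
```

So a single-cell multiply takes three 16x16 multiplies plus a few adds and shifts, instead of one instruction.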
PS: I see there are PUSH and POP instructions that work on a 4-level stack, shame that is not larger, maybe then that could be used for the data and return stacks. Simplest would be to put back the POPX(Y)/PUSHX(Y) instructions ....
Putting PUSHX/POPX style stack functions back in adds a lot of silicon to the new leaner design. (16 cogs x 256 longs of aux ram).
Don't get me wrong though, I really liked the PUSHX,POPX stuff you speak of.
The advantage of shifting them out of the COG, is the high silicon cost is not x16, plus they are not often needed on a 1 cycle basis.
Then there is the question of where such pointers should point: COG RAM, HUB RAM, or even an adjacent COG's RAM (allowing an unused COG to donate that useful RAM)?
Thanks. So with the PTRA/PTRB pointers, how are they set to auto increment (PUSH) and auto decrement (POP), won't that have to be done dynamically .. (read: .. i.e add more instructions), instead of the one PUSH or POP?
Or, is there something fancy done with the CALLA and RETA instructions ... don't get it ....
PS: Don't need 256Longs per cog for the Forth stacks ... that is overkill. (4..8 for the Data stack, 32 maybe for the Return stack)
Cheers,
Bernie
Well, for PFTH (this topic) each stack is in each cog I believe (please correct me if that is wrong), in which case the pointer needs to be on each cog.
Adjacent cog ram would be slow, the access would have to be through the hub, right?
In the old P2 the auto increment was done simply in the same instruction. The feature used some of the condition bits of the opcode to implement the inc/dec stuff.
A very nice feature
Cheers
Brian
P.S. Or the condition bits were used for INDA/INDB?, my memory is a bit foggy today
Here is a snippet from the old P2 docs.
Great! I did not realize those instructions were so flexible .... so the stacks would be in Hub in that case ....
Cheers, B.
A quick look back at the old notes and the indirect stuff used the CCCC bits in the opcode not the PTRx stuff.
The indirect feature was nice too, but I don't think that made the cut in the new P2.
Could be written inline as: wrlong parm1,ptra++
... and the code:
Could be written inline as: rdlong parm1,--ptra wz
That will speed things up. (assumes ptra has been setup to point to the Data stack of course)
PS: Does PASM support aliases? Then we could put PUSHD and POPD as aliases for the above ...
The POP2 and POP1 functions could be similar ... as well as all the other places the data stack is used.
PTRB would then be used for the return stack, and again the instruction would be inlined ...
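A quick Python model of the two inlined instructions, to show that a post-increment write and a pre-decrement read behave as a push/pop pair. This is a sketch only: it assumes the pointer steps by 4 bytes per long, and the `DataStack` class and its `hub` dictionary are invented for illustration.

```python
LONG = 4  # assumed pointer step for a long access, in bytes

class DataStack:
    def __init__(self, hub, base):
        self.hub = hub        # hypothetical hub RAM, keyed by byte address
        self.ptra = base      # points at the next free slot

    def push(self, value):
        # wrlong parm1, ptra++  : write, then advance the pointer
        self.hub[self.ptra] = value & 0xFFFFFFFF
        self.ptra += LONG

    def pop(self):
        # rdlong parm1, --ptra wz  : step back, then read; Z set if value is 0
        self.ptra -= LONG
        value = self.hub[self.ptra]
        return value, value == 0

ds = DataStack({}, 0x1000)
ds.push(7)
ds.push(0)
print(ds.pop(), ds.pop(), hex(ds.ptra))
```

A return stack on PTRB would look the same with the other pointer.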