LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)
Bill Henning
Greetings!
As Dave and Ariba have already started posting about their experiments, I think it is time for me to post about the LMM ideas I am experimenting with for Propeller 2.
Please note the following thought experiments have not been tried on my DE0-Nano yet.
One of the problems Parallax (and the GCC team) will face is that Propeller 2 is a very different beast from Propeller 1. To get the most out of the Propeller 2 with LMM, the compiler will need some fairly significant changes.
LMM on Propeller 1, not counting FCACHE etc., executes a native 4-cycle instruction in 20 clock cycles (4-way unrolled LMM loop) or 18 cycles (8-way unrolled LMM loop). (To save space, I will refer to Propeller 1-style LMM as "LMM1".)
Without pipelining, Propeller 1 can natively execute instructions at up to 20MIPS @ 80MHz.
With its pipelining, Propeller 2 can natively execute instructions at up to 160MIPS @ 160MHz.
I've been scratching my head trying to make LMM2 far more efficient.
I've also been trying to figure out a "stepping stone" version that will require very few changes to GCC, in order to get GCC on Propeller 2 running as quickly as possible - mind you, it will run much slower than an LMM2 that is optimized for the Propeller 2 architecture.
1) LMM1 GCC compatibility mode - LMM2 experiment #1
next    rdlong  ins,pc              ' could be rdlongc
        jmpd    #next
        add     pc,#4
ins     nop
The above simple kernel will execute LMM1 code at 8 cycles per instruction (ignoring the longer execution times of hub instructions and the like) - and as Propeller 2 is pipelined, this means up to 20MIPS @ 160MHz - basically, LMM1 code will run at the same speed on Propeller 2 as native cog code does on Propeller 1 (assuming 160MHz and 80MHz clock speeds respectively), for a native "efficiency" factor of 1/8. (OK, JMP and CALL will take 3 cycles inside, but still fit in the 8 cycle hub window.)
A simple change may boost this, but I have to check that the pipeline will allow the extra speed:
again   reps    #511,#4
        nop                         ' delay slot
next    rdlongc ins,pc
        nop                         ' delay slot - useful for profiling, ie "ADD LMMOPS,#1"
        add     pc,#4
ins     nop
        jmp     #again              ' we will break the REPS with any JMP or CALL
(I think the above is almost exactly the same as Dave's and Ariba's version)
Due to the pipeline delays, this new version is not much faster - but every little bit helps. This version can theoretically execute four simple P2 instructions in 8+5+5+5 = 23 cycles once the hub is in sync, for a 4/23 maximum "efficiency" - however, due to hub windows, it would actually take 24 cycles to execute the four instructions, for a 4/24 rate.
This version should still be mostly GCC Prop1 compatible.
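For reference, the classic LMM1 far-jump primitive should keep working unchanged with this kernel (a sketch):

' classic LMM1 far jump - GCC emits "jmp #FJMP" followed by "long target"
FJMP    rdlong  pc,pc               ' pc already points at the long after the jmp
        jmp     #next               ' resume the fetch/execute loop at the new pc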
2) LMM2 experiment #2
This version tries to improve on the efficiency of (1) by discarding PropGCC1 (shorthand for Propeller 1 GCC) compatibility.
Actually, for best effect it requires significant changes to GCC, to support the "VLIW"-like mode it makes possible (more on this later).
loopy   reps    #511,#6
        nop                         ' delay slot
next    rdlongc ins1,[ptra++]
        rdlongc ins2,[ptra++]
        nop                         ' delay slot
        nop                         ' delay slot - or use a JMPD here if you don't want to use REPS
ins1    nop
ins2    nop
        jmp     #loopy              ' so we go back to REPS when a JMP/CALL busts the REPS loop
The changes here are:
- using PTRA as the program counter gives free increments
- using REPS gives us free looping
- the delay slots can be used to implement interrupts
As long as the pipelining works out the way I expect, this would execute four native simple instructions in 8+6 = 14 cycles, for a 4/14 potential efficiency.
Hub syncing will reduce this to an actual best rate of 4/16, i.e. 40MIPS max at 160MHz.
VLIW mode:
With appropriate changes, PropGCC could support a VLIW mode where ins1/ins2 were treated as either a single 64 bit instruction, or two 32 bit instructions; note all code would have to be generated on eight byte boundaries for this to work well.
This would bring a major advantage to GCC code generation - here are just a couple of examples:
MVI REGx,##longvalue   --->   MOV REGx,ins2       ' longvalue travels in the ins2 slot
JMP #FJMP              --->   SETPTRA ins2        ' hubaddr travels in the ins2 slot
LONG hubaddr
etc.
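To make the layout concrete, here is a sketch of what the generated pairs might look like in the hub (names are hypothetical, and the compiler or kernel would have to guarantee the data slot is never executed as an instruction):

' hypothetical 8-byte-aligned VLIW pairs in hub memory
        mov     REGx,ins2           ' slot 1: executed; copies the fetched literal
        long    $12345678           ' slot 2: data only - replaces MVI REGx,##$12345678

        setptra ins2                ' slot 1: executed; re-points the program counter
        long    @target             ' slot 2: data only - replaces JMP #FJMP / LONG hubaddr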
3) LMM2 experiment #3 - "Off-stride"
This is the most I can pack into an 8 cycle hub window.
loopy   reps    #511,#6
        nop
next    rdlongc ins1,[ptra++]
        rdlongc ins2,[ptra++]
        rdlongc ins3,[ptra++]
ins1    nop
ins2    nop
ins3    nop
        jmp     #loopy              ' used when REPS loop is broken
The good news:
It has a 3/8 efficiency, giving up to 60MIPS!
The bad news:
You lose the nice fast "MVI" and "FJMP" capability (without a LOT of compiler headaches)
Now, I am not certain that ins2 and ins3 will be ready to execute in time, in which case priming and inverting the loop should work:
loopy   reps    #511,#6
        nop
ins1    nop
ins2    nop
ins3    nop
        rdlongc ins1,[ptra++]
        rdlongc ins2,[ptra++]
        rdlongc ins3,[ptra++]
        jmp     #loopy              ' used when REPS loop is broken
4) LMM2 experiment #4 - Things get weird...
(Baggers asked if anyone thought of using READQUAD...)
top     reps    #511,#12
        setquad #block1
next2   rdquad  [ptr++]
        setquad #block2
        nop                         ' delay slot
block1  nop
        nop
        nop
        nop
next1   rdquad  [ptr++]
        setquad #block1
        nop                         ' delay slot
block2  nop
        nop
        nop
        nop
        jmp     #top
I think this would not be very nice to generate code for, but it does have an 8/16 (50%) potential efficiency - 80MIPS
For best results, it would have to be treated as a 128 bit VLIW machine, and block1 should be primed before entering the loop.
5) LMM2 experiment #5 - Even weirder...
My attempt at shrinking #4 - but I need to make sure it will still execute in 8 cycles (when synced to the hub)
top     reps    #511,#6
        setquad #block
block   nop
        nop
        nop
        nop
        jmpd    #block
        rdquad  [ptr++]
        jmp     #top
The block must be primed, and I am not sure the pipelining would work (i.e. keep it at 8 cycles), but if it does, this will give us 4/8, i.e. 50% efficiency, for 80MIPS as well.
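Priming might look something like this (a sketch - the exact number of lead-in slots depends on the pipeline):

' prime the quad before entering the loop above (timing assumed)
        setquad #block              ' map the quad cache over the block window
        rdquad  [ptr++]             ' fetch the first four LMM instructions
        nop                         ' give the quad time to land before
        nop                         ' the first pass through "block"
        nop
        jmp     #top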
6) LMM2 experiment #6+ - And now, for something completely different...
All kidding aside, the other ideas I've been playing with get really weird, and do not exceed 50% potential efficiency.
For all experiments, FCACHE/FLIB can add a huge boost.
Apologies if I get the syntax wrong on any of the instructions above, I will fix errors as I find out about them
UPDATE
Releasing the Prop2 Terasic boards was an extremely useful thing to do.
While testing the RDQUAD-based LMM2 ideas above, Chip was able to find and fix a small error in the Verilog sources, potentially saving the cost of an additional shuttle run.
Andy noticed the first sign of a problem, and I and others were able to verify it.
After Chip's change, single cycle instructions would work, but unexpected issues cropped up with multi-cycle instructions.
There is a potentially useful work-around; however, it reduces the maximum MIPS from 80 to 60, as it executes only three instructions per quad but provides a 32 bit constant or address for them - which removes the need for MVI, FJMP and many other constructs.
I'll add that code here shortly, and I plan to verify that it works for both single and multi-cycle instructions.
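Until then, here is a rough sketch of the quad layout such a work-around implies (instruction choices and names are hypothetical):

' each quad: three executed slots plus one 32 bit data slot
i1      add     reg1,#1             ' slot 1: executed
i2      mov     reg2,i4             ' slot 2: executed - reads the 32 bit literal from slot 4
i3      add     reg3,reg2           ' slot 3: executed
i4      long    $12345678           ' slot 4: constant or branch target, never executed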
I am very happy to see how people jumped in with both feet to test P2!
Update:
I may have gotten it running using RDQUAD with 4 instructions executed per RDQUAD!
---
Let me know what you think! Your input is most appreciated...
---
Notes:
All methods shown above will do much better with FCACHE and FLIB added into the equation; the efficiency figures above are for straight in-line code.
(I've been playing with LMM2 on paper since Chip started posting the instructions - but I did not want to post until I had enough information on the pipelining)
Comments
To save typing, and to make it easier for all of us to be specific about exactly which variant we are talking about, I'll refer to the variants by short names (LMM2P, LMM2Q, and so on).
In case any of you are wondering, I did not publish LMM2P as it required significantly more compiler work than LMM2Q and performed more poorly than LMM2Q due to additional overhead and thrashing.
THE TORTURE TEST EXPLAINED
Currently, the torture test consists of the following eight instructions, repeated 256 times in the hub (2k instructions, 8k hub used)
Due to how READQUAD works, the LMM2 loop will alternately execute i1-i4, then i5-i8 ensuring that RDQUAD reads a different group of four instructions on every loop iteration
i1-i4 are simple adds to counters, which are later deposited into result1-result4
i5-i7 increment result6 in the hub
i8 does an extra increment of result1
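A reconstruction of the eight-instruction block, based on the description above (register names are my guesses):

i1      add     count1,#1           ' i1-i4: simple adds to counters, later
i2      add     count2,#1           '        deposited into result1-result4
i3      add     count3,#1
i4      add     count4,#1
i5      rdlong  temp,result6_ptr    ' i5-i7: increment result6 in the hub
i6      add     temp,#1
i7      wrlong  temp,result6_ptr
i8      add     count1,#1           ' i8: extra increment of the counter behind result1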
The LMM2 loop's RDQUAD is executed 256 times (so we get nice, easy-to-read cycle counts in hex); however, the last four instructions fetched are not executed, and the first four are executed as NOPs, as they would not have been fetched yet.
The first pass through the loop executes four NOPs (thanks to SETQUAZ).
This means that i1-i4 get executed 128 times, and i5-i8 get executed 127 times.
Therefore the mathematically correct expected results (in hex) are:
result 1: $0000FE ($7F+$7F)
result 2: $000080
result 3: $000080
result 4: $000080
result 5: $000FF7 *** approximate
result 6: $00007F
256*8 RDQUAD cycles ($800), 127*8 RDLONG cycles ($3F8), and 127*8 WRLONG cycles ($3F8), as after the first hub access all will hit hub sync sweet spots
$800+$3F8+$3F8 = $FF0 cycles
NOTE:
Other instructions can be tried in i1-i8, but then predicted results should be calculated for them.
This post is reserved for source code, links and FAQs.
- added LMM2 using RDQUAD test
- added LMM2 using RDQUAD test 2
- added LMM2 using RDQUAD test 3 to verify Andy's results
- added what looks like verified working RDQUAD LMM code
- added torture test for latest Nano loadable that matches the instructions above
In theory, should be better.
In practice, does not look like it, as it is very awkward...
I'll add it to the top thread.
Edit: No - a quick read-up on it shows you are correct.
I did mention it was a bit weird... had to be, due to how the pipelining works.
I'll be trying these experiments on my DE0-Nano, but it's a bit painful, as I can't use tasks to do serial out if I want accurate cycle counts; I have to keep dumping out to the monitor and examining the hub.
But it is still a lot of fun :-) :-) :-)
I am unsure if this works or not, but Chip said that we can set the quad cache into a window of cog RAM. If those 4 longs were windowed into the LMM loop as instr1-4, I wonder if they would be able to be executed from there? Of course, this makes the LMM compiler's job more difficult, but if we are looking for every last bit of speed it could be worth looking into.
(4) and (5) in post #1 use RDQUAD (cache mapped into the cog), and assuming they work, they will be able to execute code from the hub at 50% of native cog speed (much more with FCACHE etc.). I think I may have to remove the delay slots from (4) to make it fit within the hub window; they are not needed anyway due to how I use two RDQUADs.
And you are absolutely correct, it does require more compiler work.
I think this is what can be done:
Note that there are 8 possible permutations of the repeated block, made by putting the last instruction where the first one is, and so on.
Bill, I'm not totally sure about the timing on this, as I haven't done the experiment in a while, but this could be verified. If it is, indeed, the 5th instruction after the RDQUAD when the mapped QUADs become executable, you would want to put the QUAD window right after the RDQUAD instruction. The funny thing is that it would always be executing the QUAD from the RDQUAD before the RDQUAD just above the QUAD window.
I will try it - as long as I can get 4 LMM instructions running in 8 clocks I'll be happy - regardless of how I need to arrange them :-)
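Something like this, if I understand the timing correctly (a sketch - syntax and priming details assumed):

' quad window placed right after the RDQUAD; its four slots execute the
' quad fetched by the *previous* iteration (after an initial SETQUAD
' #quad and a priming RDQUAD)
loop    rdquad  [ptra++]            ' start fetching quad N+1
quad    nop                         ' these four slots execute quad N
        nop
        nop
        nop
        jmp     #loop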
If you squint at it just the right way... in a way it becomes pipelined LMM code.
here is the code fragment:
Adding "ins6" makes the 256 loops take 4096+ cycles
Kye - yes, the delay slots can definitely be used to simulate interrupts, as previously posted.
Experiment #3 is also verified!!!
768 LMM instructions executed in 2049 cycles!
Executed the RDQUAD loop in 801 cycles, so it really is possible to put 7 instructions between hub-synced RDQUAD's!
I also verified that on the first go-around it only starts executing fetched code at ins6, so it needs 5 delay slots before the first executable slot.
Note - any JMP or CALL out of the loop will have to "prime" ins1-ins4 with NOP's or do a RDQUAD ptr++ at least 5 cycles ahead.
Basically, this is a pipelined LMM engine.
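For example, a far-jump primitive for this engine might look like this (a sketch - register names and delay counts assumed):

' hypothetical FJMP: re-point the fetch pointer, then re-prime the
' pipeline before re-entering the loop
FJMP    setptra jmpdest             ' jmpdest holds the new hub address
        rdquad  [ptra++]            ' start fetching the first quad at the target
        nop                         ' 5 delay slots before fetched code is
        nop                         ' executable, per the measurement above
        nop
        nop
        nop
        jmp     #loopy              ' re-enter the LMM loop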
This runs as-is on DE0-Nano, and should run on the DE2-115
- use pnut.exe to download and run it
- when the test is finished, it drops back into the monitor
- use the 'n' command to switch to longs for output
- type '2000.2007' to see the two counters
- the first long is how many cycles 256 iterations of the quad LMM loop took, it is normally between 801-803
- the second long is how many LMM instructions were executed; it will be $3FC (the first iteration of the loop executes NOPs)
THIS WAS FUN!
One minor thing: after doing a 'GETCNT reg' and some instructions to time, you can end with a 'SUBCNT reg' and you'll have the difference in reg.
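In other words, a minimal timing sketch:

        getcnt  reg                 ' capture the system counter
        ' ... instructions to time ...
        subcnt  reg                 ' reg now holds the elapsed cycle count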
I don't think it's directly comparable: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3885.html
I got,
>2000.2007
02000- 02 08 00 00 FC 03 00 00
Now it needs a variety of instructions running..
Do you have a price? Or what was your guess?
At UPEW, a rough estimate of no more than 1.5x Prop1 cost was given, so my guess is $12 or less quantity one for Prop2
http://www.digikey.com/product-detail/en/LPC1785FBD208,551/568-7573-ND/2677567
120MHz ARM for $12.00 at Digikey
I suspect that is competitive with a single cog if we also use FCACHE... but we will have 7 extra cogs as well
Chip's suggested replacement for 4 and 5 was simpler, and while it is possible to get 4&5 running with the correct alignment, simple is good.
All those 'add count,#1' did prove the code ran, now we can try other stuff.
Mind you, adding FCACHE, FLIB etc will increase the performance even more; I expect 99% native speed for FCACHE-friendly code.
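For anyone new to the idea, here is a P1-flavored sketch of FCACHE (the P2 version would differ in detail) - the kernel copies a short hub loop into cog RAM once, then runs it at full native speed:

' P1-style FCACHE loader (sketch): "cache" is reserved cog RAM,
' dst_one = 1<<9 (one unit in the destination field)
FCACHE  rdlong  count,pc            ' block length in longs follows the call
        add     pc,#4
        movd    :wr,#cache          ' point the copy instruction at the cache
:wr     rdlong  0-0,pc              ' copy one long from hub into cog RAM
        add     pc,#4
        add     :wr,dst_one         ' advance the destination field
        djnz    count,#:wr
        jmp     #cache              ' cached code runs at native speed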
Type 'n' into the monitor to switch to long mode. Then do '2000.2007'. Or, all at once you can type 'n2000.2007'.
(I used identical instructions so I could generate the LMM code easily)
Do you think something like the following would be enough to test the potential latency issue:
I'd repeat it 512 times, and execute all 4k instructions
It will work out well to have a RDQUAD followed immediately by the QUAD window, but the instructions executing in the QUAD window won't be from the RDQUAD just executed, but from the RDQUAD executed in the prior iteration.
I noticed the pipelining effect - the $800 loops only executed "add count,#1" $3FC times, due to executing the NOPs the first time around.
This has implications for FJMP, FCALL, FRET etc, but it is manageable.
That might do it. I'm thinking if you did something like...
        WRBYTE  h00,PTRB++
        WRBYTE  h01,PTRB++
        WRBYTE  h02,PTRB++
        WRBYTE  h03,PTRB++
        WRBYTE  h04,PTRB++
        WRBYTE  h05,PTRB++
        WRBYTE  h06,PTRB++
        WRBYTE  h07,PTRB++
        <repeat>
...you could see very easily if you were executing part of one RDQUAD and part of another.
I think you'd just throw away the last RDQUAD's four instructions, kind of like the pipeline throws away three instructions on a branch.