Some P2 benchmark results
ersmith
All of these times were measured with the fastspin compiler on both the P1 and the P2, so we're comparing apples to apples. The P2 results were obtained on my DE2-115, running at an 80 MHz clock (the same as the P1, which was on a C3 board).
                 P1                            P2
hub fft_bench:   21622192 cycles (270 ms)      4324532 cycles (54 ms)
hub fibo(20):    10507664 cycles (131 ms)      4053316 cycles (50 ms)
cog fibo(20):    2977168 cycles (37 ms)        2276664 cycles (28 ms)
hub xxtea:       116384 cycles                 23450 cycles
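For anyone who wants to check the arithmetic, here is a rough Python sketch that converts the cycle counts above to milliseconds at the 80 MHz clock and works out the P1-to-P2 ratio. The names `results`, `cycles_to_ms` and `speedup` are made up for this illustration; they are not part of the benchmarks.

```python
CLOCK_HZ = 80_000_000  # both chips were clocked at 80 MHz

# (P1 cycles, P2 cycles), copied from the results above
results = {
    "hub fft_bench": (21622192, 4324532),
    "hub fibo(20)": (10507664, 4053316),
    "cog fibo(20)": (2977168, 2276664),
    "hub xxtea": (116384, 23450),
}

def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock."""
    return cycles * 1000.0 / clock_hz

def speedup(p1_cycles, p2_cycles):
    """How many times fewer cycles the P2 needed than the P1."""
    return p1_cycles / p2_cycles

for name, (p1, p2) in results.items():
    print(f"{name:14s} P1 {cycles_to_ms(p1):8.3f} ms   "
          f"P2 {cycles_to_ms(p2):8.3f} ms   speedup {speedup(p1, p2):.2f}x")
```

By this arithmetic fft_bench and xxtea both come out close to 5x, hub fibo at about 2.6x, and cog fibo at about 1.3x.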
I think for both xxtea and fft_bench the hardware multiply is making a big difference. Also, of course, hubexec is a lot faster than LMM; we're not seeing as much of a speedup on the COG test. OTOH since Fibonacci is recursive both the P1 and P2 are hitting the stack (in hub) heavily, so they're quite memory bound.
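To get a feel for how hard the recursion leans on the stack, here is a small Python sketch that counts the calls made by a naive recursive Fibonacci and divides the measured cycle counts by that number. It assumes the benchmark uses the plain two-way recursion with a base case of n < 2; that is an assumption, not something confirmed from the benchmark source, and `fibo_calls` is a name invented for this example.

```python
def fibo_calls(n):
    """Return (fibo(n), number of calls made) for the naive recursion.

    Assumed shape of the benchmark: fibo(n) = fibo(n-1) + fibo(n-2), n < 2 returns n.
    """
    if n < 2:
        return n, 1
    a, calls_a = fibo_calls(n - 1)
    b, calls_b = fibo_calls(n - 2)
    return a + b, calls_a + calls_b + 1

value, calls = fibo_calls(20)
print(f"fibo(20) = {value}, calls = {calls}")

# Cycle counts for fibo(20), copied from the results above
measured = {
    "P1 hub": 10507664,
    "P2 hub": 4053316,
    "P1 cog": 2977168,
    "P2 cog": 2276664,
}
for name, cycles in measured.items():
    print(f"{name}: {cycles / calls:.1f} cycles per call")
```

Under that assumption fibo(20) makes 21891 calls, each of which pushes and pops its frame in hub memory, so the per-call cost is a rough proxy for how much time goes into stack traffic rather than arithmetic.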
Eric
Comments
Even if the assembly code was exactly the same, P2 only has 2 clocks per instruction where P1 is 4...
Plus, I'd think the new instructions could make a lot of code faster too...
I suspect that's due to the branching, which will take 5 clocks each.
Edit: see later comment.
P1: 4 cycles when branching, 8 cycles when not branching
P2: 4 cycles when branching, 2 cycles when not branching
Edit: corrected.
Branches take 4 when taken, 2 when not taken.
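Taking the timings quoted in this thread at face value, a rough cycle model shows why branch-heavy code sees less than the 2x you would expect from the instruction clock alone. This is only a sketch in Python: it ignores hub access entirely, it assumes the loop ends in a conditional branch that follows the figures above, and `TIMING` and `loop_cycles` are names invented for the example.

```python
# Per-instruction cycle costs as quoted in this thread (hub access ignored).
TIMING = {
    "P1": {"plain": 4, "branch_taken": 4, "branch_not_taken": 8},
    "P2": {"plain": 2, "branch_taken": 4, "branch_not_taken": 2},
}

def loop_cycles(chip, body_instructions, iterations):
    """Cycles for a loop of plain instructions closed by one conditional branch.

    The branch is taken on every pass except the last, where it falls through.
    """
    t = TIMING[chip]
    body = body_instructions * t["plain"] * iterations
    branches = (iterations - 1) * t["branch_taken"] + t["branch_not_taken"]
    return body + branches

for body in (1, 8):
    p1 = loop_cycles("P1", body, 1000)
    p2 = loop_cycles("P2", body, 1000)
    print(f"body={body}: P1 {p1} cycles, P2 {p2} cycles, ratio {p1 / p2:.2f}")
```

With a one-instruction body the model gives roughly 1.3x, and with eight instructions per pass it gets closer to 1.8x, so the tighter the loop, the less the 2-clock instructions help.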
The stack is in hub in both cases, so yes, the cog fibo needs hub access, and I think that's probably the bottleneck.
(Runs so as not to suffer the slings and arrows of outrageous comments)
stack over/under flows.
He is Australian... what do you expect?