Some P2 benchmark results

ersmith · 2016-05-09 11:37

All of these times were measured with the fastspin compiler on both P1 and P2, so we're comparing apples to apples. The P2 results were obtained on my DE2-115 running at 80 MHz, the same clock as the P1 (which was on a C3 board).

               P1                        P2
hub fft_bench: 21622192 cycles (270 ms)  4324532 cycles (54 ms)
hub fibo(20):  10507664 cycles (131 ms)  4053316 cycles (50 ms)
cog fibo(20):   2977168 cycles (37 ms)   2276664 cycles (28 ms)
hub xxtea:       116384 cycles             23450 cycles
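
As a quick cross-check of the table, the cycle counts can be converted to times and P1/P2 speedup ratios. This is a minimal sketch in Python, assuming the 80 MHz clock stated above for both chips; the names `CLOCK_HZ`, `RESULTS`, `to_ms` and `speedup` are made up for illustration.

```python
# Cycle counts copied from the table above: (P1 cycles, P2 cycles).
CLOCK_HZ = 80_000_000  # assumption: both boards ran at 80 MHz, as stated in the post

RESULTS = {
    "hub fft_bench": (21622192, 4324532),
    "hub fibo(20)": (10507664, 4053316),
    "cog fibo(20)": (2977168, 2276664),
    "hub xxtea": (116384, 23450),
}

def to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock."""
    return cycles * 1000.0 / clock_hz

def speedup(p1_cycles, p2_cycles):
    """Ratio of P1 cycles to P2 cycles (same clock, so also the time ratio)."""
    return p1_cycles / p2_cycles

if __name__ == "__main__":
    for name, (p1, p2) in RESULTS.items():
        print(f"{name:14s} P1 {to_ms(p1):8.2f} ms  P2 {to_ms(p2):8.2f} ms  "
              f"speedup {speedup(p1, p2):.2f}x")
```

Run this way, fft_bench and xxtea come out at roughly 5x, hub fibo(20) at about 2.6x, and cog fibo(20) at about 1.3x.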

I think the hardware multiply is making a big difference for both xxtea and fft_bench. Also, of course, hubexec is a lot faster than LMM, which is why we're not seeing as much of a speedup on the COG test. OTOH, since Fibonacci is recursive, both the P1 and P2 are hitting the stack (in hub) heavily, so they're quite memory bound.

Eric

Rayman · 2016-05-09 13:01

Strange that cog fibo is not at least 2x faster on P2.
Even if the assembly code was exactly the same, P2 only has 2 clocks per instruction where P1 is 4...

Plus, I'd think the new instructions could make a lot of code faster too...

Seairth · 2016-05-09 13:50

Rayman wrote: »

Strange that cog fibo is not at least 2x faster on P2.
Even if the assembly code was exactly the same, P2 only has 2 clocks per instruction where P1 is 4...

Plus, I'd think the new instructions could make a lot of code faster too...

I suspect that's due to the branching, which will take 5 clocks each.

Edit: see later comment.

Dave Hein · 2016-05-09 13:58

Also, hub accesses occur every 16 cycles on both the P1 and P2, so stack accesses are the same.

Rayman · 2016-05-09 17:06

Think branching is 8 cycles on P1. Does the cog fibo need hub access? Thought the name implied it was only in cog...

Seairth · 2016-05-09 17:31

Rayman wrote: »

Think branching is 8 cycles on P1. Does the cog fibo need hub access? Thought the name implied it was only in cog...

P1: 4 cycles when branching, 8 cycles when not branching
P2: 4 cycles when branching, 2 cycles when not branching

Edit: corrected.

cgracey · 2016-05-09 17:39

Seairth wrote: »

Rayman wrote: »

Think branching is 8 cycles on P1. Does the cog fibo need hub access? Thought the name implied it was only in cog...

P1: 4 cycles when branching, 8 cycles when not branching
P2: 5 cycles when branching, 2 cycles when not branching

On the P2, branches take 4 clocks when taken, 2 when not taken.
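
To see how these numbers affect a tight loop, here is a rough sketch in Python that totals the clocks spent in a loop-closing branch, using only the per-branch figures quoted in this thread (P1: 4 taken, 8 not taken; P2: 4 taken, 2 not taken). The function name and the loop shape are illustrative assumptions, and the cost of the loop body is left out.

```python
# Branch costs in clocks as quoted in this thread: (taken, not_taken).
BRANCH_CLOCKS = {
    "P1": (4, 8),
    "P2": (4, 2),
}

def loop_branch_clocks(chip, iterations):
    """Clocks spent in the loop-closing branch of a loop run `iterations` times.

    The branch is taken on every pass except the last, where it falls through.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    taken, not_taken = BRANCH_CLOCKS[chip]
    return (iterations - 1) * taken + not_taken

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, loop_branch_clocks("P1", n), loop_branch_clocks("P2", n))
```

With these figures a taken branch costs the same number of clocks on both chips, so branch-heavy code would gain less than the 2x that straight-line code suggests.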

ersmith · 2016-05-09 18:58

Rayman wrote: »

Think branching is 8 cycles on P1. Does the cog fibo need hub access? Thought the name implied it was only in cog...

The stack is in hub in both cases, so yes, the cog fibo needs hub access, and I think that's probably the bottleneck.
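
One way to put a number on this is to divide the measured cycle counts by the number of calls fibo(20) makes. The sketch below (Python) assumes the benchmark is the usual doubly recursive Fibonacci with base cases for n < 2; the benchmark source is not shown in this thread, so that is a guess, and the function names are illustrative.

```python
def fibo_calls(n):
    """Number of calls made by a naive recursive fibo(n).

    Assumes the shape: fibo(n) = n if n < 2 else fibo(n-1) + fibo(n-2).
    """
    calls = [1, 1]  # fibo(0) and fibo(1) are each a single call
    for _ in range(2, n + 1):
        calls.append(1 + calls[-1] + calls[-2])
    return calls[n]

# Cycle counts for fibo(20) from the first post.
CYCLES = {
    "P1 hub": 10507664,
    "P2 hub": 4053316,
    "P1 cog": 2977168,
    "P2 cog": 2276664,
}

def cycles_per_call(total_cycles, n=20):
    """Average cycles spent per call, including all stack traffic."""
    return total_cycles / fibo_calls(n)

if __name__ == "__main__":
    print("calls:", fibo_calls(20))
    for name, cycles in CYCLES.items():
        print(f"{name}: {cycles_per_call(cycles):.1f} cycles per call")
```

Under that assumption fibo(20) makes 21891 calls, which works out to about 136 cycles per call for the P1 cog version and 104 for the P2 cog version.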

mindrobots · 2016-05-09 21:48

So, does this mean being able to support a stack in LUT would remove significant bottlenecks for a lot of use cases?

(Runs so as not to suffer the slings and arrows of outrageous comments)

ozpropdev · 2016-05-10 01:41

Implementing a 16-deep stack in LUT RAM:

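		wrlut	value,lutsp	'push: write value to LUT at lutsp
		incmod	lutsp,#15	'advance pointer, wrapping 15 -> 0

		decmod	lutsp,#15	'pop: step pointer back, wrapping 0 -> 15
		rdlut	value,lutsp	'read value from LUT at lutsp
.
.
lutsp		long	0		'LUT stack pointer (0..15)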

For debug purposes you can take advantage of the WC bit to check for stack overflows and underflows.

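		wrlut	value,lutsp	'push to lut
		incmod	lutsp,#15 wc	'C set if pointer wrapped 15 -> 0
	if_c	add	stack_overflow,#1

		decmod	lutsp,#15 wc	'pop from lut, C set if pointer wrapped 0 -> 15
		rdlut	value,lutsp	'no WC here, so C from decmod is preserved
	if_c	add	stack_underflow,#1
.
.
lutsp		long	0		'LUT stack pointer (0..15)
stack_overflow	long	0		'count of push wraps
stack_underflow	long	0		'count of pop wraps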
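
To check when the carry flag fires, the push/pop sequences above can be modeled in a few lines of Python. This is a sketch, not a simulator: it assumes INCMOD sets C and wraps to 0 when the pointer already equals the limit, and DECMOD sets C and wraps to the limit when the pointer is 0. The class and method names are invented for illustration.

```python
class LutStack:
    """Toy model of the 16-deep LUT stack (pointer wraps modulo limit + 1)."""

    def __init__(self, limit=15):
        self.limit = limit
        self.lut = [0] * (limit + 1)
        self.lutsp = 0
        self.stack_overflow = 0
        self.stack_underflow = 0

    def _incmod(self):
        # Assumed INCMOD behaviour: wrap to 0 and set C when pointer == limit.
        if self.lutsp == self.limit:
            self.lutsp = 0
            return True
        self.lutsp += 1
        return False

    def _decmod(self):
        # Assumed DECMOD behaviour: wrap to limit and set C when pointer == 0.
        if self.lutsp == 0:
            self.lutsp = self.limit
            return True
        self.lutsp -= 1
        return False

    def push(self, value):
        self.lut[self.lutsp] = value      # wrlut value,lutsp
        if self._incmod():                # incmod lutsp,#15 wc
            self.stack_overflow += 1      # if_c add stack_overflow,#1

    def pop(self):
        carry = self._decmod()            # decmod lutsp,#15 wc
        value = self.lut[self.lutsp]      # rdlut value,lutsp
        if carry:
            self.stack_underflow += 1     # if_c add stack_underflow,#1
        return value

if __name__ == "__main__":
    s = LutStack()
    for i in range(16):
        s.push(i)
    print("overflow count after 16 pushes:", s.stack_overflow)
    s2 = LutStack()
    s2.pop()
    print("underflow count after pop on empty:", s2.stack_underflow)
```

In this model the overflow counter ticks on the 16th push, the one that fills the last slot and wraps the pointer, and the underflow counter ticks on a pop from an empty stack.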

mindrobots · 2016-05-10 01:48

Show off!!

rjo__ · 2016-05-10 01:55

mindrobots wrote: »

Show off!!

He is Australian... what do you expect?
