Hub Exec vs Cog Exec Timings... some figures — Parallax Forums

Hub Exec vs Cog Exec Timings... some figures

78rpm Posts: 264
edited 2015-10-25 02:25 in Propeller 2
I modified the 32 bit by 32 bit division I posted recently to record minimum and maximum execution times within the code. I then timed the function when it was executing in Hub and in a Cog. The timings are now in. The function accesses Cog registers only except for the stack pushes in the Hub, but they were outside the timing control.

The timings are the clock count returned by subtracting two getcnt instructions, which I think are system clocks, or are they system clock / 2? If the former, at 50MHz the clock period is 20ns.

Firstly we have the timings for Hub Exec:
Minimum time for division           = 1573
Minimum time for division numerator = 1000000000
Minimum time for division divisor   = 50000000
Maximum time for division           = 1589
Maximum time for division numerator = 2328
Maximum time for division divisor   = 1000
Minimum time = 1573 clocks * 20ns = 31.460us
Maximum time = 1589 clocks * 20ns = 31.780us
This gives between approximately 31,466 (maximum time) and 31,786 (minimum time) divisions / second

Secondly we have the timings for Cog Exec:
Minimum time for division           = 821
Minimum time for division numerator = 2328
Minimum time for division divisor   = 1000
Maximum time for division           = 1573
Maximum time for division numerator = 1000000000
Maximum time for division divisor   = 50000000
Minimum time = 821 clocks * 20ns = 16.420us
Maximum time = 1573 clocks * 20ns = 31.460us
This gives a max case of approx 31,786 divisions / second
This gives a min case of approx 60,901 divisions / second
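
The conversions above can be checked with a minimal Python sketch. It is not part of the original test code, and it assumes that getcnt counts full system clocks at 50MHz (20ns per count), which is the open question raised earlier. The function names are made up for illustration.

```python
# Assumption: getcnt ticks once per system clock at 50MHz.
CLOCK_HZ = 50_000_000
NS_PER_CLOCK = 1_000_000_000 // CLOCK_HZ  # 20ns per clock

def elapsed_ns(clocks):
    """Convert a getcnt difference to nanoseconds."""
    return clocks * NS_PER_CLOCK

def divisions_per_second(clocks):
    """Whole divisions that fit in one second at this cycle count."""
    return 1_000_000_000 // elapsed_ns(clocks)

# Cycle counts reported in the post.
for label, clocks in [("Hub min", 1573), ("Hub max", 1589),
                      ("Cog min", 821), ("Cog max", 1573)]:
    print(label, clocks, "clocks =", elapsed_ns(clocks) / 1000, "us,",
          divisions_per_second(clocks), "divisions / second")
```

If getcnt turns out to count system clock / 2, every time doubles and every rate halves, but the Hub to Cog ratio is unchanged.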

Now, you may have noticed that the min and max values are for the same pair of numerator / divisor, but they are swapped between Hub and Cog Exec. I think it likely that Hub is heavily influenced by refilling the instruction fifo / stream.

As the code only accesses Cog registers, I think we can say as a guideline in this instance that the % speed of Hub wrt Cog is 821 / 1580ish = 51.96%, so roughly 50% or half speed.

Comments

  • interesting.

    Enjoy!

    Mike
  • evanh Posts: 16,040
    edited 2015-10-25 03:26
    Going by your conclusion of 2:1 difference between HubExec and CogExec I'll take a stab and say the test loop is about eight instructions long - Based on an average FIFO reload time of 16 clocks.

    Also, I don't understand the numbers. What do they all mean? Why are some large round numbers? And what's with the 2:1 variance in min to max of CogExec?
  • evanh Posts: 16,040
    edited 2015-10-25 03:28
    Of course the FIFO average delay might be lower, in which case the test loop will also be proportionally shorter.
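
The estimate in the two comments above can be sketched in Python. This is only a rough model under stated assumptions: one FIFO reload per loop pass in HubExec, and 2 clocks per instruction, which is implied by the figures in this thread but is not stated in these comments. The function name is hypothetical.

```python
def loop_length_for_ratio(ratio, reload_clocks, clocks_per_instr=2):
    """Estimate the loop length (in instructions) that gives the observed
    HubExec/CogExec time ratio when each pass pays one FIFO reload.

    Model: hub_time = n * c + reload, cog_time = n * c,
    so ratio = 1 + reload / (n * c) and n = reload / ((ratio - 1) * c).
    """
    return reload_clocks / ((ratio - 1) * clocks_per_instr)

print(loop_length_for_ratio(2.0, 16))  # 16-clock average reload
print(loop_length_for_ratio(2.0, 8))   # lower average reload, shorter loop
```

With a 2:1 ratio the loop body costs as many clocks as one reload, so halving the assumed reload time halves the estimated loop length.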
  • It depends on the code but in a similar test I can get 2M baud comms in cogexec and 460.8K baud in hub exec.
    Roughly 4:1 performance difference.

  • jmg Posts: 15,175
    78rpm wrote: »
    Minimum time for division           = 1573
    Minimum time for division numerator = 1000000000
    Minimum time for division divisor   = 50000000
    Maximum time for division           = 1589
    Maximum time for division numerator = 2328
    Maximum time for division divisor   = 1000
    
    Secondly we have the timings for Cog Exec:
    Minimum time for division           = 821
    Minimum time for division numerator = 2328
    Minimum time for division divisor   = 1000
    Maximum time for division           = 1573
    Maximum time for division numerator = 1000000000
    Maximum time for division divisor   = 50000000
    

    Are those seemingly strange numeric values proven to be the Minimum time & Maximum time for Division operation ?

  • evanh wrote: »
    Going by your conclusion of 2:1 difference between HubExec and CogExec I'll take a stab and say the test loop is about eight instructions long - Based on an average FIFO reload time of 16 clocks.

    Also, I don't understand the numbers. What do they all mean? Why are some large round numbers? And what's with the 2:1 variance in min to max of CogExec?
    There is one forward branch within a 32 count loop of 12 instructions, including djnz at the end with a backwards branch. If I take out the branch and have conditional execution on the following two instructions, that will prevent a reload of the Hub Exec FIFO at that point, leaving just the one refill at the loop end for djnz.
    divide32b32_inCog
    
    		pusha	t0
    		pusha	t1
    		pusha	t2
    		pusha	t3
    		pusha	t4
    		pusha	t5
    		pusha	t6
    
    		or	r2, r2  wz
    	if_nz	jmp	#send_dbg_div_go
    		loc	adrb, #@division_by_zero_error
    		calla	#@send_msg
    		jmp	#send_dbg_div_done
    
    send_dbg_div_go
    		pusha	r1		' save for timing
    		pusha	r2		' save for timing
    		getcnt	scratch
    
    		mov	t1, #0		' Q
    		mov	t2, #0		' R
    		mov	t3, r1		' N
    		mov	t4, ##1 << 31   ' bit mask
    		mov	t0, #32		' num bits in N
    		mov	t6, t0
    		sub	t6, #1		' range 31 to 0
    .next_loop	  
    		  shl   t2, #1		' R = R << 1
       		  mov   t5, t3		' N
    		  and	t5, t4  wz	' bit n set of N
    	if_z	  clrb  t2, #0		' clr bit 0 of R
    	if_nz	  setb  t2, #0		' set bit 0 of R
    		  cmp   t2, r2  wc, wz
    	if_b	  jmp	#.after
    		    sub	  t2, r2 ' R <- R - D
    		    setb  t1, t6	' bit 31 to 0 of Q
    .after
    		  shr	t4, #1 		' bit mask
    		  sub   t6, #1		' adjust bit number 31 to 0
    		djnz	t0, #.next_loop
    	  
    send_dbg_div_done
    
    		popa	r2		' entry div
    		popa	r1		' entry num
    
    		getcnt	t7
    		sub	t7, scratch
    		cmp	t7, div32b_min_time  wc, wz
    	if_ae	jmp	#.max_time
    		mov	div32b_min_time, t7
    		mov	div32b_min_num, r1
    		mov	div32b_min_div, r2
    .max_time	cmp	t7, div32b_max_time  wc, wz
    	if_be	jmp	#.all_done
    		mov	div32b_max_time, t7
    		mov	div32b_max_num, r1
    		mov	div32b_max_div, r2
    .all_done
    		mov	r1, t1		' return quotient result in r1
    		mov	r2, t2		' return remainder result in r2
    
    		popa	t6
    		popa	t5
    		popa	t4
    		popa	t3
    		popa	t2
    		popa	t1
    		popa	t0
    		reta
    ' divide32b32_inCog	end
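
For readers who do not follow the assembly, the loop above is a bit-by-bit shift and subtract division. Here is a minimal Python model of the same steps. It is a sketch for checking results only and says nothing about cycle counts; the function name is made up, and the 32-bit masking is there to mimic the register width.

```python
MASK32 = 0xFFFFFFFF

def divide32(numerator, divisor):
    """Unsigned 32-bit shift and subtract division.

    Returns (quotient, remainder), mirroring t1 (Q) and t2 (R) in the listing.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    q = 0  # t1
    r = 0  # t2
    for bit in range(31, -1, -1):        # t6 counts 31 down to 0
        r = (r << 1) & MASK32            # R = R << 1
        if numerator & (1 << bit):       # test bit n of N with the mask in t4
            r |= 1                       # copy it into bit 0 of R
        if r >= divisor:                 # the if_b jump skips this part
            r -= divisor                 # R <- R - D
            q |= 1 << bit                # set bit n of Q
    return q, r

print(divide32(1000000000, 50000000))
print(divide32(2328, 1000))
```

The subtract step runs only for the bit positions where the quotient has a 1, so the number of taken jumps in the listing depends on the operands.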
    
    

    The large round numbers are an illusion; in binary they are not round. It's the method I used for converting from binary to ASCII decimal: dividing down from the largest power of ten for the number, which is checked beforehand, and then it's a simple index table lookup from there. So divide by 10_000_000, say, to get the quotient in range 1 to 9, then multiply that by the divisor again to subtract from the value of the number you are converting.

    The numbers are the minimum and maximum elapsed times for a division, together with the numerator and divisor that produced them. As different numerator and divisor values take the branch at different rates, the fifo will refill a different number of times. When an elapsed time is lower than the currently stored minimum, it is stored as the new minimum, and the numerator and divisor for that minimum event are stored with it.

    I'm not too sure why the variation in min and max in Cog Exec, except perhaps a pipeline flush on the branches, as all the variables are in Cog RAM. Interrupts are disabled, so is debug I hope. I'll have to think on that a while, or perhaps someone else will have some thoughts.

    Hope this helps.
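
The binary to ASCII decimal method described in this comment can be sketched in Python as follows. This is only an illustration of the divide-down approach, not the actual routine, which is not shown in the thread; the names are hypothetical.

```python
DIGIT_TABLE = "0123456789"  # index table lookup for each digit

def to_decimal(value):
    """Convert a non-negative integer to decimal text by dividing down
    from the largest power of ten that fits in the value."""
    if value == 0:
        return "0"
    # Find the largest power of ten not exceeding the value (checked beforehand).
    power = 1
    while power * 10 <= value:
        power *= 10
    digits = []
    while power > 0:
        digit = value // power          # quotient is a single digit
        digits.append(DIGIT_TABLE[digit])
        value -= digit * power          # multiply back and subtract
        power //= 10
    return "".join(digits)

print(to_decimal(2328))
```

Every digit costs one division by a power of ten, which would explain why round decimal values show up as operands in the timing records.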
  • jmg wrote: »
    78rpm wrote: »
    Minimum time for division           = 1573
    Minimum time for division numerator = 1000000000
    Minimum time for division divisor   = 50000000
    Maximum time for division           = 1589
    Maximum time for division numerator = 2328
    Maximum time for division divisor   = 1000
    
    Secondly we have the timings for Cog Exec:
    Minimum time for division           = 821
    Minimum time for division numerator = 2328
    Minimum time for division divisor   = 1000
    Maximum time for division           = 1573
    Maximum time for division numerator = 1000000000
    Maximum time for division divisor   = 50000000
    

    Are those seemingly strange numeric values proven to be the Minimum time & Maximum time for Division operation ?

    No, they are the min and max values which the divide function has processed. They happen to be two large numeric values in max case, and two in the min case. I think the true maximum time may be all bits / all bits.
  • evanh Posts: 16,040
    Definitely want to remove that JMP #.after line. I think all Prop2 branches take two instruction slots minimum, even in CogExec.

    I note you capture the timing inside the PUSHA's but outside the POPA's. That pair of POPA's will be the culprit for variance in the CogExec results.
  • evanh wrote: »
    Definitely want to remove that JMP #.after line. I think all Prop2 branches take two instruction slots minimum, even in CogExec.

    I note you capture the timing inside the PUSHA's but outside the POPA's. That pair of POPA's will be the culprit for variance in the CogExec results.
    I've done that now, but there is something odd I've noticed which needs more investigation. It's reporting 206 clocks, but with a waitx bit_time in there it doesn't change. Um....
  • I've modified the Cog (not Hub Exec yet) code and as there are no branches it's a constant cycle count, unless a divide by 0 is detected. As it is a constant cycle count, only the first divide is recorded in both the min and max variables.

    The code needs a bit more work to sort out numerator and divisor as -ve numbers. I'll do that next. First, here's the new code:
    divide32b32_inCog
    
    		getcnt	scratch
    		pusha	t0
    		pusha	t1
    		pusha	t2
    		pusha	t3
    		pusha	t4
    		pusha	t5
    		pusha	t6
    
    		or	r2, r2  wz
    	if_nz	jmp	#send_dbg_div_go
    		loc	adrb, #@division_by_zero_error
    		calla	#@send_msg
    		jmp	#send_dbg_div_done
    
    send_dbg_div_go
    		pusha	r1		' save for timing
    		pusha	r2		' save for timing
    
    		mov	t1, #0		' Q
    		mov	t2, #0		' R
    		mov	t3, r1		' N
    		mov	t4, ##1 << 31   ' bit mask
    		mov	t6, #31
    .next_loop	  
    		  shl   t2, #1		' R = R << 1
       		  mov   t5, t3		' N
    		  and	t5, t4  wz	' bit n set of N
    	if_z	  clrb  t2, #0		' clr bit 0 of R
    	if_nz	  setb  t2, #0		' set bit 0 of R
    		  cmp   t2, r2  wc, wz
    	if_ae	    sub	  t2, r2 ' R <- R - D
    	if_ae	    setb  t1, t6	' bit 31 to 0 of Q
    		  shr	t4, #1 		' bit mask
    		djns	t6, #.next_loop
    	  
    
    		mov	r1, t1		' return quotient result in r1
    		mov	r2, t2		' return remainder result in r2
    
    send_dbg_div_done
    
    		popa	t1		' entry div
    		popa	t0		' entry num
    
    		popa	t6
    		popa	t5
    		popa	t4
    		popa	t3
    		popa	t2
    		getcnt	t7
    
    		sub	t7, scratch
    		cmp	t7, div32b_min_time  wc, wz
    	if_b	mov	div32b_min_time, t7
    	if_b	mov	div32b_min_num, t0
    	if_b	mov	div32b_min_div, t1
    		cmp	t7, div32b_max_time  wc, wz
    	if_a	mov	div32b_max_time, t7
    	if_a	mov	div32b_max_num, t0
    	if_a	mov	div32b_max_div, t1
    .all_done
    		mov 	r15, div32b_max_time
    
    		popa	t1
    		popa	t0
    		reta
    ' divide32b32_inCog	end
    
    Now those strange numbers:
    Minimum time for division           = 983
    Minimum time for division numerator = 2184
    Minimum time for division divisor   = 1000
    Maximum time for division           = 983
    Maximum time for division numerator = 2184
    Maximum time for division divisor   = 1000
    
    Constant cycle count of 983 * 20ns = 19.66us
    Or 50,864 divides per second
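
The min and max bookkeeping at the end of the routine can be modelled in Python to show why a constant cycle count leaves only the first divide recorded. This is a sketch: the initial values (a large minimum and a zero maximum) are assumptions, since the initialisation is not shown in the listing, and the class name is made up.

```python
class MinMaxTimer:
    """Model of the div32b_min_* / div32b_max_* bookkeeping."""

    def __init__(self):
        self.min_time = 0xFFFFFFFF  # assumed initial value
        self.max_time = 0           # assumed initial value
        self.min_args = None
        self.max_args = None

    def record(self, elapsed, num, div):
        # if_b: update only when strictly below the stored minimum
        if elapsed < self.min_time:
            self.min_time = elapsed
            self.min_args = (num, div)
        # if_a: update only when strictly above the stored maximum
        if elapsed > self.max_time:
            self.max_time = elapsed
            self.max_args = (num, div)

timer = MinMaxTimer()
timer.record(983, 2184, 1000)   # first divide sets both records
timer.record(983, 77, 10)       # same cycle count, so nothing changes
print(timer.min_args, timer.max_args)
```

Because both comparisons are strict, a later divide with an equal cycle count never replaces the stored operands.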
  • evanh Posts: 16,040
    That is impressively consistent given you are now including lots of hub accesses in the measurement. A rough sum up of the same nets me 262 + (32 * 22) = 966 clocks, the extra in the measured result will be from poor, but consistent, hub misalignments ... that's good news for deterministic behaviour.
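
As a quick check of the sum above against the measured figure, here is a short Python sketch. The split into a fixed 262 clocks plus 22 clocks per loop pass is taken from the comment as given; the variable names are only for illustration.

```python
FIXED_OVERHEAD_CLOCKS = 262   # everything outside the loop, per the estimate
LOOP_PASSES = 32
CLOCKS_PER_PASS = 22
MEASURED_CLOCKS = 983         # constant cycle count reported for Cog Exec

estimated = FIXED_OVERHEAD_CLOCKS + LOOP_PASSES * CLOCKS_PER_PASS
unexplained = MEASURED_CLOCKS - estimated

print("estimated:", estimated, "clocks")
print("left over for hub alignment:", unexplained, "clocks")
```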
  • evanh wrote: »
    That is impressively consistent given you are now including lots of hub accesses in the measurement. A rough sum up of the same nets me 262 + (32 * 22) = 966 clocks, the extra in the measured result will be from poor, but consistent, hub misalignments ... that's good news for deterministic behaviour.

    The pusha and popa are to Hub RAM. However..
    cmp	t7, div32b_min_time  wc, wz
    	if_b	mov	div32b_min_time, t7
    	if_b	mov	div32b_min_num, t0
    	if_b	mov	div32b_min_div, t1
    		cmp	t7, div32b_max_time  wc, wz
    	if_a	mov	div32b_max_time, t7
    	if_a	mov	div32b_max_num, t0
    	if_a	mov	div32b_max_div, t1
    
    ... are all in Cog RAM

    So the non-deterministic part would be synchronising to the Hub for the first pusha and the first popa at entry and exit.

    A speed up could be achieved by using a small stack in Cog RAM. I do not recall the depth of the small Cog stack FIFO, whether 4, 8 or 16 depth.

  • cgracey Posts: 14,208
    The CALL/RET lifo stack is 8 levels deep.
  • cgracey wrote: »
    The CALL/RET lifo stack is 8 levels deep.
    Thank you. I think I will modify and test for a speed comparison.
