
Boosting SPIN performance

ozpropdev Posts: 2,792
edited 2014-10-17 05:56 in Propeller 1
Hi All

I've been looking at the SPIN interpreter to see if any performance gains can be made with some custom instructions.
One area that looked like a good candidate was the handler jump decoding, which takes 10 instructions on every pass through the main loop. I replaced it with a single custom instruction and achieved a 5% performance gain.

Here is the section of code in the interpreter I speak of. It looks up the handler's address byte in the packed jumps table and patches it into the jump at getret.
                        mov     t1,op                   'jump to handler
                        ror     t1,#4                   'op[7:4] to low nibble
                        add     t1,#jumps               'cog address of table long
                        movs    :jump,t1                'patch the table read below
                        rol     t1,#2                   'op[3:2] to low bits
                        shl     t1,#3                   'shift count = byte offset * 8
:jump                   mov     t2,jumps                'read long from packed table
                        shr     t2,t1                   'shift selected byte down
                        and     t2,#$FF                 'isolate handler address
                        movs    getret,t2               'patch the jump at getret

'and also

jumps                   byte    j0,j1,j2,j3             '16 handler addresses, packed 4 per long
                        byte    j4,j5,j6,j7
                        byte    j8,j9,jA,jB
                        byte    jC,jD,jE,jF

which was replaced with
			ones	getret,op

Nine spacer NOPs were inserted before the label "LOOP" to maintain the original handler addresses.
These spacer NOPs and the 4 longs used by the table will be removed in Part #2 to shrink the interpreter a little.


And here is the SPIN program I based my results on.
    start := cnt

    '200 million cycles approx.  on standard P1 SPIN
    '190 million cycles approx.  on modified P1V SPIN
    
    repeat x from 0 to 117924
        q := byte[x]
        z++

    finish := cnt - start


And here is the Verilog for the instruction that was added.
assign r			= i[5] 					? add_r
					: i[4] 					? log_r
					: i[3] 					? rot_r
					: i[2] 					? special   // added this line to original code
					: run || ~&p[8:4]		? bus_q
					: 32'b0;		// write 0's to last 16 registers during load;


wire [31:0] special = i[1:0] == 2'b11 ? {24'h5c7c00,spin} : 32'b0;  //ones
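// Note: 24'h5c7c00 with a handler address in the low byte is the P1 encoding of
// "jmp #address", so 'ones getret,op' writes a complete jmp-to-handler long into
// getret in one instruction (assuming the long at getret is a patchable jmp).
// The constants in the table below are the cog addresses of handlers j0..jF.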

wire [7:0] spin = s[5:2] == 4'h0 ? 8'h1e
						:s[5:2] == 4'h1 ? 8'h2b
						:s[5:2] == 4'h2 ? 8'h3e
						:s[5:2] == 4'h3 ? 8'h46
						:s[5:2] == 4'h4 ? 8'h4a
						:s[5:2] == 4'h5 ? 8'h62
						:s[5:2] == 4'h6 ? 8'h73
						:s[5:2] == 4'h7 ? 8'h73
						:s[5:2] == 4'h8 ? 8'h91
						:s[5:2] == 4'h9 ? 8'h9a
						:s[5:2] == 4'ha ? 8'ha1
						:s[5:2] == 4'hb ? 8'ha1
						:s[5:2] == 4'hc ? 8'hb4
						:s[5:2] == 4'hd ? 8'hc4
						:s[5:2] == 4'he ? 8'hcf
						:s[5:2] == 4'hf ? 8'hd6
						:0;


I've included an unscrambled high_rom hex file with the modified SPIN interpreter.
Cheers
Brian

Comments

  • pik33 Posts: 2,366
    edited 2014-10-12 08:20
    So there may be an applicable algorithm here, such as:

    - analyse the interpreter
    - see what new asm instructions can speed it
    - implement these instructions
    - rewrite the interpreter

    There is a mul instruction available and working, so "*" should be rewritten first - this should add a lot of speed everywhere it is used.
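    As a purely hypothetical sketch (operand names d and s follow the r-mux code in the first post; the instruction-select wiring is omitted), the multiply result path could look like this:

        // hypothetical sketch: 32x32 multiply result path for a MUL custom
        // instruction, in the style of the r-mux shown in the first post
        wire [63:0] mul_full = d * s;           // full 64-bit product
        wire [31:0] mul_r    = mul_full[31:0];  // low long returned as the result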
  • Cluso99 Posts: 18,069
    edited 2014-10-12 12:53
    ozpropdev,
    See my faster Spin interpreter. By using a hub lookup table for bytecode decoding, I was able to unroll a lot of the interpreter and improve its performance by ~25%. Placing the lookup table in extended cog RAM would improve it further.
    The next biggest saving would be putting the stack in extended cog RAM, as almost every bytecode does a pop and push.
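    In hardware terms - purely a hypothetical sketch (names and widths are assumptions) - a full 256-entry bytecode-to-handler table could replace the 16-entry mux in the first post:

        // hypothetical sketch: 256-entry bytecode dispatch table in a small RAM,
        // removing the nibble-decode entirely
        reg [8:0] vector [0:255];                  // 9-bit cog handler addresses
        initial $readmemh("vector.hex", vector);   // hypothetical table contents
        wire [8:0] handler = vector[op_byte];      // op_byte = fetched bytecode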
  • jmg Posts: 15,173
    edited 2014-10-12 15:01
    Cluso99 wrote: »
    ozpropdev,
    See my faster Spin interpreter. By using a hub lookup table for bytecode decoding, I was able to unroll a lot of the interpreter and improve its performance by ~25%. Placing the lookup table in extended cog RAM would improve it further.
    The next biggest saving would be putting the stack in extended cog RAM, as almost every bytecode does a pop and push.

    One idea here would be for the extended RAM to be single-port, which has a much lower die-area cost.
    (Less important on an FPGA, but the FPGA designs should still keep an eye on what packs best in a die.)
    Opcodes like stack and table access do not need multi-port RAM, and other micros have indirect memory areas like this. A minimal sketch follows.
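    For illustration, here is a minimal single-port synchronous RAM of the kind FPGA tools infer as block RAM (module name and sizes are assumptions):

        // hypothetical sketch: single-port stack/table RAM, one access per clock
        module stack_ram(
            input             clk,
            input             we,
            input      [7:0]  addr,
            input      [31:0] din,
            output reg [31:0] dout
        );
            reg [31:0] mem [0:255];             // 256 longs of stack/table space
            always @(posedge clk) begin
                if (we) mem[addr] <= din;       // write port
                dout <= mem[addr];              // registered read (old data on write)
            end
        endmodule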
  • ozpropdev Posts: 2,792
    edited 2014-10-12 18:26
    Cluso99 wrote: »
    ozpropdev,
    See my faster Spin interpreter. By using a hub lookup table for bytecode decoding, I was able to unroll a lot of the interpreter and improve its performance by ~25%. Placing the lookup table in extended cog RAM would improve it further.
    The next biggest saving would be putting the stack in extended cog RAM, as almost every bytecode does a pop and push.

    Thanks Ray

    Here's the link to "SPIN Interpreter - faster".

    Had a quick look, interesting read. Nice work! :)
  • Dave Hein Posts: 6,347
    edited 2014-10-13 06:42
    I think you would get a significant increase in speed if you increased the cog RAM so that the interpreter could be unrolled. I created an unrolled interpreter for the Oct 2013 P2 FPGA image, which also used a 256-entry jump table. The interpreter required about 1.5K longs of memory, and most of it ran in the hubexec mode. I think it could be done in 1K longs without too much loss in performance.
  • mklrobo Posts: 420
    edited 2014-10-13 07:40
    Dave Hein wrote: »
    I think you would get a significant increase in speed if you increased the cog RAM so that the interpreter could be unrolled. I created an unrolled interpreter for the Oct 2013 P2 FPGA image, which also used a 256-entry jump table. The interpreter required about 1.5K longs of memory, and most of it ran in the hubexec mode. I think it could be done in 1K longs without too much loss in performance.
    :innocent: I got an idea from Dave's entry. Pardon my ignorance, but could the FPGA be programmed with one image for memory and another image for other functions? :innocent:
  • jmg Posts: 15,173
    edited 2014-10-13 15:23
    Another thought on this extended RAM: could it be made a little more common, so that rather than wasting RAM not used in cogs, it becomes more like hub RAM, but without the slow equal-for-everyone access - i.e. a different MUX scheme would be used for stack/table memory.
    Slow bytecode would be self-contained, and the new, faster unrolled bytecode would use part of this memory.
    This could then be either a smaller separate RAM, or 'peeled off' from hub memory as needed?
  • mark Posts: 252
    edited 2014-10-13 22:16
    jmg wrote: »
    Another thought on this extended RAM: could it be made a little more common, so that rather than wasting RAM not used in cogs, it becomes more like hub RAM, but without the slow equal-for-everyone access - i.e. a different MUX scheme would be used for stack/table memory.
    Slow bytecode would be self-contained, and the new, faster unrolled bytecode would use part of this memory.
    This could then be either a smaller separate RAM, or 'peeled off' from hub memory as needed?

    I was thinking of something similar for a stack machine "super cog" instruction space: segments of hub RAM in the higher address range could be broken off from the remaining hub RAM by simple switches on the address and data lines, then connected to that segment's addressing circuit and to the data lines on the other end of the memory array. I don't see why something like this couldn't be done, especially if the hub RAM gets partitioned like the P2's. Then, if necessary, an address offset translator could be added in the hub to make the memory still look contiguous to the code. A sketch of the steering idea follows.
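    Purely as a hypothetical sketch (all signal names and widths are assumptions), the switch could be a mux that steers the top segment's port to a claiming cog:

        // hypothetical sketch: steer the top hub segment to a private cog port
        wire        seg_claimed;                 // set when a cog claims the segment
        wire [13:0] hub_addr, cog_addr;          // long addresses into the segment
        wire [31:0] hub_wdata, cog_wdata;
        wire        hub_we, cog_we;
        wire [13:0] seg_addr  = seg_claimed ? cog_addr  : hub_addr;
        wire [31:0] seg_wdata = seg_claimed ? cog_wdata : hub_wdata;
        wire        seg_we    = seg_claimed ? cog_we    : hub_we;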
  • ozpropdev Posts: 2,792
    edited 2014-10-17 05:56
    Update:
    Following on from the addition of the "op-decode" instruction, I added a cached hub byte read instruction (replicating RDBYTEC in P2) and gained another 5% in overall performance.
    The average gain is now ~10%. The "mathop" area is the next area of attention. :) A rough sketch of the caching idea is below.
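    (A hypothetical sketch of the caching idea, not the actual P1V change; signal names are assumptions: keep the last hub long fetched plus its address, and serve byte reads from it on a tag hit without waiting for the hub window.)

        // hypothetical sketch: one-long read cache for RDBYTEC-style byte reads
        reg  [31:0] cache_long;                   // last long fetched from hub
        reg  [13:0] cache_tag;                    // its long address
        reg         cache_valid;
        wire        hit      = cache_valid && (byte_addr[15:2] == cache_tag);
        wire [7:0]  byte_out = cache_long >> {byte_addr[1:0], 3'b000};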