Boosting SPIN performance
ozpropdev
Hi All
I've been looking at the SPIN interpreter to see if any performance gains can be made with some custom instructions.
One area that looked like a good candidate was the handler jump decoding. This process takes 10 instructions on every pass through the interpreter loop. I managed to replace it with a single custom instruction and achieved a 5% performance gain.
Here is the section of interpreter code I'm referring to.
                mov     t1,op                   'jump to handler
                ror     t1,#4
                add     t1,#jumps
                movs    :jump,t1
                rol     t1,#2
                shl     t1,#3
:jump           mov     t2,jumps
                shr     t2,t1
                and     t2,#$FF
                movs    getret,t2

'and also

jumps           byte    j0,j1,j2,j3
                byte    j4,j5,j6,j7
                byte    j8,j9,jA,jB
                byte    jC,jD,jE,jF
which was replaced with
ones getret,op
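To make it easier to see what the single instruction has to reproduce, here is a rough Python model of the ten-instruction sequence. It is only a sketch under a few assumptions: the handler addresses are taken from the Verilog `spin` wire and treated as j0..jF, the cog address `JUMPS` is made up, and the opcode is assumed to be in the range $00..$3F (which is all a 4-long table can cover). Under those assumptions the whole sequence reduces to a 16-entry lookup on bits 5:2 of the opcode.

```python
MASK32 = 0xFFFFFFFF

def ror(value, count):
    """32-bit rotate right, as the PASM ROR instruction does."""
    count &= 31
    return ((value >> count) | (value << (32 - count))) & MASK32

def rol(value, count):
    """32-bit rotate left, as the PASM ROL instruction does."""
    count &= 31
    return ((value << count) | (value >> (32 - count))) & MASK32

# Handler addresses in the order listed by the Verilog `spin` wire
# (assumed here to correspond to j0..jF).
HANDLERS = [0x1E, 0x2B, 0x3E, 0x46, 0x4A, 0x62, 0x73, 0x73,
            0x91, 0x9A, 0xA1, 0xA1, 0xB4, 0xC4, 0xCF, 0xD6]

JUMPS = 0x100  # hypothetical cog address of the `jumps` table

def make_cog():
    """Pack the 16 table bytes into 4 little-endian longs at JUMPS."""
    cog = {}
    for i in range(4):
        b = HANDLERS[4 * i:4 * i + 4]
        cog[JUMPS + i] = b[0] | (b[1] << 8) | (b[2] << 16) | (b[3] << 24)
    return cog

def decode_pasm(op, cog):
    """Step through the original ten instructions one by one."""
    t1 = op                       # mov  t1,op
    t1 = ror(t1, 4)               # ror  t1,#4
    t1 = (t1 + JUMPS) & MASK32    # add  t1,#jumps
    src = t1 & 0x1FF              # movs :jump,t1   (9-bit source field)
    t1 = rol(t1, 2)               # rol  t1,#2
    t1 = (t1 << 3) & MASK32       # shl  t1,#3
    t2 = cog[src]                 # :jump mov t2,jumps (source patched above)
    t2 >>= (t1 & 31)              # shr  t2,t1      (low 5 bits used)
    t2 &= 0xFF                    # and  t2,#$FF
    return t2                     # movs getret,t2

def decode_direct(op):
    """The same result as a plain lookup on opcode bits 5:2."""
    return HANDLERS[(op >> 2) & 0xF]
```

Comparing `decode_pasm(op, make_cog())` with `decode_direct(op)` for every `op` below $40 is a quick way to check that the two agree.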
Nine spacer NOPs were inserted before the label "LOOP" to maintain the original handler addresses.
These spacer NOPs and the 4 LONGs used by the table will be removed in Part #2 to shrink the interpreter a little.
And here is the SPIN program I based my results on.
  start := cnt
  '200 million cycles approx. on standard P1 SPIN
  '190 million cycles approx. on modified P1V SPIN
  repeat x from 0 to 117924
    q := byte[x]
    z++
  finish := cnt - start
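For anyone who wants the numbers per loop pass, here is a small Python calculation based only on the approximate cycle counts in the comments above. The per-iteration figures are derived, not measured, so treat them as rough.

```python
# repeat x from 0 to 117924 is inclusive at both ends
ITERATIONS = 117924 + 1

STANDARD_CYCLES = 200_000_000   # approx., standard P1 SPIN
MODIFIED_CYCLES = 190_000_000   # approx., modified P1V SPIN

def cycles_per_iteration(total_cycles):
    """Average clock cycles spent on one pass of the repeat loop."""
    return total_cycles / ITERATIONS

def gain_percent(before, after):
    """Reduction in cycle count, as a percentage of the original."""
    return (before - after) / before * 100.0

if __name__ == "__main__":
    print("standard :", round(cycles_per_iteration(STANDARD_CYCLES)), "cycles/iteration")
    print("modified :", round(cycles_per_iteration(MODIFIED_CYCLES)), "cycles/iteration")
    print("gain     :", gain_percent(STANDARD_CYCLES, MODIFIED_CYCLES), "%")
```

Each pass of the loop executes several bytecodes, so the saving per pass is spread over several handler jumps.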
And here is the Verilog for the instruction that was added.
assign r            = i[5]                  ? add_r
                    : i[4]                  ? log_r
                    : i[3]                  ? rot_r
                    : i[2]                  ? special       // added this line to original code
                    : run || ~&p[8:4]       ? bus_q
                                            : 32'b0;        // write 0's to last 16 registers during load;

wire [31:0] special = i[1:0] == 2'b11 ? {24'h5c7c00,spin} : 32'b0;  //ones

wire [7:0] spin     = s[5:2] == 4'h0 ? 8'h1e
                    : s[5:2] == 4'h1 ? 8'h2b
                    : s[5:2] == 4'h2 ? 8'h3e
                    : s[5:2] == 4'h3 ? 8'h46
                    : s[5:2] == 4'h4 ? 8'h4a
                    : s[5:2] == 4'h5 ? 8'h62
                    : s[5:2] == 4'h6 ? 8'h73
                    : s[5:2] == 4'h7 ? 8'h73
                    : s[5:2] == 4'h8 ? 8'h91
                    : s[5:2] == 4'h9 ? 8'h9a
                    : s[5:2] == 4'ha ? 8'ha1
                    : s[5:2] == 4'hb ? 8'ha1
                    : s[5:2] == 4'hc ? 8'hb4
                    : s[5:2] == 4'hd ? 8'hc4
                    : s[5:2] == 4'he ? 8'hcf
                    : s[5:2] == 4'hf ? 8'hd6
                    : 0;
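A rough Python model of the `special` and `spin` wires may help when reading the Verilog. It assumes `s` is the value of the source operand (the opcode in `op`) and `i_low2` stands for `i[1:0]`; both parameter names are mine. The `fields` helper splits the 32-bit result using the P1 instruction layout (6-bit opcode, ZCRI, condition, 9-bit destination, 9-bit source). As I read that encoding, `$5C7C00xx` is a complete `jmp #$xx`, so the instruction appears to write a whole jump to the handler into `getret` rather than just patching its source field.

```python
# Handler addresses copied from the Verilog `spin` wire, indexed by s[5:2].
SPIN = [0x1E, 0x2B, 0x3E, 0x46, 0x4A, 0x62, 0x73, 0x73,
        0x91, 0x9A, 0xA1, 0xA1, 0xB4, 0xC4, 0xCF, 0xD6]

def special(i_low2, s):
    """Model of the `special` wire: {24'h5c7c00, spin} when i[1:0] == 2'b11."""
    if i_low2 == 0b11:
        return (0x5C7C00 << 8) | SPIN[(s >> 2) & 0xF]
    return 0

def fields(word):
    """Split a 32-bit P1 instruction word into its encoding fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # 0b010111 is JMPRET/JMP
        "zcri":   (word >> 22) & 0xF,    # 0b0001: immediate source, no result write
        "cond":   (word >> 18) & 0xF,    # 0b1111: always execute
        "dest":   (word >> 9) & 0x1FF,
        "src":    word & 0x1FF,          # handler address
    }

if __name__ == "__main__":
    word = special(0b11, 0x3F)
    print(hex(word), fields(word))
```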
I've included an unscrambled high_rom hex file with the modified SPIN interpreter.
Cheers
Brian
Comments
- analyse the interpreter
- see what new asm instructions can speed it
- implement these instructions
- rewrite the interpreter
There is a mul instruction available and working, so "*" should be rewritten first; this should add a lot of speed everywhere it is used.
See my faster SPIN interpreter. By using a hub lookup table for bytecode decoding, I was able to unravel a lot of the interpreter and improve its performance by ~25%. Placing the lookup table in extended cog RAM would improve it further.
The next biggest saving would be putting the stack in extended cog RAM, as almost every bytecode does a pop and push.
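As a generic illustration of the lookup-table idea (not the actual table format used in the faster interpreter, which isn't shown here), the sketch below precomputes one entry per possible bytecode so the decode becomes a single indexed read. The handler names and the opcode grouping in `slow_decode` are hypothetical and only there to give the table something to hold.

```python
def slow_decode(op):
    """Hypothetical decode done with compares and shifts on every bytecode."""
    if op < 0x40:
        return "lower_op_group_%X" % (op >> 2)
    if op < 0x80:
        return "variable_op"
    if op < 0xE0:
        return "memory_op"
    return "math_op"

# Built once: one entry for each of the 256 possible bytecodes.
DECODE_TABLE = [slow_decode(op) for op in range(256)]

def fast_decode(op):
    """Table-driven decode: a single indexed read per bytecode."""
    return DECODE_TABLE[op & 0xFF]

if __name__ == "__main__":
    for op in (0x00, 0x3F, 0x64, 0x80, 0xEC):
        print(hex(op), fast_decode(op))
```

The trade-off is memory for speed: the table costs 256 entries wherever it lives, which is why its location (hub versus extended cog RAM) matters.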
One idea here would be for the extended RAM to be single port, which costs much less die area.
(Less important on an FPGA, but the FPGA designs should still have an eye to what packs best in a die.)
Opcodes like Stack and Table access do not need multi-port RAM, and other micros have indirect memory areas like this.
Thanks Ray
Here's the link to "SPIN Interpreter - faster"
Had a quick look, interesting read. Nice work!
Slow-ByteCode would be self contained and the new faster, unrolled ByteCode would use part of this memory.
This could then be either a smaller separate RAM, or 'peeled off' from HUB memory as needed?
I was thinking of something similar for a stack machine "super cog" instruction space. Segments of hub RAM in the higher address range could be broken off from the remaining hub RAM by simple switches on the address and data lines, and then connected to the addressing circuit for that segment and to data lines on the other end of the memory array. I don't see why something like this couldn't be done, especially if the hub RAM gets partitioned like the P2. Then, if necessary, an address offset translator could be added in the hub so the memory still looks contiguous to the code.
Following on from the addition of an "op-decode" instruction, I added a cached hub byte read instruction (replicating RDBYTEC in P2) and gained another 5% overall performance.
Average gain is ~10% now. I'm looking at the "mathop" area as the next area of attention.
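To show why a cached byte read helps with sequential fetches, here is a minimal Python sketch of the general idea. It is not the P1V implementation: the class name, the 16-byte line size and the fetch counter are all assumptions for illustration. A hub access is only needed when the requested byte falls outside the line currently held.

```python
class CachedByteReader:
    """Byte reader that keeps one aligned block of hub memory cached."""

    def __init__(self, hub, line_size=16):
        self.hub = hub                # bytes-like model of hub RAM
        self.line_size = line_size    # assumed cache line size in bytes
        self.line_base = None         # hub address of the cached line
        self.line = b""
        self.hub_fetches = 0          # number of actual hub accesses

    def read_byte(self, addr):
        base = addr - (addr % self.line_size)
        if base != self.line_base:
            # Miss: fetch the whole aligned line from hub memory.
            self.line = self.hub[base:base + self.line_size]
            self.line_base = base
            self.hub_fetches += 1
        # Hit (or freshly filled line): serve the byte from the cache.
        return self.line[addr - base]

if __name__ == "__main__":
    hub = bytes(range(256))
    reader = CachedByteReader(hub)
    data = [reader.read_byte(a) for a in range(64)]
    print(reader.hub_fetches, "hub fetches for", len(data), "byte reads")
```

Bytecode is mostly fetched from consecutive addresses, so most reads would hit the cached line; jumps and calls are the cases that force a new fetch.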