Fast Bytecode Interpreter

kwinn · 2017-03-22 19:00

jmg wrote: »

kwinn wrote: »

I like the idea but really hope the IDE will highlight the instructions that are executed when the cursor is on the skip instruction.

I'm unclear - did you mean highlight many lines at once, ahead of PC (which would be rare in a debugger)
or did you mean step and highlight those lines, as they execute.
This second mode is how most debuggers I use work, they skip to the next active opcode and execute that on step.

I meant highlight many lines in the source code when the cursor is on the bit mask of the skip instruction, although after thinking about it some more that may be hard to do. Being able to see the lines that are executed for each bit mask as we are coding would be a big help though.

Stepping and highlighting as they execute while debugging may be the way to go.

Rayman · 2017-03-22 19:24

I was just thinking about using this for my mixed 2bpp and 4bpp video driver...
Maybe it can help me roll up the loop...

There's another thing that would be useful...
Could you make it so that IJZ worked with, say, SETQ to make so that you can increment from 0 up to some value that you put in Q?

If Q is some kind of internal cog register, maybe that's possible?

I take that back, wouldn't help now that I think more about it...

Heater. · 2017-03-22 19:26

Given that the bits of the SKIP instruction may be determined at run time. Changing every time it is used. It is impossible for the IDE to highlight what is to be executed or not as you edit code.

It might be possible for a debugger. At run time the debugger knows what the bits in the SKIP are and can show you the source with the instructions to be executed highlighted.

All sounds like a distant dream to me.

Rayman · 2017-03-22 19:36

I see there are spin methods listed in the instruction spreadsheet now...

jmg · 2017-03-22 20:00

ersmith wrote: »

For example, IIRC the original Tachyon interpreter executes bytecodes very quickly because each bytecode is basically the address of the routine implementing it. The outer loop for such an interpreter is extremely small and fast, and skip is unlikely to be of much help.

Yes, IIRC Peter changed to a 16b-code in V4 to get more speed, and it would be interesting to see what his inner loops code like on the newest P2 .
It is a good idea to test this on more than one 'byte-code' engine.
eg I see in the Lua thread, that has a 6b instruction choice.

jmg · 2017-03-22 20:09

cgracey wrote: »

It's 8x faster than just cancelling instructions. And it only even happens if there are more than seven 1's in a row. Most code can be jiggered around so that it never even happens.

Can you give timing tables for Added-reload-time vs skip counts and alignments ?
For each of COG.LUT.HUB exec
I think here you say any executed opcode takes time, and that gives the Skip engine time to reload in parallel if it needs to ?
If someone does skip 25 opcodes, that needs reloads at 7-8, 15-16, 23-24 boundaries, which is still 8x faster.

Seems like skip 'works like skip' for HUBexec, but for GOCexec it is better called BITEXEC as (mostly) the skipped lines do not count, only the executed lines are in the code-cycle basket.

Sounds like P2 is going to need a good simulator

I wonder how practical it is to use the AVR approach, which was to build the Sim-engine from the Verilog ?
Comes in slower than hand coded, but a whole heap more accurate.
As the complexity of P2 climbs, it starts to look like the only solution.

jmg · 2017-03-22 20:11

Rayman wrote: »

I was just thinking about using this for my mixed 2bpp and 4bpp video driver...
Maybe it can help me roll up the loop...

Good idea, the more use cases the better, plus it will help test all the combinations (like the _ret_ Chip uncovered)

jmg · 2017-03-22 20:30

Cluso99 wrote: »

...
Therefore, what about 1 new instruction "execute p" as follows...

loop            rfbyte  p               'get program bytecode                   5-18
                execute p                                                       2
                                                                               ------
                                                                    overhead = 7-20

pipeline clocks for "execute p" would become:
1:      rd "execute p" instruction 
2:      rd lut address "p" into internal 32bit register "i" (maybe same as SETQ2 register??)
3:      forward i[8:0] to PC and cog address so that this clock fetches the next instruction (ie the _RET_ goto)          
        load internal skip register with i[31:9] as skip[22:0] & set skip valid
4:
5:      optional: wr i[31:0] to cog register "p"
6:

By directly forwarding "i" (= contents of lut "p")into the return address and skip register, the push/pop is not required,
meaning neither is altd i and shr i,#9.
The "execute p" instruction effectively performs the following in 2 clocks (saving 8 clocks every bytecode)...
                rdlut   i,p             'lookup long in lut                     2
                altd    i               'get snippet address from 9 LSBs        2
                push    #0              'push snippet address                   2
                shr     i,#9            'shift right to get 23-bit skip pattern 2
        _ret_   skip    i               'set skip pattern and execute snippet   2
                                                                               ------
                                                                               10

This would save 8 clocks overhead for every bytecode (down from 15-28).

I'm not sure it can do all of that in 2 clocks, but the basic idea is good :
Have an opcode-split that extracts some index/mask combination, and then bitexec a subset of instructions.
eg The Lua wordcode has a 6 bit index, from the other thread.

jmg · 2017-03-22 20:35

Seairth wrote: »

@cgracey, what happens if there's an active skip mask inside of a rep block? Will the rep still work correctly?

.. and is that correct for all modes of COG.LUT.HUB ?
Seems REP in hubexec should be ok, as the skips are true skips (aka NOPs==) that need 1 clock/32b.
However, REP+Skip in COG... ?

potatohead · 2017-03-22 23:08

Should work in both. Pretty sure the rep circuit just counts. If the PC is incremented to skip instructions, it's likely the rep circuit won't even know.

jmg · 2017-03-22 23:18

potatohead wrote: »

Should work in both. Pretty sure the rep circuit just counts. If the PC is incremented to skip instructions, it's likely the rep circuit won't even know.

If the REP just counts, then it will change action between HUB and COG skips. (breaks code portability, which till now, has been an important cornerstone of P2)
A count-only REP would include NOP-skips in HUB mode, but mostly exclude them in COG.
In cases where a cycle is needed to load the next 8 bits, REP would include that, making REP a challenge to manage.
Chip needs to clarify what Skip can work with, and what it cannot.

potatohead · 2017-03-22 23:26

It's not gonna do that. I'm sure the cancel logic can also signal rep to hold the count, in HUB execute mode or correct the count, which ever makes sense for the HUB code skipped to be nops.

Not doing that seems like it would count as a bug, and when we get an image, testing skip against a few cases can get things sorted out.

Rep holds instruction to execute count and iteration value, not an address. Should be possible to have it count executed instructions.

Seems to me, it working that way would make sense either way skip gets used: rep then skip or skip then rep.

Inside a rep, skip would be evaluated each time. Outside a rep, skip would determine whether a rep gets activated.

In both cases, an active rep would repeat X not skipped instructions.

jmg · 2017-03-22 23:44

potatohead wrote: »

It's not gonna do that. I'm sure the cancel logic can also signal rep to hold the count, in HUB execute mode or correct the count, which ever makes sense.

Hopefully, yes, but your original claim was "the rep circuit just counts"
Chip will let us know how REP actually interacts with SKIP.

potatohead · 2017-03-22 23:50

He will.

Rep does just count. We got that info some time back.

And your query was in COG too.

Cluso99 · 2017-03-23 01:48

jmg wrote: »

Cluso99 wrote: »

...
Therefore, what about 1 new instruction "execute p" as follows...

loop            rfbyte  p               'get program bytecode                   5-18
                execute p                                                       2
                                                                               ------
                                                                    overhead = 7-20

pipeline clocks for "execute p" would become:
1:      rd "execute p" instruction 
2:      rd lut address "p" into internal 32bit register "i" (maybe same as SETQ2 register??)
3:      forward i[8:0] to PC and cog address so that this clock fetches the next instruction (ie the _RET_ goto)          
        load internal skip register with i[31:9] as skip[22:0] & set skip valid
4:
5:      optional: wr i[31:0] to cog register "p"
6:

By directly forwarding "i" (= contents of lut "p")into the return address and skip register, the push/pop is not required,
meaning neither is altd i and shr i,#9.
The "execute p" instruction effectively performs the following in 2 clocks (saving 8 clocks every bytecode)...
                rdlut   i,p             'lookup long in lut                     2
                altd    i               'get snippet address from 9 LSBs        2
                push    #0              'push snippet address                   2
                shr     i,#9            'shift right to get 23-bit skip pattern 2
        _ret_   skip    i               'set skip pattern and execute snippet   2
                                                                               ------
                                                                               10

This would save 8 clocks overhead for every bytecode (down from 15-28).

I'm not sure it can do all of that in 2 clocks, but the basic idea is good :
Have an opcode-split that extracts some index/mask combination, and then bitexec a subset of instructions.
eg The Lua wordcode has a 6 bit index, from the other thread.

Fairly sure this could work in 2 clocks (ie 6pipeline clocks).

There is no need to use a push/ret sequence which also wastes a valuable stack slot in this scenario.
Therefore the altd i instruction is no longer required.
There is no need for the shift as this can select the appropriate bits in verilog by muxes.

So what we are actually doing is..
fetching D=LUT[p] in stage 2
setting skip=D[31:9]=LUT[p][31:9] and enable skip
setting a jump PC=D[8:1]=LUT[p][8:1]

The write result is not required as there is no use for "i" externally.

Basically,it is a jump instruction to D[8:0] (fetched from LUT[p]) while simultaneously setting up the skip instruction with D[31:23] (fetched from LUT[p]). Jumps taken take 2 clocks.

msrobots · 2017-03-23 02:24

nice

kwinn · 2017-03-23 02:29

Heater. wrote: »

Given that the bits of the SKIP instruction may be determined at run time. Changing every time it is used. It is impossible for the IDE to highlight what is to be executed or not as you edit code.

It might be possible for a debugger. At run time the debugger knows what the bits in the SKIP are and can show you the source with the instructions to be executed highlighted.

All sounds like a distant dream to me.

Do you mean the bits will somehow be calculated by the running code? I was thinking it would be more a case of selecting one of several fixed bit maps.

potatohead · 2017-03-23 03:00

Thing is they can be dynamically generated. So far, examples are all fixed, but the value is fetched from a register, or specified ## style.

It can change or be computed during runtime.

Cluso99 · 2017-03-23 03:20

Chip,

This is a simpler explanation of my post a few back.

I could see this execution loop

loop            rfbyte  p               'get program bytecode                   5-18
                rdlut   i,p             'lookup long in lut                     2
                altd    i               'get snippet address from 9 LSBs        2
                push    #0              'push snippet address                   2
                shr     i,#9            'shift right to get 23-bit skip pattern 2
        _ret_   skip    i               'set skip pattern and execute snippet   2
                                                                               ------
                                                                    overhead = 15-28

being able to be utilised in all sorts of interpreters, as well as in special lookup tables for executing special routines.

Therefore, what about 1 new instruction "execute p" as follows...

loop            rfbyte  p               'get program bytecode                   5-18
                execute p                                                       2
                                                                               ------
                                                                    overhead = 7-20

EXECUTE D
"D" is a LUT address

The "EXECUTE D" instruction when run as:

        EXECUTE   p             ' "p" is a LUT address

performs the following in a single instruction of 2 clocks        

        JUMP    D[8:0]             ' jump to address fetched from LUT as D[8:0]
        SKIP    D[31:9]            ' set skip to data fetched from LUT as D[31:9]

So "EXECUTE D" can be thought of a JUMP instruction where the D contents are in LUT (not COG) and
concurrently a SKIP instruction is setup using those D contents from LUT.
There is no need for a writeback to COG/LUT because those D contents from LUT are not required to be saved.

This would save 8 clocks overhead for every bytecode (down from 15-28).

jmg · 2017-03-23 05:00

kwinn wrote: »

Do you mean the bits will somehow be calculated by the running code? I was thinking it would be more a case of selecting one of several fixed bit maps.

Well, yes the choices are not 2^32, but once you have more than a single constant, it becomes a parameter / register variable, and that is calculated by the running code.
A P2 debug should be able to step over the used lines ( but we do not yet have a P2 debug).

Hopefully solid Debug is one thing Chip tests before sign-off, as I can expect simple HW tweaks can make a big difference to operation an accuracy.

cgracey · 2017-03-23 10:24

Cluso99 wrote: »

So, in the above, there is no execution penalty for these small skips. If there are more than 8 skips in a row, add an extra 2 clocks for each set of 8.

Yes, that calculation is correct.

Here is the timing for that loop:

loop            rfbyte  p               '2 (comes from the FIFO)
                rdlut   i,p             '3
                altd    i               '2
                push    #0              '2
                shr     i,#9            '2
        _ret_   skip    i               '4 (because of the _ret_, otherwise 2)

If you look at those last four instructions, they are just setting up a SKIP pattern from i[31:9] and jumping to i[8:0]. There's a lot of monkey motion there.

Someone in a later post was talking about making a single instruction which does the RDLUT, along with those last four instructions, in order to form a complete bytecode executor. I agree, but making the RDLUT a part of it doesn't save much and it limits its use.

I moved the RET/RETA/RETB instructions to overlap with CALL/CALLA/CALLB, since they could be differentiated by opposite immediate bits. This freed up three {#}D instruction slots, into which I placed SKIP and two other new ones that really take care of business:

EEEE 1101011 CZ0 DDDDDDDDD 000101100        JMP     D           {WC,WZ}
EEEE 1101011 CZ0 DDDDDDDDD 000101101        CALL    D           {WC,WZ}
EEEE 1101011 CZ1 000000000 000101101   *    RET                 {WC,WZ}
EEEE 1101011 CZ0 DDDDDDDDD 000101110        CALLA   D           {WC,WZ}
EEEE 1101011 CZ1 000000000 000101110   *    RETA                {WC,WZ}
EEEE 1101011 CZ0 DDDDDDDDD 000101111        CALLB   D           {WC,WZ}
EEEE 1101011 CZ1 000000000 000101111   *    RETB                {WC,WZ}

EEEE 1101011 00L DDDDDDDDD 000110000        JMPREL  D/#
EEEE 1101011 00L DDDDDDDDD 000110001   **   SKIP    D/#
EEEE 1101011 00L DDDDDDDDD 000110010   **   SHRED   D/#
EEEE 1101011 00L DDDDDDDDD 000110011   **   SHREDR  D/#

These new SHRED and SHREDR instructions are as follows:

SHRED {#}D - JMP to {10'b0, D[9:0]} and use D[31:10] as a SKIP pattern.

SHREDR {#}D - CALL to {10'b0, D[9:0]} and use D[31:10] as a SKIP pattern.

You can see that they can jump anywhere in cog and lut space and set 22 bits of SKIP data. They each take 4 clocks. SHREDR means shred and return, as opposed to just shred.

Now, our interpreter loop looks like this:

loop            rfbyte  p               '2  get next bytecode
                rdlut   i,p             '3  translate into long
		shred	i		'4  jump to snippet with skips

		(returns to loop on _ret_ with pre-stuffed stack)

If you wanted to save three clocks, you could just do an 'RFLONG i' and skip the RDLUT.

I've got all this implemented and tested, so I'll start compiling the FPGA images.

cgracey · 2017-03-23 10:35

Seairth wrote: »

Cluso99 wrote: »

On page 1 of this thread, Chip said

This '_RET_' is an implied return that uses code %0000 of the conditional field. This means NOP cannot be $00000000, anymore, as it would become a RET instruction. Having implied RET's saves lots of instructions, especially in a situation like this where you have lots of snippets of code that get CALL'd. Each _RET_ saves two clocks and a long.

So the _RET_ is free!

That can't be right, unless the _RET_ flag is being processed at the first stage of the pipeline. But that would have to come with a lot of caveats and edge cases! So, I'm assuming that the _RET_ is still being processed when the "host" instruction is executed. In which case...

You are only saving the cost of the additional RET instruction (i.e. "saves two clocks and a long"). But _RET_ should still cause the current pipelined instructions to be cancelled and the popped instruction branch to be loaded, just like a RET instruction would have caused. Looking at the instruction listing, RET costs 4 clocks. Presumably, that's 1 for executing and 3 for reloading the pipeline. In the case of something like "_ret_ skip #0", the SKIP will take two cylces to execute and the _ret_ will cause 3 additional cycles for reloading the pipeline.

_RET_ adds two clocks to whatever instruction it's in front of. Those two clocks are the trailing instruction being flushed through the pipeline. Of course, it takes more clocks in hub-exec because of the FIFO.

The _RET_ occurs simultaneously with the instruction executing. It happens in the background, as it's just a matter of updating the PC from the bottom of the 8-level hardware stack.

cgracey · 2017-03-23 10:42

Seairth wrote: »

@cgracey, what happens if there's an active skip mask inside of a rep block? Will the rep still work correctly?

When REP is running in hub-exec, it actually does a JMP to get back to the top of the block, which entails resetting and reloading the FIFO. In cog-exec, the PC just gets changed in order to loop. So, I think SKIP would work in either case, but it may not work the same, as there'd be an extra SKIP bit consumed on each block iteration in hub-exec mode. In any case, SKIP would behave the same in both modes if it was just used within the code block, and not when it looped.

Whoa! This would blow up in cog-exec mode, because the REP logic counts instructions. If SKIP is causing fewer or variable numbers of instructions to execute, it's not going to work, unless you planned very carefully with perhaps constant instruction counts in every SKIP case, and set the REP block size accordingly.

All REP does is get you out of needing a 4-clock branch instruction. It's only useful/beneficial for blocks of only several instructions, I think. There's probably no need for SKIP inside REP. But, you may find some value in it. Seems like a recipe for madness, to me.

cgracey · 2017-03-23 10:45

Cluso99 wrote: »

Chip,
I could see this execution loop

loop            rfbyte  p               'get program bytecode                   5-18
                rdlut   i,p             'lookup long in lut                     2
                altd    i               'get snippet address from 9 LSBs        2
                push    #0              'push snippet address                   2
                shr     i,#9            'shift right to get 23-bit skip pattern 2
        _ret_   skip    i               'set skip pattern and execute snippet   2
                                                                               ------
                                                                    overhead = 15-28

being able to be utilised in all sorts of interpreters, as well as in special lookup tables for executing special routines.

Therefore, what about 1 new instruction "execute p" as follows...

loop            rfbyte  p               'get program bytecode                   5-18
                execute p                                                       2
                                                                               ------
                                                                    overhead = 7-20

pipeline clocks for "execute p" would become:
1:      rd "execute p" instruction 
2:      rd lut address "p" into internal 32bit register "i" (maybe same as SETQ2 register??)
3:      forward i[8:0] to PC and cog address so that this clock fetches the next instruction (ie the _RET_ goto)          
        load internal skip register with i[31:9] as skip[22:0] & set skip valid
4:
5:      optional: wr i[31:0] to cog register "p"
6:

By directly forwarding "i" (= contents of lut "p")into the return address and skip register, the push/pop is not required,
meaning neither is altd i and shr i,#9.
The "execute p" instruction effectively performs the following in 2 clocks (saving 8 clocks every bytecode)...
                rdlut   i,p             'lookup long in lut                     2
                altd    i               'get snippet address from 9 LSBs        2
                push    #0              'push snippet address                   2
                shr     i,#9            'shift right to get 23-bit skip pattern 2
        _ret_   skip    i               'set skip pattern and execute snippet   2
                                                                               ------
                                                                               10

This would save 8 clocks overhead for every bytecode (down from 15-28).

We were thinking the same things.

Ariba · 2017-03-23 10:52

I like the names of the new instructions

Ad: "Propeller 2, the controller with integrated code shredder"

Andy

Seairth · 2017-03-23 10:58

Why SHRED?

Cluso99 · 2017-03-23 11:04

cgracey wrote: »
Cluso99 wrote: »

So, in the above, there is no execution penalty for these small skips. If there are more than 8 skips in a row, add an extra 2 clocks for each set of 8.

Yes, that calculation is correct.

Here is the timing for that loop:
loop            rfbyte  p               '2 (comes from the FIFO)
                rdlut   i,p             '3
                altd    i               '2
                push    #0              '2
                shr     i,#9            '2
        _ret_   skip    i               '4 (because of the _ret_, otherwise 2)
If you look at those last four instructions, they are just setting up a SKIP pattern from i[31:9] and jumping to i[8:0]. There's a lot of monkey motion there.

Someone in a later post was talking about making a single instruction which does the RDLUT, along with those last four instructions, in order to form a complete bytecode executor. I agree, but making the RDLUT a part of it doesn't save much and it limits its use.

I moved the RET/RETA/RETB instructions to overlap with CALL/CALLA/CALLB, since they could be differentiated by opposite immediate bits. This freed up three {#}D instruction slots, into which I placed SKIP and two other new ones that really take care of business:
EEEE 1101011 CZ0 DDDDDDDDD 000101100        JMP     D           {WC,WZ}
EEEE 1101011 CZ0 DDDDDDDDD 000101101        CALL    D           {WC,WZ}
EEEE 1101011 CZ1 000000000 000101101   *    RET                 {WC,WZ}
EEEE 1101011 CZ0 DDDDDDDDD 000101110        CALLA   D           {WC,WZ}
EEEE 1101011 CZ1 000000000 000101110   *    RETA                {WC,WZ}
EEEE 1101011 CZ0 DDDDDDDDD 000101111        CALLB   D           {WC,WZ}
EEEE 1101011 CZ1 000000000 000101111   *    RETB                {WC,WZ}

EEEE 1101011 00L DDDDDDDDD 000110000        JMPREL  D/#
EEEE 1101011 00L DDDDDDDDD 000110001   **   SKIP    D/#
EEEE 1101011 00L DDDDDDDDD 000110010   **   SHRED   D/#
EEEE 1101011 00L DDDDDDDDD 000110011   **   SHREDR  D/#
These new SHRED and SHREDR instructions are as follows:

SHRED {#}D - JMP to {10'b0, D[9:0]} and use D[31:10] as a SKIP pattern.

SHREDR {#}D - CALL to {10'b0, D[9:0]} and use D[31:10] as a SKIP pattern.

You can see that they can jump anywhere in cog and lut space and set 22 bits of SKIP data. They each take 4 clocks. SHREDR means shred and return, as opposed to just shred.

Now, our interpreter loop looks like this:
loop            rfbyte  p               '2  get next bytecode
                rdlut   i,p             '3  translate into long
		shred	i		'4  jump to snippet with skips

		(returns to loop on _ret_ with pre-stuffed stack)
If you wanted to save three clocks, you could just do an 'RFLONG i' and skip the RDLUT.

I've got all this implemented and tested, so I'll start compiling the FPGA images.

WTG Chip!

Good time saving for code executed for every byte code.

Curious to know if AUGD #$xxx followed by SHRED/SHREDR #D would work too?

BTW What about JMPSKIP or JMPSKP and CALLSKP ?

cgracey · 2017-03-23 11:35

Cluso99 wrote: »

...Good time saving for code executed for every byte code.

Curious to know if AUGD #$xxx followed by SHRED/SHREDR #D would work too?

BTW What about JMPSKIP or JMPSKP and CALLSKP ?

AUGD would work before SHRED.

I was thinking of JMPSKP and CALLSKP, and renaming SKIP to SKP, too.

cgracey · 2017-03-23 11:36

Seairth wrote: »

Why SHRED?

Lots of snippets doing varied things quickly.

cgracey · 2017-03-23 11:38

Maybe more like this.

Fast Bytecode Interpreter

Comments