Fast Bytecode Interpreter

David Betz · 2017-03-22 00:27

Rayman wrote: »

Googling around, I see AVR has this:

SBIC - Skip if Bit in I/O Register is Cleared
Description:
This instruction tests a single bit in an I/O register and skips the next instruction if the bit is cleared. This instruction operates on the lower 32 I/O registers - addresses 0-31.

This is pretty wild too, let an I/O pin control whether code is executed or not...

At least it's a local effect. The PDP-10 had lots of SKIP instructions but they would only skip the immediately following instruction.

Roy Eltham · 2017-03-22 00:27

David Betz,
It's not an objection to them, it's the combination of them with templates to create monstrous meta-programming :). Also, C++ syntax for them is funky.

FYI, I think lambdas are awesome and I use them.

David Betz · 2017-03-22 00:28

Roy Eltham wrote: »

jmg,
A novice is not coding in ASM.

Funny you should say that. I've seen many people here rave about how P1 assembly code is really easy. I guess that won't be true of P2 code.

jmg · 2017-03-22 00:29

ozpropdev wrote: »
Maybe Pnut could support a simple bit mask facility like this
		skip	##!{1,6,13,14}    '   "use these"
		jmp	#rw_mem
'instead of
		skip	##%001111110111100	'execute jmp/rfbyte/add/rdlong/pusha
		jmp	#rw_mem

Yup, there are many ways simple extensions to ASM can reduce the housekeeping risks here.

One litmus test: can you add or comment out single lines of code without breaking anything?

Hand-maintaining masks fails that test.

Label-based schemes do not.
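A minimal Python sketch of the label-based idea, not a proposal for actual Pnut syntax. It assumes the convention the binary mask above implies: bit 0 belongs to the first instruction after SKIP, and a 1 bit means "skip". The placeholder labels (i2, i3, ...) and the mapping of jmp/rfbyte/add/rdlong/pusha to positions 0, 1, 6, 13 and 14 are assumptions read off the comment in that example.

```python
def skip_mask(sequence, execute):
    """Build a SKIP mask from instruction labels instead of hand-counted bits.

    sequence: labels of the instructions following SKIP, in order (must be unique).
    execute:  set of labels that should run; every other instruction gets a 1 bit.
    """
    if len(set(sequence)) != len(sequence):
        raise ValueError("labels must be unique")
    unknown = set(execute) - set(sequence)
    if unknown:
        raise ValueError(f"labels not in sequence: {sorted(unknown)}")
    mask = 0
    for position, label in enumerate(sequence):
        if label not in execute:
            mask |= 1 << position  # 1 = skip this instruction
    return mask


# Hypothetical 15-instruction sequence; only the five named mnemonics
# come from the comment in the example, the rest are placeholders.
SEQUENCE = ["jmp", "rfbyte", "i2", "i3", "i4", "i5", "add",
            "i7", "i8", "i9", "i10", "i11", "i12", "rdlong", "pusha"]
USED = {"jmp", "rfbyte", "add", "rdlong", "pusha"}

print(format(skip_mask(SEQUENCE, USED), "015b"))  # prints 001111110111100
```

Because the mask is derived from labels, inserting or removing a line in the sequence shifts the bits automatically, which is the litmus test above.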

David Betz · 2017-03-22 00:31

jmg wrote: »
ozpropdev wrote: »
Maybe Pnut could support a simple bit mask facility like this
		skip	##!{1,6,13,14}    '   "use these"
		jmp	#rw_mem
'instead of
		skip	##%001111110111100	'execute jmp/rfbyte/add/rdlong/pusha
		jmp	#rw_mem
Yup, there are many ways simple extensions to ASM can reduce the housekeeping risks here.

One litmus test: can you add or comment out single lines of code without breaking anything?

Hand-maintaining masks fails that test.

Label-based schemes do not.

It would be even better if you could use labels to indicate which instructions to skip. However, it's still going to be hard to see what the code will do with various bit patterns.

Roy Eltham · 2017-03-22 00:33

I think PASM is easier than other MCU/CPU ASMs. I love coding in PASM, but it's not novice level coding.

Novice level is things like Spin (or a subset of it), BASIC, Blockly, etc.

Maybe I'm wrong, but I think it's a stretch to equate anything ASM with novice coding.

potatohead · 2017-03-22 00:46

Coding p2 assembly remains easy. Through all of this, one can treat it like a P1.

The difference is a lot more capability.

We added events, specific triggers and vectors, aka interrupts, and what the chip can do at this process physics went way up. Each of these things added to that.

Same reasoning the whole way.

So far, I've written bog standard PASM only having to know a few things, P1 style. If that works, ok fine.

This stuff is for when doing that won't work.

People go to PASM to get things done they can't do otherwise. In this vein, we have done very well. A lot is possible now.

They also go to it for hardware access. We did well there too.

Some will just go there because they like programming at that explicit level. We made a playground.

What I like is we still have kept the ability to read PASM and then use what we read. It's got some sharp edges now, but not so sharp that people won't pick it up.

Early on, a design goal was to maximize what people can do at the hardware level. They don't have to do that, and many won't. But when they do, this design is gonna deliver.

jmg · 2017-03-22 00:49

Roy Eltham wrote: »

I can think of cases where I would not want the skip to be cancelled by a ret. I think it's better if it's not cancelled by anything (except another skip).

Can you elaborate ?

jmg · 2017-03-22 00:51

Roy Eltham wrote: »

jmg,
A novice is not coding in ASM.

Hehe, everyone here was a novice at some time.
Anyone moving onto ASM from Spin, or C, is by definition a novice.
Even someone quite used to Assembler on a PIC, AVR, or 8051, will be a novice when they encounter a P2.

It is a mistake to ignore that.

David Betz · 2017-03-22 00:52

cgracey wrote: »

I see it as allowing for speed and compactness that is not achievable by any other means. This is ninja programming in assembler.

Of course, one function per instruction will always be faster. Another solution is to reduce your instruction set so that the code fits in COG+HUB without using this trick. Also, it does nothing for compactness of byte code. It just helps with compactness of the VM at the expense of execution speed of the byte code instructions. Try using something like the CMM instruction set used by PropGCC. It achieves almost the code density of Spin byte codes but runs faster.

Rayman · 2017-03-22 00:57

I can see how fitting everything in COG+LUT is something to strive for.
This could help other things besides VMs...

potatohead · 2017-03-22 00:59

How many times has code been unrolled and/or repeated in blocks to ensure real-time response on a multi-case basis?

This skip improves on that very considerably. I'm travelling, so my access here is limited, but this gain is why I said worth it earlier.

Code density can go up, but so can choices or decisions per time unit. That's a big gain in signal generate or signal acquire and response.

David Betz · 2017-03-22 01:08

potatohead wrote: »

How many times has code been unrolled and/or repeated in blocks to ensure real-time response on a multi-case basis?

This skip improves on that very considerably. I'm travelling, so my access here is limited, but this gain is why I said worth it earlier.

Code density can go up, but so can choices or decisions per time unit. That's a big gain in signal generate or signal acquire and response.

Ignoring the code size issues, how does SKIP allow you to do anything you can't do with separate code sequences for each path? How would it be faster?

potatohead · 2017-03-22 01:10

In COG, it adds to the PC, basically planned in advance jmp.

David Betz · 2017-03-22 01:11

potatohead wrote: »

In COG, it adds to the PC, basically planned in advance jmp.

But you don't need to add to the PC if you're running separate code sequences. That is not an optimization.

Dave Hein · 2017-03-22 01:25

jmg wrote: »

Dave Hein wrote: »

I'm OK with the SKIP instruction. I hope that anybody that uses it documents it well so the code is semi-readable. In my opinion, I don't think it's necessary to complicate the assembler by adding special support for the SKIP instruction. Let people handcode the skip patterns. Why take the fun out of it by letting the assembler generate the patterns.

My approach is a little different.
I like to write code that is inherently clear, and safe.
PC's and software exist to manage the housekeeping.
That is why I already use Macros, and Conditionals in my Assembly
Mundane Housekeeping is less 'fun', than risk.
"Let people handcode the skip patterns" seems like quite bad advice to give any novice, but there was no winky here.

I guess I'm just hoping to see the P2 sometime in the foreseeable future.

tonyp12 · 2017-03-22 01:26

ARM Cortex has a similar IT instruction (If Then)
you can do: ITTT or ITTEE etc, you only get max 4 bit pattern but you do get Else.

I think they are better than conditional as they are not NOP and actually change PC.

David Betz · 2017-03-22 01:32

tonyp12 wrote: »

ARM Cortex has a similar IT instruction (If Then)
you can do: ITTT or ITTEE etc, you only get max 4 bit pattern but you do get Else.

I think they are better than conditional as they are not NOP and actually change PC.

Hmmm... I guess I'd better keep quiet. :-)

Seairth · 2017-03-22 01:34

Rayman wrote: »

Might be interesting to have an ENDSKIP command that cannot be skipped.
Or, not let SKIP be skipped...

Oh!!! We do!

_ret_ skip #0

edit: Okay. This is actually not an answer to you, but it is what made me think of it. To that point:

No need to make RET special. We already have an ability to cancel the skip on return (rather, return on cancel of skip). Of course, that means the instruction can't use the conditional predicates, but that's probably not a big deal.

David Betz · 2017-03-22 01:39

tonyp12 wrote: »

ARM Cortex has a similar IT instruction (If Then)
you can do: ITTT or ITTEE etc, you only get max 4 bit pattern but you do get Else.

I think they are better than conditional as they are not NOP and actually change PC.

Actually, this sounds kind of handy since the bits in this case mean "then" and "else". I suppose you could do that with SKIP with a conditional instruction before the SKIP to invert the mask depending on a boolean. Not quite as compact as the ARM incarnation though. Let's hope that ARM didn't patent this idea since the SKIP instruction might run afoul of the patent.

jmg · 2017-03-22 01:39

David Betz wrote: »

Ignoring the code size issues, how does SKIP allow you to do anything you can't do with separate code sequences for each path? How would it be faster?

The way I see it, skip buys size in COG code (which can translate to speed elsewhere, by fitting more code into COGs), but it can buy speed in HUBexec, as it allows the fetch fifo to stream, and it can also buy speed in XIP uses, where the 'chunk handler' can serve larger bites, perhaps even whole functions.

David Betz · 2017-03-22 01:44

jmg wrote: »

David Betz wrote: »

Ignoring the code size issues, how does SKIP allow you to do anything you can't do with separate code sequences for each path? How would it be faster?

The way I see it, skip buys size in COG code (which can translate to speed elsewhere, by fitting more code into COGs), but it can buy speed in HUBexec, as it allows the fetch fifo to stream, and it can also buy speed in XIP uses, where the 'chunk handler' can serve larger bites, perhaps even whole functions.

Can't hubexec stream straight line code? In fact, SKIP is slower in hubexec mode because it just cancels the instructions it skips. I'd have to see some examples of how you intend to do XIP to understand how SKIP would help that.

tonyp12 · 2017-03-22 01:51

>do that with SKIP with a conditional instruction before the SKIP to invert the mask depending on a boolean. Not quite as compact as the ARM

Yes, the P2 would need two skip instructions, but you don't necessarily just always invert the mask; you could sprinkle in some do_always.

David Betz · 2017-03-22 01:59

tonyp12 wrote: »

>do that with SKIP with a conditional instruction before the SKIP to invert the mask depending on a boolean. Not quite as compact as the ARM

Yes, the P2 would need two skip instructions, but you don't necessarily just always invert the mask; you could sprinkle in some do_always.

I noticed a post that says the ARM C compiler actually generates some of these IT sequences. Even just a four instruction sequence is hard to read and understand at a glance unless you do something simple like ITTEE.

Seairth · 2017-03-22 02:01

David Betz wrote: »

potatohead wrote: »

In COG, it adds to the PC, basically planned in advance jmp.

But you don't need to add to the PC if you're running separate code sequences. That is not an optimization.

Re-read Chip's original post on this. He had potentially 108 permutations to deal with. Obviously, he's not going to create a separate snippet for each permutation, though that would certainly be the fastest code. The obvious solution is to make it procedural, but now the same procedure must be executed every time, meaning that none of the permutations are as efficient as they could be. This is the classic trade-off between code efficiency and execution efficiency (though usually we tend to look at it the other direction, when we do things like unroll loops, inline functions to avoid call overhead, etc).

What he is doing is actually the same as the procedural code, except that he's applying the conditional branches at the beginning of the block instead of inline. In other words, if he had written this as procedural code, most of what you see there would still be in that order, but would have had some conditional jumps (and/or bit tests with conditionals) sprinkled in as well.

Of course, he could have stopped right there. That alone would have sped things up. But then he realized that he could advance the PC by greater steps and get even faster (and that was purely an implementation detail, nothing to do with the instruction itself).
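To make the "decisions up front" point concrete, here is a small Python model of a shared instruction sequence driven by a skip mask. It is only a sketch: the instruction names are placeholders, the bit convention (bit 0 = first instruction, 1 = skip) is assumed, and it models which instructions run, not how many clocks they take.

```python
def run_with_skip(sequence, mask):
    """Return the instructions that execute for a given skip mask.

    All the branching decisions live in the mask, chosen before the block
    starts, so no conditional jumps are needed inside the block itself.
    """
    executed = []
    for instruction in sequence:
        if mask & 1 == 0:      # 0 = execute, 1 = skip
            executed.append(instruction)
        mask >>= 1             # move on to the next instruction's bit
    return executed


# Placeholder sequence; one shared block serves several variants.
BLOCK = ["fetch_index", "scale_index", "add_base", "read", "write"]

print(run_with_skip(BLOCK, 0b10000))  # indexed read: everything but "write"
print(run_with_skip(BLOCK, 0b01011))  # plain write: "add_base" and "write" only
```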

jmg · 2017-03-22 02:02

David Betz wrote: »

Can't hubexec stream straight line code?

Well yes, but straight line code is not conditional blocks. Skip saves jumps around blocks.

David Betz wrote: »

In fact, SKIP is slower in hubexec mode because it just cancels the instructions it skips.

I'm not sure it just does that. My reading of Chip's comments was that intra-block skips (currently size 8) were faster.

David Betz wrote: »

I'd have to see some examples of how you intend to do XIP to understand how SKIP would help that.

External Serial code memory (be it HyperFlash, hyperRAM, QuadSPI or similar), has a significant number of cycles overhead for any change in address & works better when it can stream bytes.

Skip allows moderately sized conditional blocks, without jumps.

An ideal XIP (as other MCUs offer it) sets an address and streams from QuadSPI.
P2 XIP is going to be a notch down from that, as there is no hardware flash manager.

So I'd expect one cog (and compilers) to manage in SW fetches of larger chunks of code.
The larger the chunk, the less the address-change cost is as a %.
In some cases whole functions could load into a buffer and execute, but in cases where that is not possible, some means to reduce address-issues will give faster overall speed.
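The chunk-size argument is simple arithmetic, sketched here in Python. The cycle counts are made-up placeholders, not measurements of any real flash or RAM part; only the shape of the curve matters.

```python
def address_overhead_fraction(setup_cycles, chunk_bytes, cycles_per_byte):
    """Fraction of a chunk fetch spent on the address change rather than data.

    setup_cycles:    fixed cost of issuing a new address (assumed value).
    chunk_bytes:     bytes streamed after that address.
    cycles_per_byte: streaming cost per byte (assumed value).
    """
    transfer_cycles = chunk_bytes * cycles_per_byte
    return setup_cycles / (setup_cycles + transfer_cycles)


# Hypothetical numbers: 20 cycles per address change, 2 cycles per byte.
for chunk in (8, 32, 128, 512):
    share = address_overhead_fraction(20, chunk, 2)
    print(f"{chunk:4d} bytes per chunk: {share:.0%} of the time is address overhead")
```

Anything that lets a chunk be consumed without a new address, such as skipping over a short conditional block instead of jumping around it, pushes the fetch toward the low-overhead end of that list.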

tonyp12 · 2017-03-22 02:15

>Even just a four instruction sequence is hard to read and understand at a glance unless you do something simple like ITTEE.

It looks like they force you to show your intent, something similar could be used on P2 that generates the two masks for then_else.

Although other Thumb instructions are unconditional, all instructions that are made conditional by an IT instruction must be written with a condition.
These conditions must match the conditions imposed by the IT instruction. For example, an ITTEE EQ instruction imposes the EQ condition on the first two following instructions, and the NE condition on the next two.
Those four instructions must be written with EQ, EQ, NE and NE conditions respectively.
I agree that it's likely specified this way to help reduce programming errors, as the condition code isn't encoded in the machine opcode.
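A rough Python sketch of what generating the two masks for then_else could look like. The pattern notation is hypothetical: T and E follow the ARM naming, A marks a do_always instruction, and the mask convention (bit 0 = first instruction, 1 = skip) is assumed. Nothing here is an existing assembler feature.

```python
def then_else_masks(pattern):
    """Turn a pattern such as "TTEE" into (then_mask, else_mask).

    T: runs only when the condition is true  (skipped in the else mask).
    E: runs only when the condition is false (skipped in the then mask).
    A: always runs (clear in both masks).
    """
    then_mask = 0
    else_mask = 0
    for position, kind in enumerate(pattern):
        if kind == "T":
            else_mask |= 1 << position
        elif kind == "E":
            then_mask |= 1 << position
        elif kind != "A":
            raise ValueError(f"unknown pattern letter {kind!r}")
    return then_mask, else_mask


print([bin(m) for m in then_else_masks("TTEE")])   # prints ['0b1100', '0b11']
print([bin(m) for m in then_else_masks("ATTEA")])  # prints ['0b1000', '0b110']
```

With only T and E in the pattern the two masks are bitwise complements of each other, which is the "invert the mask" case; as soon as an A appears they are not, which is why two separate masks are generated.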

David Betz · 2017-03-22 02:19

jmg wrote: »

David Betz wrote: »

Can't hubexec stream straight line code?

Well yes, but straight line code is not conditional blocks. Skip saves jumps around blocks.

David Betz wrote: »

In fact, SKIP is slower in hubexec mode because it just cancels the instructions it skips.

I'm not sure it just does that. My reading of Chip's comments was that intra-block skips (currently size 8) were faster.

David Betz wrote: »

I'd have to see some examples of how you intend to do XIP to understand how SKIP would help that.

External Serial code memory (be it HyperFlash, hyperRAM, QuadSPI or similar), has a significant number of cycles overhead for any change in address & works better when it can stream bytes.

Skip allows moderately sized conditional blocks, without jumps.

An ideal XIP (as other MCUs offer it) sets an address and streams from QuadSPI.
P2 XIP is going to be a notch down from that, as there is no hardware flash manager.

So I'd expect one cog (and compilers) to manage in SW fetches of larger chunks of code.
The larger the chunk, the less the address-change cost is as a %.
In some cases whole functions could load into a buffer and execute, but in cases where that is not possible, some means to reduce address-issues will give faster overall speed.

I'm pretty sure Chip said that hub code would just cancel the skipped instructions and always increment the PC by 4.

I'm not sure I followed your XIP description completely but I'll take your word for it. It seems writing compilers for P2 will be more difficult than it was for P1 to make good use of the instruction set. Anyone want to attempt an LLVM-based compiler?

cgracey · 2017-03-22 02:20

David Betz wrote: »

cgracey wrote: »

I see it as allowing for speed and compactness that is not achievable by any other means. This is ninja programming in assembler.

Of course, one function per instruction will always be faster. Another solution is to reduce your instruction set so that the code fits in COG+HUB without using this trick. Also, it does nothing for compactness of byte code. It just helps with compactness of the VM at the expense of execution speed of the byte code instructions. Try using something like the CMM instruction set used by PropGCC. It achieves almost the code density of Spin byte codes but runs faster.

No, it runs faster, too. For example, Spin2 hub read/write is achieved through different arrangements of 18 static instructions. There are 216 different permutations of those 18 instructions. With SKIP, I can execute just the patterns I want with no dead-time between the active instructions. That's a big win, and it takes only 4 clocks to set up.
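For the code-size side of the trade-off, a back-of-the-envelope model in Python. It assumes one long per instruction and one long per stored mask, and ignores dispatch and return overhead on both sides. The 18 comes from the post above; the mask values are invented for illustration.

```python
def code_size_longs(sequence_length, masks):
    """Compare (separate copies, shared block plus masks), both in longs.

    separate: every variant gets its own copy of just the instructions it runs.
    shared:   one copy of the full sequence plus one mask per variant.
    """
    full = (1 << sequence_length) - 1
    separate = sum(sequence_length - bin(mask & full).count("1") for mask in masks)
    shared = sequence_length + len(masks)
    return separate, shared


# Hypothetical: 12 variants of an 18-instruction block, each skipping 8 instructions.
print(code_size_longs(18, [0xFF] * 12))  # prints (120, 30)
```

The gap grows with the number of variants, since each extra variant costs one long in the shared scheme but a whole instruction sequence in the separate one.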

David Betz · 2017-03-22 02:24

cgracey wrote: »

David Betz wrote: »

cgracey wrote: »

I see it as allowing for speed and compactness that is not achievable by any other means. This is ninja programming in assembler.

Of course, one function per instruction will always be faster. Another solution is to reduce your instruction set so that the code fits in COG+HUB without using this trick. Also, it does nothing for compactness of byte code. It just helps with compactness of the VM at the expense of execution speed of the byte code instructions. Try using something like the CMM instruction set used by PropGCC. It achieves almost the code density of Spin byte codes but runs faster.

No, it runs faster, too. For example, Spin2 hub read/write is achieved through different arrangements of 18 static instructions. There are 216 different permutations of those 18 instructions. With SKIP, I can execute just the patterns I want with no dead-time between the active instructions. That's a big win, and it takes only 4 clocks to set up.

But with one instruction sequence per instruction, there are no dead instructions, are there? In any case, if you really need that many different permutations I guess you have no choice but something like this.
