Fast minimum-footprint bytecode interpreter

cgracey · 2013-05-08 11:24

After OPC, Potatohead mentioned a great idea to me: in-line assembly for Spin.

Currently, to execute assembly language, you must write a whole PASM program, launch it with COGNEW, then communicate with it via memory. This is cumbersome and requires a whole extra cog to do any PASM, at all.

If the Spin interpreter footprint could be made small by making it modular, there would be room in the cog for assembly to be loaded and executed, all within the context of Spin. Then you could get the benefit of assembly-language speed within your Spin code, not to mention being able to ease into PASM, without having to use a whole separate cog.

The key is making a modular interpreter that loads snippets of interpreter code as needed, based on compact Spin byte codes. Most snippets are under four instructions, so they could load very quickly with RDLONGC. The total size of all snippets could be far in excess of the cog's 512 registers, though, as they reside in hub RAM and are dynamically loaded and executed, as needed.

Anyway, after Potatohead talked to me about this, my head's been on fire and I've been thinking about how to make a small-footprint bytecode interpreter so that in-line assembly could be practical. It turns out the core interpreter only needs ~14 instructions, plus perhaps space for ~32 dynamic instructions, to accommodate potentially-large snippets. And it's actually faster to do this than to branch to one of many cog-resident routines.

'
' Get byte code, look up descriptor long, load code longs, execute code longs, repeat
'
nextbyte	rdbyte	x,ptra++		'get next byte code

		shl	x,#2			'convert byte code to descriptor address
		add	x,descbase

		setptrb	codebase		'point PTRB to code base

		rdlong	x,x			'get long descriptor

		movd	:reps,x			'descriptor[8..0] is number of code longs

		shr	x,#7			'descriptor[15..7] is code offset
		addptrb	x			'point PTRB to code longs

:reps		reps	#1,#1			'ready for fast in-line code load
		setindb	#code			'point INDB to in-line code
		rdlongc	indb++,ptrb++		'load code longs in-line using cached read

		mov	indb++,jmpback		'finish in-line code with jmp back
		shr	x,#9			'descriptor[31..16] is for code use

code		long	0[32]			'dynamic in-line code, loops to nextbyte


descbase	long	$1000			'address of 256 descriptor longs
codebase	long	$1400			'address of code longs
jmpback		jmp	#nextbyte
x		long	0			'variable

This is just the interpreter. To load in-line assembly, it could be $000..$1AF based, with Spin local variables sitting at $1B0..$1BF, or so. The top of memory is the interpreter and snippet space, while the stack RAM gets used for the Spin run-time stack. By exploiting Spin for its scoping and modularity, this framework might be an easy way to write extensive PASM apps.

Here is how in-line assembly could look:

PRI addvars(a, b)

  ASM

 	mov	result,a
	add	result,b

  ENDASM

David Betz · 2013-05-08 11:39

This looks cool! I'll have to look at it closer later. I did a VM that could execute native instructions for the xbasic system I worked on for a while. The inline PASM instructions allowed me to have a simpler VM that didn't need specific I/O instructions. It was written in PASM and interfaced to the same XMM cache drivers that PropGCC is currently using. In fact, the PropGCC cache drivers were taken originally from xbasic.

Seairth · 2013-05-08 12:00

This looks like a good general-purpose pattern for more than just SPIN. One note: are the indexes for the descriptor bit ranges correct in the comments? Also, would this mean that the inline PASM would execute exactly the same way that the interpreter code snippets would, almost as if you are creating your own language extensions?

Sapieha · 2013-05-08 12:33

Hi Chip.

It Can be nice addition to Cluso's Debugger

Thanks

cgracey wrote: »
After OPC, Potatohead mentioned a great idea to me: in-line assembly for Spin.

Currently, to execute assembly language, you must write a whole PASM program, launch it with COGNEW, then communicate with it via memory. This is cumbersome and requires a whole extra cog to do any PASM, at all.

If the Spin interpreter footprint could be made small by making it modular, there would be room in the cog for assembly to be loaded and executed, all within the context of Spin. Then you could get the benefit of assembly-language speed within your Spin code, not to mention being able to ease into PASM, without having to use a whole separate cog.

The key is making a modular interpreter that loads snippets of interpreter code as needed, based on compact Spin byte codes. Most snippets are under four instructions, so they could load very quickly with RDLONGC. The total size of all snippets could be far in excess of the cog's 512 registers, though, as they reside in hub RAM and are dynamically loaded and executed, as needed.

Anyway, after Potatohead talked to me about this, my head's been on fire and I've been thinking about how to make a small-footprint bytecode interpreter so that in-line assembly could be practical. It turns out the core interpreter only needs ~14 instructions, plus perhaps space for ~32 dynamic instructions, to accommodate potentially-large snippets. And it's actually faster to do this than to branch to one of many cog-resident routines.
'
' Get byte code, look up descriptor long, load code longs, execute code longs, repeat
'
nextbyte    rdbyte    x,ptra++        'get next byte code

        shl    x,#2            'convert byte code to descriptor address
        add    x,descbase

        setptrb    codebase        'point PTRB to code base

        rdlong    x,x            'get long descriptor

        movd    :reps,x            'descriptor[8..0] is number of code longs

        shr    x,#7            'descriptor[15..7] is code offset
        addptrb    x            'point PTRB to code longs

:reps        reps    #1,#1            'ready for fast in-line code load
        setindb    #code            'point INDB to in-line code
        rdlongc    indb++,ptrb++        'load code longs in-line using cached read

        mov    indb++,jmpback        'finish in-line code with jmp back
        shr    x,#9            'descriptor[31..16] is for code use

code        long    0[32]            'dynamic in-line code, loops to nextbyte


descbase    long    $1000            'address of 256 descriptor longs
codebase    long    $1400            'address of code longs
jmpback        jmp    #nextbyte
x        long    0            'variable
This is just the interpreter. To load in-line assembly, it could be $000..$1AF based, with Spin local variables sitting at $1B0..$1BF, or so. The top of memory is the interpreter and snippet space, while the stack RAM gets used for the Spin run-time stack. By exploiting Spin for its scoping and modularity, this framework might be an easy way to write extensive PASM apps.

Here is how in-line assembly could look:
PRI addvars(a, b)

  ASM

     mov    result,a
    add    result,b

  ENDASM

Cluso99 · 2013-05-08 13:02

The concept seems fantastic. It is sort of like loading a pasm overlay to execute each bytecode.

LMM code and Inline (overlay) code each have their advantages, which depend upon the actual code being executed. Once you have a loop, then inline/overlay is faster, otherwise usually LMM will win.

Perhaps by using tasking, performance could be improved by operating two tasks - one reading ahead and fetching the next opcode and overlay, while the other is executing the prior bytecode fetched and loaded.

This might even have some use with Peters Tachyon.

cgracey · 2013-05-08 13:15

Seairth wrote: »

One note: are the indexes for the descriptor bit ranges correct in the comments? Also, would this mean that the inline PASM would execute exactly the same way that the interpreter code snippets would, almost as if you are creating your own language extensions?

You're right about the bit ranges being wrong. I need to fix that.

To run in-line assembler, I would run a snippet to load the code and set the context, then call the loaded code. After that, I would update the local variables from the copies that the assembler code used.

Dave Hein · 2013-05-08 13:25

..........

cgracey · 2013-05-08 15:17

Dave Hein wrote: »

...On the P2, the Spin interpreter could implement most of the frequently used bytecodes in optimized PASM, and the less frequently used bytecodes could be implemented in LMM PASM...

For short sequences of instructions, it seems faster to load it in-line and execute it than it does to determine and execute a branch to somewhere else.

TinkersALot · 2013-05-08 16:05

I know it is dangerous to eat pie in the sky while gazing at the world through a crystal ball but, are the coglets in the future with dynamic linking too? hehehe

jmg · 2013-05-08 16:11

cgracey wrote: »

.... After that, I would update the local variables from the copies that the assembler code used.

Can the in-line PASM not use the addresses that Spin uses, as needed, to save the moves for results and globals ?

Also, perhaps a more complex mix example of Spin + PASM would help in #1

pedward · 2013-05-08 17:01

I'm thinking that to make this work properly, you will need to do pointer offset fixups.

In your example above, you access SPIN arguments from PASM. In SPIN 1 the first 9 longs were cached in COG memory, I think it would be worthwhile to set aside 16 longs for local storage, then when you execute inline code, the compiler will do a fixup and adjust the D or S offsets to point to the local copy of the variable. This would mean that only the first 16 parameters and local variables would be accessible via inline PASM, that would be a nice limit.

Are you intent on limiting instructions only to those that are non-branching, or are you open to allowing any snippet that can fit in 2^5 instructions, run?

I think the latter makes this feature more useful and functional.

I would also recommend not using the descbase the way you're doing it, use the stack ram. You have 256 longs in stack and you shouldn't waste that much space with stack use, so put your descriptor table in stack.

Use SPA for the stack and SPB for descriptor lookup. It only costs 2 instructions to access this, whereas you spend a lot more than that in SPIN 1 to do the same thing.

I'd put descriptors in $80-$FF. If you allocate 16 longs for immediate access in SPIN, then you can put the overflow on the stack. In SPIN 1 you only had 64 opcodes, so having 128 should be useful in SPIN 2.

rogloh · 2013-05-08 17:40

What a great concept. I like it and mixed performance should be very good indeed.

For single PASM instruction snippets, you get 4 hub windows per "interpreted" instruction, and the same rate appears to hold for up to 4 instructions in the snippet - assuming 1 clock per PASM instruction and if the snippets themselves are quad aligned which should be possible. That is then effectively up to 5, 10, 15, or 20 raw PASM MIPs (for 1, 2, 3, 4 instruction snippet sizes) at 160MHz and the hub RAM could certainly hold a fairly large byte code program and any associated PASM code snippets.

Plus you can always have some instruction codes jumping off to COG resident functions that are executed more frequently or are more complex to avoid the extra load overload once the snippets get bigger. I guess that is how the SPIN part generally works (though I am less familiar with that aspect).

Cool!

cgracey · 2013-05-08 17:42

jmg wrote: »

Can the in-line PASM not use the addresses that Spin uses, as needed, to save the moves for results and globals ?

Also, perhaps a more complex mix example of Spin + PASM would help in #1

It seems to me that because locals are kept inside the cog, they are easy conduit for assembly code. Globals live in hub RAM and are not directly accessible in cog RAM, so if a local is used to pass a global variable address to PASM code, that would be about as straightforward as could be.

cgracey · 2013-05-08 17:49

pedward wrote: »

I'm thinking that to make this work properly, you will need to do pointer offset fixups.

In your example above, you access SPIN arguments from PASM. In SPIN 1 the first 9 longs were cached in COG memory, I think it would be worthwhile to set aside 16 longs for local storage, then when you execute inline code, the compiler will do a fixup and adjust the D or S offsets to point to the local copy of the variable. This would mean that only the first 16 parameters and local variables would be accessible via inline PASM, that would be a nice limit.

Are you intent on limiting instructions only to those that are non-branching, or are you open to allowing any snippet that can fit in 2^5 instructions, run?

I think the latter makes this feature more useful and functional.

I would also recommend not using the descbase the way you're doing it, use the stack ram. You have 256 longs in stack and you shouldn't waste that much space with stack use, so put your descriptor table in stack.

Use SPA for the stack and SPB for descriptor lookup. It only costs 2 instructions to access this, whereas you spend a lot more than that in SPIN 1 to do the same thing.

I'd put descriptors in $80-$FF. If you allocate 16 longs for immediate access in SPIN, then you can put the overflow on the stack. In SPIN 1 you only had 64 opcodes, so having 128 should be useful in SPIN 2.

Using stack RAM for descriptor lookup would only save a few clocks, but would tie up a huge resource (stack RAM).

Fixups wouldn't be necessary, as you could quickly relocate the 16 locals (result, parameters, variables) to fixed addresses which would be directly addressable by PASM code. Afterwards, you would restore them. There would be 16 cycles of overhead for each directional move. Not really a big deal to make a very simple addressing rule. 'Result' would always be at, say, $1B0, with parameters and variables following. Your PASM code would load at $000 and continue, potentially, to $1AF. No rules on what instructions you use (looping, etc). When you are done, you jump to $1C0 or allow the jump to $1C0 tacked onto the end of your code to execute.

Cluso99 · 2013-05-08 18:55

cgracey wrote: »

Using stack RAM for descriptor lookup would only save a few clocks, but would tie up a huge resource (stack RAM).

I came to the same conclusion for my variant of your spin interpreter. The stack is accessible from both ends, as well as semi-directly. I thought one end could be for the stack, and the other perhaps for a limited set of globals which would be relatively accessed from the end.

pedward · 2013-05-08 21:00

cgracey wrote: »

Using stack RAM for descriptor lookup would only save a few clocks, but would tie up a huge resource (stack RAM).

Fixups wouldn't be necessary, as you could quickly relocate the 16 locals (result, parameters, variables) to fixed addresses which would be directly addressable by PASM code. Afterwards, you would restore them. There would be 16 cycles of overhead for each directional move. Not really a big deal to make a very simple addressing rule. 'Result' would always be at, say, $1B0, with parameters and variables following. Your PASM code would load at $000 and continue, potentially, to $1AF. No rules on what instructions you use (looping, etc). When you are done, you jump to $1C0 or allow the jump to $1C0 tacked onto the end of your code to execute.

When I say fixups, I mean that variables in a local scope would be automatically address adjusted at compile time so you wouldn't need to do things such as r14, r15, r11 as named for statically located registers. The registers defined and scoped by DAT should be possible too, but maybe not.

The stack RAM is 256 entries, how do you expect all of this to get used? I looked at the code in the SPIN 1 interpreter and it takes self modifying code and about 4 instructions, but with the stack it only takes 2 to do indirect addressing, without pipeline restrictions.

I'm trying to think of ways to make the SPIN 2 interpreter as fast as possible. The code density of the new instruction set is already 30% better than the first Prop.

jmg · 2013-05-08 22:33

pedward wrote: »

I'm trying to think of ways to make the SPIN 2 interpreter as fast as possible. The code density of the new instruction set is already 30% better than the first Prop.

Fast is always nice, but this thread's approach make small quite important too, as that buys room for more PASM, and that suddenly gets a LOT faster. (not just 30%~50%).
You are right that keeping all the balls in the air re variable addresses, and lifetimes, will be tricky.
The failure modes can get quite subtle here.

I guess another tiny block runs the debugger / COG sniffer ?

cgracey · 2013-05-08 23:43

Here's the latest:

'
' Get byte code, look up descriptor long, load code longs, execute code longs, repeat
'
nextbyte	rdbyte	x,ptra++		'0..2	get next byte code

		shl	x,#2			'3	convert byte code to descriptor address
		add	x,descbase		'4

		setptrb	codebase		'5

		rdlong	x,x			'6..2	get descriptor long

		movd	:reps,x			'3	descriptor[8..0] holds code longs minus 1

		shr	x,#7			'4	descriptor[23..7] is code offset
		addptrb	x			'5	point PTRB to code longs

:reps		reps	#1,#1			'6	ready for fast in-line code load
		setindb	#code			'7	point INDB to in-line code
		rdlongc	indb++,ptrb++		'0..2	load code longs in-line using cached read

		mov	indb++,jmpback		'	finish in-line code with jmp back
		shr	x,#24-7			'	descriptor[31..24] is arbitrary value

code		long	0[32]			'	dynamic in-line code, loops to nextbyte


descbase	long	$1000			'address of 256 descriptor longs
codebase	long	$1400			'address of code longs
jmpback		jmp	#nextbyte		'jmp back instruction
x		long	0			'variable

cgracey · 2013-05-08 23:47

pedward wrote: »

When I say fixups, I mean that variables in a local scope would be automatically address adjusted at compile time so you wouldn't need to do things such as r14, r15, r11 as named for statically located registers. The registers defined and scoped by DAT should be possible too, but maybe not.

The stack RAM is 256 entries, how do you expect all of this to get used? I looked at the code in the SPIN 1 interpreter and it takes self modifying code and about 4 instructions, but with the stack it only takes 2 to do indirect addressing, without pipeline restrictions.

I'm trying to think of ways to make the SPIN 2 interpreter as fast as possible. The code density of the new instruction set is already 30% better than the first Prop.

I'm thinking that the stack RAM will hold the run-time stack, which is subroutine addresses, local variables, and math workspace. I think hard-limiting PUB/PRI's to 16 result/parameters/locals will allow almost 16 call depths, which should be plenty. Better to use stack RAM than cog RAM for that business. The cog RAM stays flexible.

pedward · 2013-05-09 01:21

I think I understand your hesitance to use the stack, every time you call another routine you have to push the locals onto the stack.

One advantage to using part of the stack for lookup is using RETx to cause a jump directly.

I was thinking that it might be nice if the memory were laid out like below, then you could do something like set the INB to jmpback, then use the post decrement form of INDB and PTRB to read the instructions. In this manner the inline PASM and locals could share a heap. If you want pure SPIN and have a lot of local storage, then you get 47 longs, but if you want inline PASM, you can have up to 47 instructions, it's up to the programmer to balance the usage.

result long 0
locals long 0[15]
pasm long 0[32]
jmpback jmp #nextbyte

The jmp is put at the very end so that you just back up however many instructions you load, then the jmp acts as a sentinel at the end of memory. It's a little messy, but you could SETINDB to #jmpback, then SETPTRB to the code offset + code size, and back down the reads with PTRB-- and INDB--.

Also, since you use indention to delineate blocks, why not keep with convention and omit the ENDASM keyword.

As a matter of historical convention, I like the keyword INLINE for inline assembly, it is just neat to me.

jmg · 2013-05-09 03:36

pedward wrote: »

Also, since you use indention to delineate blocks, why not keep with convention and omit the ENDASM keyword.

Because indent is a bad convention to follow....

pedward wrote: »

As a matter of historical convention, I like the keyword INLINE for inline assembly, it is just neat to me.

- but ASM is clearer, as it actually states what the language change is, and what is inside the delimiters is clearly ASM.
Clarity helps, especially if English is not your first language.

It could even be PASM..ENDPASM on that basis, then someone knows to reach for the PASM manual.

David Betz · 2013-05-09 05:17

jmg wrote: »

Because indent is a bad convention to follow....

- but ASM is clearer, as it actually states what the language change is, and what is inside the delimiters is clearly ASM.
Clarity helps, especially if English is not your first language.

It could even be PASM..ENDPASM on that basis, then someone knows to reach for the PASM manual.

Your suggestion about using PASM is a good one. In my xbasic interpreter I also allowed inline bytecode assembly so ASM might be ambiguous. Does it refer to Spin bytecodes or to PASM instructions?

potatohead · 2013-05-09 08:05

I like it too, and for the same reasons.

kwinn · 2013-05-09 10:21

potatohead wrote: »

I like it too, and for the same reasons.

Same here.

cgracey · 2013-05-09 10:25

Here is a short bytecode app that toggles pin0:

CON

  base	= $E80

DAT
'
'
' Get byte code, look up descriptor long, load code longs, execute code longs, repeat
'
		setptra	bytebase

nextbyte	rdbytec	x,ptra++		'0..2	get next byte code

		shl	x,#2			'3	convert byte code to descriptor address
		add	x,descbase		'4

		rdlong	x,x			'6..2	get descriptor long

		movd	:reps,x			'3	descriptor[8..0] holds code longs minus 1

		shr	x,#7			'4	descriptor[23..7] is code offset
		setptrb	x			'5	point PTRB to code longs

:reps		reps	#1,#1			'6	ready for fast in-line code load
		setindb	#code			'7	point INDB to in-line code
		rdlongc	indb++,ptrb++		'0..2	load code longs in-line using cached read

		mov	indb++,jmpback		'	finish in-line code with jmp back
		shr	x,#24-7			'	descriptor[31..24] is arbitrary value

code		long	0[32]			'	dynamic in-line code, loops to nextbyte


bytebase	long	base + bytes<<2
descbase	long	base + descs<<2
jmpback		jmp	#nextbyte		'jmp back instruction
x		long	0			'variable


bytes		byte	0	'push 0
		byte	1	'notp (pop)
		byte	2,$02	'loop


descs		long	(base + code0<<2)<<7 + 0
		long	(base + code1<<2)<<7 + 1
		long	(base + code2<<2)<<7 + 1


code0		pusha	#0			'constant 0

code1		popa	x			'notp
		notp	x

code2		rdbytec	x,ptra			'branch backwards by an 8-bit offset
		subptra	x

This 4-byte program loops every 64 clocks, which is pretty quick.

pedward · 2013-05-09 11:38

It seems obvious to me that "ASM" implies PASM because that's the only assembler that runs on this chip. I just like INLINE because that was how it is done in C, but because it conveys "hey, this is inline embedded code". Declaring the inline code as assembler isn't really necessary.

My argument for the use of whitespace as a block delimiter was to ensure the language addition was orthogonal; it looked like it was designed as a block element from day one. Having a closing block keyword just totally throws "conventional" language design into a language which is largely absent of it. Furthermore, you would then have to document exceptions that say "when you put code inside this block construct, whitespace doesn't matter, but everywhere else you must maintain strict order".

It's analogous to "do as I say, not as I do". You can't tell people to simultaneously adhere to strict structuring and then relax that structuring just for one element.

I am aware that PASM within a DAT block is entirely open and has no whitespace restrictions, but there is a clear line drawn between SPIN and PASM at that point, not so when embedded inline.

Embedded PASM is the gateway drug for new programmers, they dip their toe in the pool with easy inline assembler as a stepping stone to writing full on COG code.

I do feel that inline PASM has a much broader implications for the whole language -- it allows very fast and compact code without wasting a single COG. This will allow hybrid drivers for a lot of peripherals, opening the way for the drivers to be resource free, again multiplying the useful number of COGs.

While on that subject, I feel it would also be nice to expose the task switching and multi-threading capabilities in SPIN as well. The ability to launch COGLETs using these constructs, would be very powerful.

The SPIN interpreter could be written to have yield points within it and allow task switching to occur periodically, supporting perhaps 1-3 COGLETs for some small periodic code, or come to think of it, perhaps the resources can be shared such that between 1 and 4 SPIN interpreters can execute simultaneously per COG.

potatohead · 2013-05-09 12:40

I am of the opposite opinion, and it's because white space actually doesn't matter for the inline portion. When we put the PASM in to a SPIN program this way, it is encapsulated and compartmentalized. There is a line, and it's clear. Inside the block is PASM, outside of it is SPIN.

How can't you tell people?

Seriously, inside the block, it's PASM and you get to do PASM things, which are not SPIN things. Outside the block, it's SPIN, and you get to do SPIN things. Done this way, all the syntax, etc... surrounding PASM is entirely valid within a PASM block, making PASM consistent everywhere it appears.

We aren't merging SPIN and PASM, so much as providing an access right to the metal where needed.

jmg · 2013-05-09 13:36

potatohead wrote: »

I
Seriously, inside the block, it's PASM and you get to do PASM things, which are not SPIN things. Outside the block, it's SPIN, and you get to do SPIN things. Done this way, all the syntax, etc... surrounding PASM is entirely valid within a PASM block, making PASM consistent everywhere it appears.

Correct, and I'd make it PASM..ENDPASM to make it clear to even non-english programmers.

It allows direct & predictable cut/paste of PASM, ( & Conditional ASM ?) without suddenly finding your paste accidently triggered an unexpected indent inferred boundary.

Ariba · 2013-05-10 18:49

cgracey wrote: »

Here is a short bytecode app that toggles pin0:

CON

  base	= $E80

DAT
'
'
' Get byte code, look up descriptor long, load code longs, execute code longs, repeat
'
		setptra	bytebase

nextbyte	rdbytec	x,ptra++		'0..2	get next byte code

		shl	x,#2			'3	convert byte code to descriptor address
		add	x,descbase		'4

		rdlong	x,x			'6..2	get descriptor long

		movd	:reps,x			'3	descriptor[8..0] holds code longs minus 1

		shr	x,#7			'4	descriptor[23..7] is code offset
		setptrb	x			'5	point PTRB to code longs

:reps		reps	#1,#1			'6	ready for fast in-line code load
		setindb	#code			'7	point INDB to in-line code
		rdlongc	indb++,ptrb++		'0..2	load code longs in-line using cached read

		mov	indb++,jmpback		'	finish in-line code with jmp back
		shr	x,#24-7			'	descriptor[31..24] is arbitrary value

code		long	0[32]			'	dynamic in-line code, loops to nextbyte


bytebase	long	base + bytes<<2
descbase	long	base + descs<<2
jmpback		jmp	#nextbyte		'jmp back instruction
x		long	0			'variable


bytes		byte	0	'push 0
		byte	1	'notp (pop)
		byte	2,$02	'loop


descs		long	(base + code0<<2)<<7 + 0
		long	(base + code1<<2)<<7 + 1
		long	(base + code2<<2)<<7 + 1


code0		pusha	#0			'constant 0

code1		popa	x			'notp
		notp	x

code2		rdbytec	x,ptra			'branch backwards by an 8-bit offset
		subptra	x

This 4-byte program loops every 64 clocks, which is pretty quick.

Chip

64 clocks seems to be wrong for me. So I looked a bit closer. I think you loop only the last 2 bytecode instructions, so you have a lot of POPs without PUSHs. This only works because the POPs return zero and you toggle PIN 0. To loop all the instructions the last bytecode should be: 2,$03 and then it takes 96 clocks. And 96 is best case when the whole bytecode fits in a quad, if you cross quad boundaries it takes 112 clocks (just add another long before the bytecode to test this).

I like the flexibility of this bytecode interpreter, and I think it works about 10 times faster on a 160MHz Prop2 than Spin on the Prop 1, but an optimized Spin2 interpreter in a cog would maybe be 20..30 times faster.
Another concern I have is the hubmemory footprint of this bytecode interpreter. I think it will need more than 4kByte of hubram for all the Spin functions. The description table alone needs 1kB and then it will be hard to share subroutines for the overlays you load. Maybe a few native subroutines in addition to the interpreter loop in cogram helps to minimize hub-footprint.

Andy

cgracey · 2013-05-10 21:20

Ariba wrote: »

Chip

64 clocks seems to be wrong for me. So I looked a bit closer. I think you loop only the last 2 bytecode instructions, so you have a lot of POPs without PUSHs. This only works because the POPs return zero and you toggle PIN 0. To loop all the instructions the last bytecode should be: 2,$03 and then it takes 96 clocks. And 96 is best case when the whole bytecode fits in a quad, if you cross quad boundaries it takes 112 clocks (just add another long before the bytecode to test this).

I like the flexibility of this bytecode interpreter, and I think it works about 10 times faster on a 160MHz Prop2 than Spin on the Prop 1, but an optimized Spin2 interpreter in a cog would maybe be 20..30 times faster.
Another concern I have is the hubmemory footprint of this bytecode interpreter. I think it will need more than 4kByte of hubram for all the Spin functions. The description table alone needs 1kB and then it will be hard to share subroutines for the overlays you load. Maybe a few native subroutines in addition to the interpreter loop in cogram helps to minimize hub-footprint.

Andy

Very astute, Andy! I thought it was running too fast, too, but I errantly attributed it to the RDBYTEC instructions. You're right that the jump distance needs to be 3, not 2. It was a fluke that it was running correctly.

I figure, too, that the hub footprint will be about 4KB. Many little snippets could be combined into general-case snippets, with some conditional execution, but that will slow things down.

I like the idea of having just the bytecode expander/loader in cog RAM, as it leaves lots of memory free for inline assembly that conveniently ORGs to $000.

Roy Eltham · 2013-05-10 22:17

The Spin compiler could link in only the used snippets of the language for the program being compiled which would reduce the memory footprint.

In fact, it's possible for the compiler to generate the specific snippets from "general case" ones.

Fast minimum-footprint bytecode interpreter

Comments