Fast Bytecode Interpreter

Cluso99 · 2017-04-26 21:26

WOW! That's a fantastic achievement Chip.

I know that the maths side will give a huge improvement too. So really looking forward to see how the remaining portions go too.

jmg · 2017-04-26 21:41

cgracey wrote: »
repeat
  outa++

Curious what speeds these give :

repeat
  outa++
  outb++

repeat
  outa := outa+1
  outb := outb+1

jmg · 2017-04-26 21:45

potatohead wrote: »

At that speed, P2 SPIN will be comparable to P1 PASM.

P1 ASM would do the above in also 2 lines, for ~ 8x faster than P2 Spin.
P2 ASM could be the above via REP + one line, (a somewhat special case), and be ~8x faster than P1 ASM.

JasonDorie · 2017-04-26 23:35

potatohead wrote: »

At that speed, P2 SPIN will be comparable to P1 PASM.

2 pin toggles per microsecond puts that two instruction loop at about 500ns per iteration, right?
P1 PASM is 20 instructions per microsecond, so it's still an order of magnitude slower than ASM, but still an order of magnitude faster than P1 Spin. (I'm not complaining

)

(edit: yeah, what jmg said)

potatohead · 2017-04-27 00:20

At 160 Mhz, would it be 4 toggles?

Any yes, which is why I wrote "comparable" For a lot of things, the performance would largely be there. Point being, people could reasonably attempt things in P2 SPIN that would definitely be PASM on P1.

Add some well placed inline P2 PASM, and it could be golden. ;D

cgracey · 2017-04-27 00:24

Note that the scope picture was taken at 2us/division.

Rayman · 2017-04-27 00:28

And, I think I heard Spin2 will let you put ASM calls in cog, right?

cgracey · 2017-04-27 00:35

Rayman wrote: »

And, I think I heard Spin2 will let you put ASM calls in cog, right?

Yes.

potatohead · 2017-04-27 00:38

This is gonna rock Chip!

@all, interpreter numbers!

Rayman · 2017-04-27 01:14

I'm curious to see how ASM in SPIN2 cached to cog is going to be implemented. Seems like there could be a million ways to do it...

cgracey · 2017-04-27 02:00

I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):

%xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
%xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
%xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx

With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

jmg · 2017-04-27 02:13

cgracey wrote: »
I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
%xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
%xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
%xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

Not sure I follow ?
You have an 8 bit field to the left of the table examples, is this a new bytecode proposed, or a new P2 opcode, or 3 new opcodes ?

cgracey · 2017-04-27 02:23

Based on the next two LSBs of data in the FIFO, it does an RFBYTE, RFWORD, or RFLONG, and then sign-extends the result. Of course, this process doesn't yield 8/16/32 bits of data, but 7/14/30 bits. The point is, the data size is data-dependent.

jmg · 2017-04-27 02:31

cgracey wrote: »

Based on the next two LSBs of data in the FIFO, it does an RFBYTE, RFWORD, or RFLONG, and then sign-extends the result. Of course, this process doesn't yield 8/16/32 bits of data, but 7/14/30 bits. The point is, the data size is data-dependent.

Ah, ok, how does that mesh with the suggestions ersmith made here, to make porting the RISC-V and ZPU interpreters easier. ?
http://forums.parallax.com/discussion/comment/1409433/#Comment_1409433

Peter Jakacki · 2017-04-27 02:34

I am kind of following this with interest but I have a simple loop in Tachyon running from hub RAM with the P1 @80Mhz.

2 PIN -1 FOR H L NEXT

Now I know I'm not toggling but I already measure 833kHz. I am hoping that Spin2 could do better than what I am doing already on P1.

cgracey · 2017-04-27 02:55

Peter Jakacki wrote: »
I am kind of following this with interest but I have a simple loop in Tachyon running from hub RAM with the P1 @80Mhz.
2 PIN -1 FOR H L NEXT
Now I know I'm not toggling but I already measure 833kHz. I am hoping that Spin2 could do better than what I am doing already on P1.

I suspect the Spin2 interpreter has more complex variable setup than Tachyon. Also, the Spin program loops for each toggle, so there is a branch.

Peter Jakacki · 2017-04-27 02:59

Yes, but there is a branch with NEXT so Tachyon is continually reading those instructions from hub RAM between FOR NEXT, so 3 instructions including NEXT per loop in fact. -1 FOR is a simple way of setting up a very long repeat. I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?

cgracey · 2017-04-27 03:05

You loop after H and L. I loop after OUTA++. So, I'm looping every other instruction. That takes time with the FIFO.

I don't know how I could make my bytecodes any leaner. What kinds of things are you thinking about?

jmg · 2017-04-27 03:10

Peter Jakacki wrote: »

... I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?

There is room for inline ASM ?

A byte-code is byte sized here, so it's hard to make that leaner and meaner.
In fact, once you fetch and interpret, there is a case for making the bytecode do more (rather than less), in order to have the wheel-spinning less of total CPU time.
I think you effectively did that, with the move to a 16b bytecode ?

Peter Jakacki · 2017-04-27 03:19

jmg wrote: »

Peter Jakacki wrote: »

... I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?

There is room for inline ASM ?
A byte-code is byte sized here, so it's hard to make that leaner and meaner.
In fact, once you fetch and interpret, there is a case for making the bytecode do more (rather than less), in order to have the wheel-spinning less of total CPU time.
I think you effectively did that, with the move to a 16b bytecode ?

I can have my V3 bytecode do the same as what I did here, so that's not the problem. The advantages of wordcode over bytecode in regards to Tachyon are more apparent as the system grows whereas bytecode is definitely more memory efficient when the system is still small.

I can see the cog memory being made available by Chip's compact bytecode is a great advantage too. If Spin1 bytecode interpreter had even just a little bit of cog memory available for user code it would have made a huge difference. Even Tachyon reserves about 20 something longs for special modules too.

Seairth · 2017-04-27 04:36

cgracey wrote: »
I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
%xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
%xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
%xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

Can you show a snippet of the code you're trying to replace/shorten?

cgracey · 2017-04-27 05:28

The 18 bytecodes (3*6) for...

'
' Branches - jmp, jz, jnz
'
bra_b	rfbyte	pa	'	b
bra_w	rfword	pa	'	| w
bra_l	rflong	pa	'	| | l
        test	x   wz	'	| | c d e f	a: branch fwd
        popa	x	'	| | c d e f	b: branch rev
if_nz	ret     	'	| | c d | |	c: test, pop, branch fwd if z
if_z	ret     	'	| | | | e f	d: test, pop, branch rev if z
        add	pb,pa	'	a | c | e |	e: test, pop, branch fwd if nz
        sub	pb,pa	'	| b | d | f	f: test, pop, branch rev if nz
_ret_	rdfast	#0,pb	'	a b c d e f

Would become 3 bytecodes for...

'
' Branches - jmp, jz, jnz
'
bra	rfdata	pa	'	a c e 
        test	x   wz	'	| c e	a: branch
        popa	x	'	| c e
if_nz	ret	        '	| c |	c: test, pop, branch if z
if_z	ret	        '	| | e
        add	pb,pa	'	a c e	e: test, pop, branch if nz
_ret_	rdfast	#0,pb	'	a c e

Much more than shortening code snippets, it reduces the number of bytecodes needed, saving both bytecode table entries and bytecode definitions, allowing more unique bytecodes.

There would be really huge savings in the variable setups of the current interpreter. 54 bytecodes would be reduced down to 18.

jmg · 2017-04-27 06:04

cgracey wrote: »

The 18 bytecodes (3*6) for...
Would become 3 bytecodes for...

More than shorten code snippets, it reduces the number of bytecodes needed for each snippet. There would be really huge savings in the variable setups.

Oh, yes, that sounds significant. It means room to add smarter bytecodes ?
I think that also means other byte-code designs, can more easily support types of BYTE WORD LONG ?
One pain in the trend to PCs is 'type laziness', where languages round up to larger types, because 'hey, everyone has 8GB now, right' ?

cgracey · 2017-04-27 06:08

Right. It would allow conservation of bytecodes without performance reduction.

jmg · 2017-04-27 06:21

cgracey wrote: »

There would be really huge savings in the variable setups of the current interpreter. 54 bytecodes would be reduced down to 18.

How many does that leave spare from the 256 ?

Conga · 2017-04-27 07:48

cgracey wrote: »
I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
%xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
%xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
%xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

If the tag bit or bits were placed starting at MSB, the encoding scheme would be similar to UTF-8.

This is proof that it's quite a general purpose mechanism --- definitely useful outside bytecode engines.

The existence of UTF-8 also suggests that having MSB as the first tag bit would be better,
unless the bytecode really needs the tag to be at the LSB end.

Conga · 2017-04-27 07:59

Of course, UTF-8 encodes unsigned integers (codepoints);
there's no sign extension.

The need for sign extension might change the perspective
so having the tag bit(s) at the LSB end might be more "natural":
fewer surprises when interpreting the values, the sign bit would be in same place.

Conga · 2017-04-27 08:12

cgracey wrote: »
I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
%xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
%xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
%xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

Separate from the tag bits position (LSB or MSB end),
I would suggest a different mnemonic for the instruction:

RFVARS

This kind of encoding was named "VARint" (VARiable-length INTeger) in several designs, SQLite being only one example.

The trailing "S" stands for "Signed" or "Sign-extended".

"VARS" looks like a plural noun so it might be confusing;
"SVAR" could fix this.

RFSVAR then?

Conga · 2017-04-27 08:18

An "RFUVAR" instruction ('U' = "Unsigned") without sign extension also makes sense.

But the need for the signed variant is obvious in the context of bytecode engines, so "RFSVAR" should be the winner if only one fits.

Heater. · 2017-04-27 08:22

Good grief. Somewhere around here I suggested that some fantasy CPU could suck up UTF-8 as it's native machine code. A great way to keep short instructions short and longer ones, longer as needed.

I thought I was joking.

I'm not sure the signed/unsigned thing is an issue. UTF-8 is all about encoding bits. What they actually mean is another question.

Fast Bytecode Interpreter

Comments