Fast Bytecode Interpreter

potatohead · 2017-03-26 02:38

GCC should use a fair number of instructions to start. It can always be improved.

Maxing out hardware features on this chip is basically intended to be done with PASM. The C people will in line and encapsulate just like the SPIN people will.

No reason for delays.

David Betz · 2017-03-26 03:02

Rayman wrote: »

I suppose the downside to this (sorry) is that GCC is unlikely to use any of these new instructions, right?

So, we may be delaying rollout for a small fraction of potential users....

I suppose there could be a version of CMM for P2 that could use these instructions if its VM can't fit in COG/LUT otherwise or to provide more "fcache" space.

Peter Jakacki · 2017-03-26 03:07

cgracey wrote: »

I've got the Prop123_A9 and BeMicro_A9 compiles done. Compiling for the Prop123_A7 now.

I'm considering using one of my CVA9s for a specialized test instrument since it's a one off so I will try out the latest enhancements and see if I can improve my interpreter as well.

cgracey · 2017-04-01 08:58

I just got the unary math operators together for Spin2:

op_unary	popa	x	wz	'pop	a b c d e f g h i j k	a: NOT
	if_nz	not	x,#0		'NOT	a | | | | | | | | | |	b: !
		not	x		'!	a b | | | | | | | | |	c: - 
		neg	x		'-	| | c | | | | | | | |	d: ||
		abs	x		'||	| | | d | | | | | | |	e: >|
		topone	x		'>|	| | | | e | | | | | |	f: |<
		decod	x		'|<	| | | | | f | | | | |	g: ~
		shl	x,#24		'~	| | | | | | g | | | |	h: ~~
		sar	x,#24		'	| | | | | | g | | | |	i: SQRT()
		shl	x,#16		'~~	| | | | | | | h | | |	j: LOG()
		sar	x,#16		'	| | | | | | | h | | |	k: EXP()
		qsqrt	x		'SQRT()	| | | | | | | | i | |
		qlog	x		'LOG()	| | | | | | | | | j |
		qexp	x		'EXP()	| | | | | | | | | | k
		getqx	x		'	| | | | | | | | i j k
	_ret_	pusha	x		'push	a b c d e f g h i j k

That block gets called from the 9-clock interpreter loop via EXECF, the instructions of interest execute, and then it returns to the loop.

All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.

jmg · 2017-04-01 09:20

That's looking nicely compact

I did find some numbers for PLC code, running via a eCLR byte-code engine :

eCLR   Performance data 1000 logical or arithmetic-STL statements 
a 1 Cortex®-M3 72 MHz 
b ARM9 ™ 240 MHz 
c Cortex®-A9 800 MHz 
d PowerPC ™ e500 1.2 GHz 1 
e Intel® 2.8 GHz 
       a     b      c     d     e  
       72M3  240A9  800A9 1.2G  i2.8GHz
BOOL  33 μs  12 μs  1 μs  1 μs 0.35 μs 
BYTE  33 μs  12 μs  1 μs  1 μs 0.35 μs 
INT   33 μs  18 μs  4 μs  1 μs 0.35 μs 
DINT  33 μs  18 μs  4 μs  8 μs 0.35 μs 
REAL 449 μs 100 μs 40 μs 13 μs 1.8 μs 

1 The indicated system limits and performance values may differ depending on the configuration of the hardware. 
Supported CPUs Altera® Cyclone® V Intel® X86, Intel® Atom ™ Altera® Nios® II PowerPC ™ e300, PowerPC ™ 
e500 ARM7 ™, ARM9 ™, ARM11 ™ SH02, SH03, SH04 Cortex®-M3, Cortex®-M4, Cortex®-R4F Xilinx Zynq®

I think this means the most basic add/and/not operations. like A = B+C or A= D AND E.
The derived sysCLKs here seem suss ?

A forum posting at RaspBerryPi suggests this

10,000 rungs that look like this:
----------][---------------][----------------][----------------][-------------------------( )---------
Time to execute all 10,000 is about 3 mSec.

This one is 300us for 1000, so quite a lot slower than the other claims.
Assuming (say) 800MHz SysCLK for Pi, that's ~ 240 SysCLK for R=A AND B AND C AND D type operation.
That sounds more realistic ?

You should be able to derive a time for R=A AND B AND C AND D on P2, in SysCLKs, via your byte code ?

cgracey · 2017-04-01 09:50

jmg wrote: »
That's looking nicely compact

I did find some numbers for PLC code, running via a eCLR byte-code engine :
eCLR   Performance data 1000 logical or arithmetic-STL statements 
a 1 Cortex®-M3 72 MHz 
b ARM9 ™ 240 MHz 
c Cortex®-A9 800 MHz 
d PowerPC ™ e500 1.2 GHz 1 
e Intel® 2.8 GHz 
       a     b      c     d     e  
       72M3  240A9  800A9 1.2G  i2.8GHz
BOOL  33 μs  12 μs  1 μs  1 μs 0.35 μs 
BYTE  33 μs  12 μs  1 μs  1 μs 0.35 μs 
INT   33 μs  18 μs  4 μs  1 μs 0.35 μs 
DINT  33 μs  18 μs  4 μs  8 μs 0.35 μs 
REAL 449 μs 100 μs 40 μs 13 μs 1.8 μs 

1 The indicated system limits and performance values may differ depending on the configuration of the hardware. 
Supported CPUs Altera® Cyclone® V Intel® X86, Intel® Atom ™ Altera® Nios® II PowerPC ™ e300, PowerPC ™ 
e500 ARM7 ™, ARM9 ™, ARM11 ™ SH02, SH03, SH04 Cortex®-M3, Cortex®-M4, Cortex®-R4F Xilinx Zynq®
I think this means the most basic add/and/not operations. like A = B+C or A= D AND E.
The derived sysCLKs here seem suss ?

A forum posting at RaspBerryPi suggests this
10,000 rungs that look like this:
----------][---------------][----------------][----------------][-------------------------( )---------
Time to execute all 10,000 is about 3 mSec.
This one is 300us for 1000, so quite a lot slower than the other claims.
Assuming (say) 800MHz SysCLK for Pi, that's ~ 240 SysCLK for R=A AND B AND C AND D type operation.
That sounds more realistic ?

You should be able to derive a time for R=A AND B AND C AND D on P2, in SysCLKs, via your byte code ?

I found that .pdf and they are talking about native ASM instructions actually executing. It seems that this eCLR bytecode translates easily into instructions. It also has a 256k runtime minimum. If the Prop2 were to run that same test in the way those other benchmarks were computed it would be about (Fclk/2 * 16 processors) / 1000 operations.

cgracey · 2017-04-01 11:30

These 13 instructions constitute the binary operators for multiply, divide, remainder, scale, and fraction:

op_binary2	setq	#2-1		'pop	a b c d e	a: *
		rdlong	x,ptra[--2] wc	'C=Ys	a b c d e	b: /
		testb	x,#31	    wz	'!Z=Xs	a b c | |	c: //
		abs	x		'	a b c | |	d: **
		abs	y		'	a b c | |	e: */
		qmul	x,y		'	a | | d |
		qdiv	x,y		'	| b c | |
		qfrac	x,y		'	| | | | e
		getqx	x		'	a b | | e
    if_c_eq_z	neg	x		'	a b | | |
		getqy	x		'	| | c d |
    if_nz	neg	x		'	| | c | |
	_ret_	pusha	x		'push	a b c d e

If you add up all the letters, you can see that these routines coded discretely would have taken 37 instructions. Here, EXECF results in 65% smaller code that runs just as fast!

David Betz · 2017-04-01 12:08

cgracey wrote: »
These 13 instructions constitute the binary operators for multiply, divide, remainder, scale, and fraction:
op_binary2	setq	#2-1		'pop	a b c d e	a: *
		rdlong	x,ptra[--2] wc	'C=Ys	a b c d e	b: /
		testb	x,#31	    wz	'!Z=Xs	a b c | |	c: //
		abs	x		'	a b c | |	d: **
		abs	y		'	a b c | |	e: */
		qmul	x,y		'	a | | d |
		qdiv	x,y		'	| b c | |
		qfrac	x,y		'	| | | | e
		getqx	x		'	a b | | e
    if_c_eq_z	neg	x		'	a b | | |
		getqy	x		'	| | c d |
    if_nz	neg	x		'	| | c | |
	_ret_	pusha	x		'push	a b c d e
If you add up all the letters, you can see that these routines coded discretely would have taken 37 instructions. Here, EXECF results in 65% smaller code that runs just as fast!

It certainly does work well for code compression.

Seairth · 2017-04-01 12:08

cgracey wrote: »

All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.

You could also do

op_unary        popa    x       wz      'pop    a b c d e f | | | | |   
        if_nz   not     x,#0            'NOT    a | | | | | | | | | |   a: NOT
                not     x               '!      a b | | | | | | | | |   b: !
                neg     x               '-      | | c | | | | | | | |   c: - 
                abs     x               '||     | | | d | | | | | | |   d: ||
                topone  x               '>|     | | | | e | | | | | |   e: >|
                decod   x               '|<     | | | | | f | | | | |   f: |<
                popa    x       wz      'pop    | | | | | | g h i j k   
        _ret_   pusha   x               'push   a b c d e f | | | | |
                shl     x,#24           '~      | | | | | | g | | | |   g: ~
                sar     x,#24           '       | | | | | | g | | | |   
                shl     x,#16           '~~     | | | | | | | h | | |   h: ~~
                sar     x,#16           '       | | | | | | | h | | |   
                qsqrt   x               'SQRT() | | | | | | | | i | |   i: SQRT()
                qlog    x               'LOG()  | | | | | | | | | j |   j: LOG()
                qexp    x               'EXP()  | | | | | | | | | | k   k: EXP()
                getqx   x               '       | | | | | | | | i j k
        _ret_   pusha   x               'push   | | | | | | g h i j k

An additional popa/pusha have been added at slots 8 and 9. This does add two instructions, but it gets rid of the NOP penalty.

cgracey · 2017-04-01 12:24

Seairth wrote: »

cgracey wrote: »

All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.

You could also do

op_unary        popa    x       wz      'pop    a b c d e f | | | | |   
        if_nz   not     x,#0            'NOT    a | | | | | | | | | |   a: NOT
                not     x               '!      a b | | | | | | | | |   b: !
                neg     x               '-      | | c | | | | | | | |   c: - 
                abs     x               '||     | | | d | | | | | | |   d: ||
                topone  x               '>|     | | | | e | | | | | |   e: >|
                decod   x               '|<     | | | | | f | | | | |   f: |<
                popa    x       wz      'pop    | | | | | | g h i j k   
        _ret_   pusha   x               'push   a b c d e f | | | | |
                shl     x,#24           '~      | | | | | | g | | | |   g: ~
                sar     x,#24           '       | | | | | | g | | | |   
                shl     x,#16           '~~     | | | | | | | h | | |   h: ~~
                sar     x,#16           '       | | | | | | | h | | |   
                qsqrt   x               'SQRT() | | | | | | | | i | |   i: SQRT()
                qlog    x               'LOG()  | | | | | | | | | j |   j: LOG()
                qexp    x               'EXP()  | | | | | | | | | | k   k: EXP()
                getqx   x               '       | | | | | | | | i j k
        _ret_   pusha   x               'push   | | | | | | g h i j k

An additional popa/pusha have been added at slots 8 and 9. This does add two instructions, but it gets rid of the NOP penalty.

Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.

Check out these equality operators. 70% code size reduction:

op_binary0	setq	#2-1		'pop	a b c d e f	a: <
		rdlong	x,ptra[--2]	'	a b c d e f	b: =<,<=
		cmps	x,y	wc,wz	'	a | c d e |	c: ==
		cmps	y,x	wc,wz	'	| b | | | f	d: <>
		muxc	x,_FFFFFFFF	'	a | | | | f	e: =>,>=
		muxnc	x,_FFFFFFFF	'	| b | | e |	f: >
		muxz	x,_FFFFFFFF	'	| | c | | |
		muxnz	x,_FFFFFFFF	'	| | | d | |
	_ret_	pusha	x		'push	a b c d e f

Seairth · 2017-04-01 12:32

cgracey wrote: »

Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.

Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.

cgracey · 2017-04-01 12:40

Seairth wrote: »

cgracey wrote: »

Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.

Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.

I'm thinking that when there's just a few overhead instructions, it might make sense to break things into smaller blocks if it means saving cycles. The only slow thing here is the POPA and PUSHA, but they can't easily be gotten around. Saving a few cycles could get you into an earlier egg-beater slot, though, since the POPA and PUSHA addresses have a static timing relationship.

Seairth · 2017-04-01 12:51

cgracey wrote: »

Seairth wrote: »

cgracey wrote: »

Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.

Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.

I'm thinking that when there's just a few overhead instructions, it might make sense to break things into smaller blocks if it means saving cycles. The only slow thing here is the POPA and PUSHA, but they can't easily be gotten around. Saving a few cycles could get you into an earlier egg-beater slot, though, since the POPA and PUSHA addresses have a static timing relationship.

Oh, right! So actually we shouldn't be getting hung up on the NOP penalty (in this case), since all of those instructions will be processed within a single hub cycle. Even with the NOP, you're still waiting on the pusha.

Seairth · 2017-04-01 12:53

Does anyone else get the feeling that these SKIP blocks are like microcode, but at a macro level?

cgracey · 2017-04-01 12:57

Seairth wrote: »

cgracey wrote: »

Seairth wrote: »

cgracey wrote: »

Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.

Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.

I'm thinking that when there's just a few overhead instructions, it might make sense to break things into smaller blocks if it means saving cycles. The only slow thing here is the POPA and PUSHA, but they can't easily be gotten around. Saving a few cycles could get you into an earlier egg-beater slot, though, since the POPA and PUSHA addresses have a static timing relationship.

Oh, right! So actually we shouldn't be getting hung up on the NOP penalty (in this case), since all of those instructions will be processed within a single hub cycle. Even with the NOP, you're still waiting on the pusha.

I'll have to do some tests to find out what the sweet spot is, but getting within a near egg-beater slot is worth 16 cycles.

potatohead · 2017-04-01 17:55

Yes.

http://visual6502.org/wiki/index.php?title=6507_Decode_ROM

http://visual6502.org/JSSim/

Look at top center. While the 6507 ROM listing I found varies a tiny bit from the bits in the die simulation rendering, the basic idea is you can see the bit patterns in the Visual 6502 are wired to atomic operations, much like the bit patterns are associated with instructions.

The chip does this in parallel. All at the same time.

A P2 does it sequentially in software.

Very microcode like, if you ask me.

Electrodude · 2017-04-01 17:58

cgracey wrote: »
I just got the unary math operators together for Spin2:
op_unary	popa	x	wz	'pop	a b c d e f g h i j k	a: NOT
	if_nz	not	x,#0		'NOT	a | | | | | | | | | |	b: !
		not	x		'!	a b | | | | | | | | |	c: - 
		neg	x		'-	| | c | | | | | | | |	d: ||
		abs	x		'||	| | | d | | | | | | |	e: >|
		topone	x		'>|	| | | | e | | | | | |	f: |<
		decod	x		'|<	| | | | | f | | | | |	g: ~
		shl	x,#24		'~	| | | | | | g | | | |	h: ~~
		sar	x,#24		'	| | | | | | g | | | |	i: SQRT()
		shl	x,#16		'~~	| | | | | | | h | | |	j: LOG()
		sar	x,#16		'	| | | | | | | h | | |	k: EXP()
		qsqrt	x		'SQRT()	| | | | | | | | i | |
		qlog	x		'LOG()	| | | | | | | | | j |
		qexp	x		'EXP()	| | | | | | | | | | k
		getqx	x		'	| | | | | | | | i j k
	_ret_	pusha	x		'push	a b c d e f g h i j k
That block gets called from the 9-clock interpreter loop via EXECF, the instructions of interest execute, and then it returns to the loop.

All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.

The three unary CORDIC operations take significantly longer than the other unary operations. Would eliminating the NOP delay for qsqrt, qlog, and qexp make a difference?

If so, this could be done by moving qsqrt, qlog, and qexp to the top, but leaving the getqx at the bottom. That way, the NOP delay for those operators wouldn't matter, since it'd be waiting for the CORDIC anyway.

op_unary	popa	x	wz	'pop	a b c d e f g h i j k	a: NOT
		qsqrt	x		'SQRT()	| | | | | | | | i | |	b: !
		qlog	x		'LOG()	| | | | | | | | | j |	c: -
		qexp	x		'EXP()	| | | | | | | | | | k	d: ||
	if_nz	not	x,#0		'NOT	a | | | | | | | | | |	e: >|
		not	x		'!	a b | | | | | | | | |	f: |<
		neg	x		'-	| | c | | | | | | | |	g: ~
		abs	x		'||	| | | d | | | | | | |	h: ~~
		topone	x		'>|	| | | | e | | | | | |	i: SQRT()
		decod	x		'|<	| | | | | f | | | | |	j: LOG()
		shl	x,#24		'~	| | | | | | g | | | |	k: EXP()
		sar	x,#24		'	| | | | | | g | | | |
		shl	x,#16		'~~	| | | | | | | h | | |
		sar	x,#16		'	| | | | | | | h | | |
		getqx	x		'	| | | | | | | | i j k
	_ret_	pusha	x		'push	a b c d e f g h i j k

Electrodude · 2017-04-01 18:00

Also, have you considered leaving the top stack element in a register? Tachyon Forth does this. It would eliminate two hub accesses from every stack operation. Unary operations would require no hub access, and binary operations would require only one, to pop the second value on the stack.

cgracey · 2017-04-01 19:58

Electrodude wrote: »

Also, have you considered leaving the top stack element in a register? Tachyon Forth does this. It would eliminate two hub accesses from every stack operation. Unary operations would require no hub access, and binary operations would require only one, to pop the second value on the stack.

That is a great idea! I will do that.

Also, about those unary CORDIC instructions, they actually get skipped over by the PC when they are not needed, so they take no time, at all. That's the cool thing about SKIPF/EXECF. The only time a NOP occurs is when eight contiguous instructions are getting skipped.

jmg · 2017-04-01 20:41

cgracey wrote: »
jmg wrote: »
A forum posting at RaspBerryPi suggests this
10,000 rungs that look like this:
----------][---------------][----------------][----------------][-------------------------( )---------
Time to execute all 10,000 is about 3 mSec.
This one is 300us for 1000, so quite a lot slower than the other claims.
Assuming (say) 800MHz SysCLK for Pi, that's ~ 240 SysCLK for R=A AND B AND C AND D type operation.
That sounds more realistic ?

You should be able to derive a time for R=A AND B AND C AND D on P2, in SysCLKs, via your byte code ?
I found that .pdf and they are talking about native ASM instructions actually executing. It seems that this eCLR bytecode translates easily into instructions. It also has a 256k runtime minimum. If the Prop2 were to run that same test in the way those other benchmarks were computed it would be about (Fclk/2 * 16 processors) / 1000 operations.

The first numbers are more marketing, and they rather fail a simple litmus test of checking the SysCLK ratio,
in some cases, I get under 1 sysclk ?! Something is awry. Not even a JIT compiler is that good...

The second, forum value of 3ms/10k lines seems to be a real measured system, with an equation given, so it should be possible to compare that with PASM and Spin.ByteCodes, for each of BOOL (this can be a pin, or memory bit), BYTE, INT, DINT, REAL ?
Fastest I think is COG u32 vars, which can be 4 lines of PASM for 8 SysCLKS or 100ns @ 80MHz
(Pi is ~ 300ns/line)
u8,u16,u64 will all be somewhat slower, more so if they use HUB vars.
BOOL code on P2 I've not looked at too closely yet.
SpinByte codes will be in the ballpark of 1us/line ?

cgracey · 2017-04-02 07:59

Here are the unary operators, using electrodude's idea about keeping the top level of the stack in a register. This is crazy simple now. Things like !, -, and || will take a total of 13 clocks to completely execute, from bytecode fetch to bytecode fetch:

op_notb		test	x	wz	'NOT
	_ret_	muxz	x,_FFFFFFFF

op_not	_ret_	not	x		'!

op_neg	_ret_	neg	x		'-

op_abs	_ret_	abs	x		'||

op_ecd	_ret_	topone	x		'>|

op_dcd	_ret_	decod	x		'|<

op_sxb		shl	x,#24		'~
	_ret_	sar	x,#24

op_sxw		shl	x,#16		'~~
	_ret_	sar	x,#16

op_sqrt		qsqrt	x		'SQRT()	a	a: SQRT()
op_log		qlog	x		'LOG()	| b	b: LOG()
op_exp		qexp	x		'EXP()	| | c	c: EXP()
	_ret_	getqx	x		'	a b c

cgracey · 2017-04-02 08:48

What would you guys think about making the TRUE result = 1, instead of $FFFFFFFF, for NOT/AND/OR and equality tests?

I think my original thinking was that all ones could be used as an & (bitwise AND) argument. Maybe just 1 would be better, though. Verilog behaves this way and I've come to like it.

EDIT: ...On second thought, this is not a good idea because it would confuse the behavior between ! and NOT. The way it is now is probably best.

cgracey · 2017-04-02 09:51

I got all the unary and binary operators, plus equality operators, rewritten to work by keeping the top of the stack in a register. The code size is 60 longs, same as before. The difference now is that it runs way faster because it does fewer pushes and pops and all the SKIPF-induced NOPs are gone, since I tightened things up a bit.

'
'
' Unary operators
'
op_notb		test	x	wz	'NOT
	_ret_	muxz	x,_FFFFFFFF

op_not	_ret_	not	x		'!

op_neg	_ret_	neg	x		'-

op_abs	_ret_	abs	x		'||

op_ecd	_ret_	topone	x		'>|

op_dcd	_ret_	decod	x		'|<

op_sxb		shl	x,#24		'~
	_ret_	sar	x,#24

op_sxw		shl	x,#16		'~~
	_ret_	sar	x,#16

op_sqrt		qsqrt	x		'	a		a: SQRT()
op_log		qlog	x		'	| b		b: LOG()
op_exp		qexp	x		'	| | c		c: EXP()
	_ret_	getqx	x		'	a b c
'
'
' Binary operators
'
op_logic	popa	y		'pop	a b c d e	a: AND
		test	x	wz	'	a b | | |	b: OR
		muxnz	x,_FFFFFFFF	'	a b | | |	c: &
		test	y	wz	'	a b | | |	d: |
		muxnz	y,_FFFFFFFF	'	a b | | |	e: ^
	_ret_	and	x,y		'	a | c | |
	_ret_	or	x,y		'	  b   d |
	_ret_	xor	x,y		'	        e

op_shift	mov	y,x		'swap	a b c d e f g	a: >>
		popa	x		'pop	a b c d e f g	b: <<
	_ret_	shr	x,y		'	a | | | | | |	c: ~>
	_ret_	shl	x,y		'	  b | | | | |	d: <~
	_ret_	sar	x,y		'	    c | | | |	e: ->
	_ret_	sal	x,y		'	      d | | |	f: <-
	_ret_	ror	x,y		'	        e | |	g: ><
	_ret_	rol	x,y		'	          f |
		rev	x		'	            g
		rol	x,y		'	            g
		rol	x,#1		'	            g
	_ret_	triml	x,y		'	            g

op_addsub	mov	y,x		'swap	a b c d		a: #>
		popa	x		'pop	a b c d		b: <#
	_ret_	mins	x,y		'	a | | |		c: +
	_ret_	maxs	x,y		'	  b | |		d: -
	_ret_	add	x,y		'	    c |
	_ret_	sub	x,y		'	      d

op_muldiv	popa	y	wc	'pop	a b c d e	a: *
		testb	x,#31	wz	'c=ys	a b c | |	b: /
		abs	y		'!z=xs	a b c | |	c: //
		abs	x		'	a b c | |	d: **
		qmul	y,x		'	a | | d |	e: */
		qdiv	y,x		'	| b c | |
		qfrac	y,x		'	| | | | e
		getqx	x		'	a b | | e
    if_c_eq_z	neg	x		'	a b | | |
		getqy	x		'	| | c d |
    if_c	neg	x		'	| | c | |
		ret			'	a b c d e

op_equal	popa	y		'pop	a b c d e f	a: <
		cmps	y,x	wc,wz	'	a | c d e |	b: =<,<=
		cmps	x,y	wc,wz	'	| b | | | f	c: ==
	_ret_	muxc	x,_FFFFFFFF	'	a | | | | f	d: <>
	_ret_	muxnc	x,_FFFFFFFF	'	  b | | e	e: =>,>=
	_ret_	muxz	x,_FFFFFFFF	'	    c |		f: >
	_ret_	muxnz	x,_FFFFFFFF	'	      d

ozpropdev · 2017-04-02 10:58

Nice work Chip!
A big chunk of interpreter in 60 longs (with blazing speed) is very neat.

cgracey · 2017-04-02 12:49

Here are the constants:

'
'
' Constants
'
con_low		pusha	x		'push	a b c d e f g		a: 0
	_ret_	mov	x,#0		'	a | | | | | |		b: 1
	_ret_	mov	x,#1		'	  b | | | | |		c: 2
	_ret_	mov	x,#2		'	    c | | | |		d: 3
	_ret_	mov	x,#3		'	      d | | |		e: 4
	_ret_	mov	x,#4		'	        e | |		f: 5
	_ret_	mov	x,#5		'	          f |		g: 6
	_ret_	mov	x,#6		'	            g

con_high	pusha	x		'push	a b c d e f g		a: 7
	_ret_	mov	x,#7		'	a | | | | | |		b: 8
	_ret_	mov	x,#8		'	  b | | | | |		c: 15
	_ret_	mov	x,#15		'	    c | | | |		d: 16
	_ret_	mov	x,#16		'	      d | | |		e: 31
	_ret_	mov	x,#31		'	        e | |		f: 32
	_ret_	mov	x,#32		'	          f |		g: -1
	_ret_	mov	x,_FFFFFFFF	'	            g

con_data	pusha	x		'push	a b c d e f g h i	a: byte
	_ret_	rfbyte	x		'	a | | | | | | | |	b: !byte
		rfbyte	x		'	  b | | | f g h i	c: word
	_ret_	rfword	x		'	  | c | | | | | |	d: !word
		rfword	x		'	  |   d | | | | |	e: long
	_ret_	rflong	x		'         |   | e | | | |	f: byte + decode
	_ret_	decod	x		'	  |   |   f | | |	g: byte + decode + not
		decod	x		'	  |   |	    g h i	h: byte + decode + decrement
	_ret_	not	x		'	  b   d	    g | |	i: byte + decode + negate
	_ret_	sub	x,#1		'                     h |
	_ret_	neg	x		'	                i

potatohead · 2017-04-02 16:56

Honestly, this is kind of crazy effective.

Not sure if you all looked at the 6502 links I dropped, but this general approach looks one heck of a lot like that chips instruction decode and execute ROM.

Seems doing so much Verilog has done Chip some good.

We have a sequential version of what is a common hardware approach.

I see it now as a truth table version of logic. If one works at encoding the table, it ends up being a set of branches and operators, "the function."

Like XOR is a function, with a little table. This is some function, we don't care what function, just that the truth table for it matches requirements. Without it, one would basically make this same table as a starting point, then work out efficient combinations of branches, jumps, and, nor, XOR, ops to get it all done.

Doing it this way avoids all of that logic, and it's possible to just evaluate the table for all cases, set the bits and go.

Frankly, once people see what happens, this will be easier for a lot of parse and decision type things.

SPIN should have it. No joke.

jmg · 2017-04-02 21:23

cgracey wrote: »

I got all the unary and binary operators, plus equality operators, rewritten to work by keeping the top of the stack in a register. The code size is 60 longs, same as before. The difference now is that it runs way faster because it does fewer pushes and pops and all the SKIPF-induced NOPs are gone, since I tightened things up a bit.

You still need to manage all those masks tho, and store them somewhere...

Your comment syntax could, with small changes, lend itself to parsing to auto-extract the masks by name.

The main issue I think is the right-most tag is not valid name chars, but a function name could be ?

Which brings up the point: is now a good time to deprecate the syntactic-soup that is Spin operators, and introduce a cleaner functionname() approach, that would allow other front-ends on the Byte-code generator ?
Final code size is no different, but source is (far) easier to scan and read, and has far less culture shock coming from other languages.

When the ASM is easier to read than the HLL, it is time to pause and re-think.

potatohead · 2017-04-02 21:36

Let's not do that. Those operators are well understood, useful and brief.

Heater. · 2017-04-02 22:43

Spin's operators should be brought into line with those of C, C++, C#, Java, Javascript.

I see no particular reason they are different. It confuses and catches out those coming from those other languages. I.e everybody.

By all means have some extra exotic operators as well.

jmg · 2017-04-02 23:12

Heater. wrote: »

Spin's operators should be brought into line with those of C, C++, C#, Java, Javascript.

I see no particular reason they are different. It confuses and catches out those coming from those other languages. I.e everybody.

By all means have some extra exotic operators as well.

Yup.
The way other tools manage this, is with a simple mode switch, if you find some that are mutually exclusive.
ie you can compile both forms, with a modicum of effort.

I was looking at Chip's tables, and thinking how to automate that, and there, the cryptic-names bite.
Without those, it is not too hard to Automate.
Some clearer name is needed, for automation, so why not push that up to the source, since it is then in the docs.

Python 3 moves more from the 'language' into function syntax, which has many benefits.

Fast Bytecode Interpreter

Comments