Fast Bytecode Interpreter

lonesock · 2018-02-05 18:05

cgracey wrote: »
lonesock wrote: »
cgracey wrote: »

Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

In P1 code:
              sub       lim2, val wz
        if_nz sub       lim1, val wz
              xor       lim1, lim2
              shl       lim1, #1 wc
The value is in range "if_c_or_z". Unfortunately it changes the values of your limit variables, and I'm guessing that's not allowed.

Jonathan
Jonathan, this is hard to think about. I will make the effort to ingest this soon.

My fault, I should have stated the logic behind it.

I convert the limits to deltas from the tested value, and if the tested value is in the middle of the 2 limits, the delta values should have opposite signs (XOR of the highest (sign) bits should be 1). i.e. (lim1 - val) xor (lim2 - val) should be negative.

There is also the test case where the limit equals the test value, so I use the Z flag to check for that case. Otherwise you can get a 0 delta, and a positive delta, which does not trip the "opposing signs" logic.

thanks,
Jonathan
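
The delta-sign logic described above can be checked with a small model. The following Python sketch simulates the four P1 instructions on 32-bit values. The function name `in_range_p1` is illustrative, and the sketch assumes the two subtractions do not overflow the signed 32-bit range.

```python
MASK32 = 0xFFFFFFFF

def in_range_p1(val, lim1, lim2):
    """Model of the four-instruction P1 sequence; True means "if_c_or_z"."""
    # sub lim2, val wz
    lim2 = (lim2 - val) & MASK32
    z = lim2 == 0
    # if_nz sub lim1, val wz  (skipped when val == lim2, so Z stays set)
    if not z:
        lim1 = (lim1 - val) & MASK32
        z = lim1 == 0
    else:
        lim1 &= MASK32
    # xor lim1, lim2  (no wz, so Z is preserved)
    lim1 ^= lim2
    # shl lim1, #1 wc  (C receives the original bit 31, the sign of the XOR)
    c = bool(lim1 >> 31)
    return c or z
```

Because only the signs of the two deltas are compared, the limits can be given in either order: `in_range_p1(5, 1, 10)` and `in_range_p1(5, 10, 1)` both report the value as in range.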

cgracey · 2018-02-06 06:10

Okay. That makes sense. I would still have to study it to fully understand it.

twm47099 · 2018-02-09 02:51

I am not sure if this is the correct thread to ask this.

With the Prop 1 I use WAITPEQ and WAITPNE to put a cog to sleep until one or more pins (as defined in the WAITP__ statement) are set to 1 or 0 by another cog or an external sensor. That wakes the cog, which then decodes the pattern of pins to branch to different routines.

How would something like that be done in Spin2?

Thanks
Tom

cgracey · 2018-02-09 03:09

You would do a SETPAT instruction with D=mask, S=value, Z=1 for equal or 0 for not equal, and C = 0 for INA or 1 for INB. Then, you could do WAITPAT, POLLPAT WC, or JPAT/JNPAT to wait for the event, poll it, or jump by it.
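
As a rough behavioral sketch of the condition that SETPAT sets up, the Python function below evaluates the pattern test from the description above. The function name `pat_event` and its argument names are illustrative. The sketch assumes the comparison is `(port & mask) == value` (or `!=`), and it does not model how the hardware latches or clears the event.

```python
def pat_event(ina, inb, mask, value, z, c):
    """True when the selected port satisfies the SETPAT-style condition."""
    port = inb if c else ina      # C = 0 selects INA, C = 1 selects INB
    masked = port & mask          # D = mask: only these pins take part
    if z:                         # Z = 1: wait for masked pins to equal S
        return masked == value
    return masked != value        # Z = 0: wait for masked pins to differ from S
```

For example, `pat_event(0b1010, 0, 0b1110, 0b1010, z=1, c=0)` is true because the three masked pins of INA match the value.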

TonyB_ · 2018-02-16 23:49

XBYTE issue:
Chip, could you please look at this matter again?

cgracey wrote: »
I ran into a limitation while working on Spin2 with XBYTE...

XBYTE uses SETQ to establish the bytecode base and LSB/MSB indexing, usually at the start of bytecode execution:
_RET_   SETQ    #$000           'set bytecode base to $000 in LUT, stack = $1F8, do XBYTE
After that, _RET_+Instruction's are done at the end of each code snippet to execute the next bytecode.

What if you have an alternate set of bytecodes, though, that you want to execute through a portal bytecode? You could do a new _RET_+SETQ, but then you permanently change the base and MSB/LSB indexing for subsequent _RET_+instruction's.

There needs to be a way to do a temporary base-and-MSB/LSB-indexing change that only works on the next bytecode, so that alternate sets of bytecodes could be used to run lower-usage code, without any need for a restorative _RET_+SETQ instruction later.

SETQ2 will now serve this purpose:
_RET_   SETQ2   #$100           'set bytecode base to $100 in LUT for next bytecode only, stack = $1F8, do XBYTE
This will add 10 flops to each cog, but make alternate bytecode sets easy to engage without the need for post clean-up.

Firstly, this is a welcome improvement for switching between alternate bytecodes.

Secondly, having enough room in the LUT for two full sets of 256 bytecodes instead of only one is great, but the problem (and one I have right now) is that there is no space left in the LUT for any code. Although 512 jump addresses and skip patterns can fit in the LUT, there is only two-thirds of the cog code space to deal with twice as many bytecodes.

One solution is to have two-stage decoding for the alternate set. Let's suppose these alternate bytecodes are in the form %yyyyyzzz, where yyyyy and zzz are independent bit groups. A lot of space could be saved in half the LUT by decoding only yyyyy with XBYTE, which leaves zzz to be decoded later. Chip's example above has the stack at $1F8 for both main and alternate sets, but in this case they would be at $1F8 and $1FB:

Stack  Bits  SETQ        LUT base    LUT index  LUT EXECF   Bytecode
                         address                address     set

$1F8   8     %Axxxxxxxx  %A00000000  yyyyyzzz   %Ayyyyyzzz  main
$1FB   5     %AAAAxxxx1  %AAAA00000  yyyyy      %AAAAyyyyy  alternate

Only 32 entries are now required in the LUT for the alternate set, a saving of 224 longs that could make the difference between an interpreter fitting in one cog or not.
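
The address arithmetic in the table can be sketched in Python. The helper `lut_address` and its parameters are illustrative names, not part of any P2 tool. The sketch assumes the LUT address is simply the base bits of the SETQ value combined with the selected bytecode bits, as the table's last columns suggest.

```python
def lut_address(setq, bytecode, index_bits, msb_index):
    """Combine a SETQ-style base with index_bits bits taken from the bytecode."""
    index_mask = (1 << index_bits) - 1
    base = setq & 0x1FF & ~index_mask          # keep only the base (A) bits
    if msb_index:
        # take the top bits of the bytecode, e.g. yyyyy of %yyyyyzzz
        index = (bytecode >> (8 - index_bits)) & index_mask
    else:
        # take the low bits of the bytecode
        index = bytecode & index_mask
    return base | index

bytecode = 0b01011_101                          # yyyyy = %01011, zzz = %101
main = lut_address(0b1_0000_0000, bytecode, 8, False)   # %A yyyyyzzz
alt = lut_address(0b1111_0000_1, bytecode, 5, True)     # %AAAA yyyyy
saved = 2**8 - 2**5                             # LUT longs saved by the 5-bit set
```

With these inputs the main set reads LUT entry `$15D`, the alternate set reads `$1EB`, and `saved` is the 224 longs mentioned above.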

Q1.
Will it be as easy to switch between these two sets when the stacks are different?
It would be wrong to assume that the LUT base will always be the same for both.

Q2.
Is the following the quickest way to decode low-order bits zzz into skip patterns?


' bytecode = %yyyyyzzz at $1f6

zzz_decode	shl	bytecode,#1		' bytecode = yyyyzzz0
		setnib	skip_ins,bytecode,#2	' change nibble #2 of skipf opcode
skip_ins	skipf	skip_0			' skipf D[2:0] = zzz	
		...

' skip patterns for zzz, skip_0 aligned to an 8-long boundary

skip_0		long	...
skip_1		long	...
skip_2		long	...
skip_3		long	...
skip_4		long	...
skip_5		long	...
skip_6		long	...
skip_7		long	...
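
To see why the shift by one is needed before SETNIB, the bit manipulation can be modelled in Python. This sketch assumes the D field of the instruction occupies bits 17..9, so nibble #2 (bits 11..8) covers D[2:0] plus one bit below it. The address `SKIP_0 = 0x040` and the helper names are hypothetical, and only the D field of the instruction long is modelled.

```python
def setnib(word, value, n):
    """Replace nibble n (bits 4n+3..4n) of a 32-bit word with the low nibble of value."""
    shift = 4 * n
    cleared = word & ~(0xF << shift) & 0xFFFFFFFF
    return cleared | ((value & 0xF) << shift)

def d_field(instr):
    """Assumed D field position: bits 17..9 of the instruction long."""
    return (instr >> 9) & 0x1FF

SKIP_0 = 0x040                      # hypothetical address, a multiple of 8
skip_ins = SKIP_0 << 9              # only the D field is modelled, other bits are 0
bytecode = 0b01011_101              # zzz = %101
bytecode <<= 1                      # shl bytecode,#1 -> low nibble is %zzz0
skip_ins = setnib(skip_ins, bytecode, 2)   # setnib skip_ins,bytecode,#2
selected = d_field(skip_ins)        # address of the chosen skip pattern
```

Here `selected` comes out as `SKIP_0 + 5`, matching zzz = %101. Under the same assumption, the zero shifted into the nibble's lowest bit lands in bit 8 of the instruction, just below the D field.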

EDIT:
Deleted two questions.

TonyB_ · 2018-02-17 00:09

.

TonyB_ · 2018-02-17 20:09

The main issue here is that the new SETQ2 can change the LUT base address, but it cannot change the LUT index to a different number of bytecode bits because that is set by the return address.

I think I have a solution. Instead of SETQ2 D simply replacing the original SETQ D, two operations occur for the next bytecode only:

1. LUT base address = {SETQ[8:1],0} XOR {SETQ2[8:1],0}

2. SETQ2[8:1] select LUT index width, SETQ2[0] selects right/left aligned, as follows:

SETQ2      LUT index      Equivalent to temporary
                          return address on stack of

100000000  bytecode[7:0]  $1F8

x10000000  bytecode[6:0]  $1F9
x10000001  bytecode[7:1]  $1F9

xx1000000  bytecode[5:0]  $1FA
xx1000001  bytecode[7:2]  $1FA

xxx100000  bytecode[4:0]  $1FB
xxx100001  bytecode[7:3]  $1FB

xxxx10000  bytecode[3:0]  $1FC
xxxx10001  bytecode[7:4]  $1FC

xxxxx1000  bytecode[2:0]  $1FD
xxxxx1001  bytecode[7:5]  $1FD

xxxxxx100  bytecode[1:0]  $1FE
xxxxxx101  bytecode[7:6]  $1FE

xxxxxxx10  bytecode[0]    $1FF
xxxxxxx11  bytecode[7]    $1FF

SETQ2[1] = 1 is equivalent to $1FF on stack.
SETQ2[2:1] = 10 is equivalent to $1FE on stack, etc.

Thus SETQ2[8:1] are priority encoded to the same 3-bit value as stack[2:0] would be.
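
The proposed decoding can be written out as a short Python model. This is a sketch of the proposal as described in this post only, not of any implemented hardware. The function name `decode_setq2` and its return tuple are illustrative.

```python
def decode_setq2(setq, setq2):
    """Return (lut_base, index_bits, msb_aligned, equivalent_stack_address)."""
    # 1. LUT base address = {SETQ[8:1],0} XOR {SETQ2[8:1],0}
    base = (setq ^ setq2) & 0x1FE
    # 2. The lowest set bit of SETQ2[8:1] is the marker that selects the index width
    for pos in range(1, 9):
        if setq2 & (1 << pos):
            break
    else:
        raise ValueError("SETQ2[8:1] has no marker bit set")
    index_bits = pos                      # marker at bit 5 -> 5-bit index
    msb_aligned = bool(setq2 & 1)         # SETQ2[0] = 1 -> use the top bytecode bits
    stack_equiv = 0x1F8 + (8 - pos)       # marker at bit 8 -> $1F8, at bit 1 -> $1FF
    return base, index_bits, msb_aligned, stack_equiv
```

For instance, `decode_setq2(0b000000000, 0b111100001)` gives a base of `$1E0`, a 5-bit MSB-aligned index, and an equivalent stack address of `$1FB`.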

Example 1

' $1F8 on stack

_RET_   SETQ    #$000		'LUT base = $000, index = bytecode[7:0], do XBYTE
...
_RET_   SETQ2   #$100		'LUT base = $100, index = bytecode[7:0], do XBYTE once

Example 2

' $1F8 on stack

_RET_   SETQ    #%000000000	'LUT base = $000, index = bytecode[7:0], do XBYTE
...
_RET_   SETQ2   #%111100001     'LUT base = $1E0, index = bytecode[7:3], do XBYTE once

Example 3

' $1F8 on stack

_RET_   SETQ    #%011100000	'LUT base = $000, index = bytecode[7:0], do XBYTE
...
_RET_   SETQ2   #%111100001     'LUT base = $100, index = bytecode[7:3], do XBYTE once

Example 4

' $1F8 on stack

_RET_   SETQ    #%000100000	'LUT base = $000, index = bytecode[7:0], do XBYTE
...
_RET_   SETQ2   #%100100001     'LUT base = $100, index = bytecode[7:3], do XBYTE once

TonyB_ · 2018-02-17 20:30

I forgot CZ write enable: SETQ[9] XOR SETQ2[9] might be better than SETQ2[9].

TonyB_ · 2018-02-17 21:11

If the new 10-bit SETQ2 register were cleared before SETQ2 and after the one-time XBYTE, then SETQ[9:0] XOR SETQ2[9:0] = SETQ[9:0] and the XOR logic could apply all the time.

EDIT:
Before the recent debugging change, there were sufficient XBYTE status bits to debug a modified XBYTE as proposed above. I hope it is possible to partially revert the changes to make this possible again.

TonyB_ · 2018-02-21 23:49

No response - that's not a good sign but I'm not giving up yet.

I think the new SETQ2 is not as useful as it could be. Instead of being able to do this:

' $1F8 on stack

_RET_   SETQ    #%000000000	'LUT base = $000, index = bytecode[7:0], do normal XBYTE
...
_RET_   SETQ2   #%111100001     'LUT base = $1E0, index = bytecode[7:3], do special XBYTE once
...
_RET_	XXXX			'do any non-branching instruction, then do normal XBYTE

users would have to do this

' $1F8 on stack

_RET_   SETQ    #%000000000	'LUT base = $000, index = bytecode[7:0], do normal XBYTE
...
	PUSH	#$1FB
_RET_   SETQ2   #%111100001     'LUT base = $1E0, index = bytecode[7:3], do special XBYTE once
...
_RET_	POP	temp		'remove $1FB from stack, then do normal XBYTE

and the whole point of SETQ2 is rather negated.

cgracey · 2018-02-23 19:12

TonyB_,

I've been working on the layout with Treehouse, and now I'm doing it MYSELF. Something new. They needed to redirect their layout people to one critical task, so they quit working on Prop2 for what would be a few weeks. I figured I could do the job, so I fired up our old Tanner EDA tools and now I'm on it. This saves about $8k/week. Next week, I'm taking the layout to OnSemi in Idaho for final DRC and LVS, using their Cadence tools. So, I have been in a black hole of sorts.

I had seen your message, but I could not formulate a response, with a head full of layout. This layout work is very engrossing. It is kind of interesting that you need not know much about electronics to do the job. It's certainly an example of specialization. Yesterday, I finally went to bed after working 35 hours straight. I didn't want any details to spill out of my head before sleeping.

So, everything is still moving forward. I've just not had much time for The Forum. I will address your question as soon as I get this layout stuff out of the way. Thanks for your input and ideas. They've been making things really nice.

TonyB_ · 2018-02-24 00:01

Thanks for the update, Chip.

msrobots · 2018-02-24 02:15

35 hours straight does not sound good.

Cluso99 · 2018-02-24 03:24

Chip,
Please take it easy! You have had your warning, so take heed as you may not be so lucky next time!!!

Rayman · 2018-02-24 03:39

Well, I'm not sure 35 hours straight is really good...
They say we're the sharpest just after waking up.
I know that it is how it is for me.
Sure, you can do more work without sleep, but I don't think for most of us that it is quality work.

Still, people are different. There was that great movie "Real Genius" with a girl who was a genius and didn't sleep...

Rayman · 2018-02-24 03:40

Wow, I can post again! Thought I was banned for a while

Anybody see anything different now?

Tubular · 2018-02-24 04:14

Thanks for doing the "hard yards" for us, Chip.

It's interesting how old tools still come out and make themselves useful. I used a 3 1/2" disk head cleaner during the week; it was the easiest way to move data from an old machine.

cgracey · 2018-02-25 05:36

After another 30-hour push, things seem to have finally come together. I'll be at OnSemi this week doing final design-rule checks and layout-versus-schematic checks.

I like doing layout work. For the amount of money we spent on this layout and the long time that it took, I'm absolutely doing this myself next time. I can see how Beau used to be so into this. There are so many details to juggle, you can get completely lost in it. I wasn't even hungry.

Tubular · 2018-02-25 05:50

Must be fun, if you're already thinking about next time

Is it possible to do the layout work for the other 2/4 cog variants while this is relatively fresh in your mind? Or must the synthesis work be completed first?

cgracey · 2018-02-25 05:53

No new layout is needed for those variants. It's the same pads, just in a different order, and fewer of them. The layout work is 99.9% in the pads, themselves.

HydraHacker · 2018-02-26 02:39

cgracey wrote: »
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

lonesock wrote: »
In P1 code:
              sub       lim2, val wz
        if_nz sub       lim1, val wz
              xor       lim1, lim2
              shl       lim1, #1 wc
The value is in range "if_c_or_z". Unfortunately it changes the values of your limit variables, and I'm guessing that's not allowed.

Jonathan

HydraHacker's code below:

InRange       cmps n,r1  wz,wc     ' checks for n>=r1   
IF_NZ_AND_NC  cmps r2,n  wz,wc     ' if n>=r1 then check if n<=r2       
IF_NC         jmp #InRange_ret     ' if n>=r1 and n<=r2 then exit    
              cmps r1,n  wz,wc     ' checks for n<=r1   
IF_NZ_AND_NC  cmps n,r2  wz,wc     ' if n<=r1 then check if n>=r2   
InRange_ret   ret

' defines what flags mean upon subroutine exit

' z=1&c=0 means that n is equal to r1 or n is equal to r2
' z=0&c=0 means that n is between r1 and r2, but does not include r1 or r2
' z=0&c=1 means that n is not in range!

Congratulations, Lonesock: your algorithm to tell if a value is between two numbers is one instruction shorter than mine. The 'ret' instruction in my code is not needed if you inline the code, but a 'jmp' instruction is still needed either to run the in-range code or to branch past it.

HydraHacker
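
The flag outcomes listed above can be traced with a small Python model of the CMPS-based routine. The function name `in_range_cmps` is illustrative. The sketch assumes P1 CMPS semantics with wz,wc, namely Z = (D == S) and C = (D < S) as a signed comparison, and it uses Python integers directly as signed values.

```python
def in_range_cmps(n, r1, r2):
    """Model of the CMPS-based routine; returns the exit flags (z, c)."""
    def cmps(d, s):
        # Assumed P1 CMPS D,S wz,wc: Z = (D == S), C = (D < S), signed
        return d == s, d < s

    z, c = cmps(n, r1)            # InRange: is n >= r1 ?
    if not z and not c:           # IF_NZ_AND_NC
        z, c = cmps(r2, n)        #   is n <= r2 ?
    if not c:                     # IF_NC jmp #InRange_ret
        return z, c
    z, c = cmps(r1, n)            # is n <= r1 ?
    if not z and not c:           # IF_NZ_AND_NC
        z, c = cmps(n, r2)        #   is n >= r2 ?
    return z, c
```

In this model `c` is False exactly when `n` lies within the limits, in either order, and `z` is additionally True when `n` equals one of them.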

cgracey · 2018-03-30 11:40

TonyB_ had an idea for increasing the flexibility of the bytecode interpreter. We wound up exchanging some PMs and came up with a really nice way to just use SETQ/SETQ2 to define all modes in just 9 bits of data, so immediate values will always suffice.

Now, instead of the return address on the stack needing to be from $1F8..$1FF, it just needs to be $1FF and SETQ(2) handles the base address AND sizing. We got rid of the 3, 2, 1-bit modes, since they weren't very useful and used up a lot of space:

// setq/setq2 value on ret/ret_auto to $001FF sets xbyte mode:
//
//	ABBBB00xF	= 8-bit opcode, 0-bit data, upper sets of 16 can be collapsed into individual lut addresses
//	AAxx001MF	= 7-bit opcode, 1-bit data
//	AAAx101MF	= 6-bit opcode, 2-bit data
//	AAAAx10MF	= 5-bit opcode, 3-bit data
//	AAAAA11MF	= 4-bit opcode, 4-bit data
//
// A = base address in lut, msb-justified
// B = collapse sets of 16 starting at ABBBB0000 if BBBB > 0
// M = use msb's of bytecode for index, not lsb's
// F = set c,z to index[1:0]

For the 8-bit mode, you can now "collapse" sets of 16 identical LUT entries into just one LUT entry. I had four such cases in the Spin2 interpreter that read constants and handled local variables, and had identical LUT values. The code snippets they branched to would use the lower nibble of the bytecode deposited into PA. Now instead of 64 LUT entries spanning addresses $0C0..$0FF, there are just four entries from $0C0..$0C3. This saves 60 LUT locations and frees up space for alternate bytecodes which can reside at $0C4..$0FF, accessed after a 'SETQ2 #$000' which turns off the "collapse" mode for the next bytecode.
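
The mode table above can be read as a decoder, sketched here in Python. The function names and return tuples are illustrative. The bit tests follow the patterns as listed in this post, and the collapse arithmetic in `lut_address_8bit` is inferred from the $0C0..$0C3 example rather than stated explicitly, so treat it as an assumption.

```python
def decode_xbyte_mode(q):
    """Split a 9-bit SETQ/SETQ2 value into (opcode_bits, base, collapse, msb, flags)."""
    q &= 0x1FF
    flags = bool(q & 1)                   # F: set C,Z to index[1:0]
    msb = bool(q & 2)                     # M: index from the bytecode's MSBs
    sel = (q >> 2) & 0b11
    if sel == 0b00:                       # ABBBB00xF: 8-bit opcode, M is don't-care
        return 8, q & 0x100, (q >> 4) & 0xF, False, flags
    if sel == 0b01:                       # AAxx001MF or AAAx101MF
        if q & 0x10:
            return 6, q & 0x1C0, 0, msb, flags
        return 7, q & 0x180, 0, msb, flags
    if sel == 0b10:                       # AAAAx10MF
        return 5, q & 0x1E0, 0, msb, flags
    return 4, q & 0x1F0, 0, msb, flags    # AAAAA11MF

def lut_address_8bit(q, bytecode):
    """8-bit mode LUT address, with collapsing inferred from the $0C0..$0C3 example."""
    bits, base, collapse, _, _ = decode_xbyte_mode(q)
    if bits != 8:
        raise ValueError("not an 8-bit mode value")
    start = collapse << 4                 # first bytecode of the collapsed region
    if collapse and bytecode >= start:
        # each set of 16 bytecodes shares one LUT entry
        return base | (start + (bytecode >> 4) - collapse)
    return base | bytecode
```

With `q = $0C0`, bytecodes `$C0..$FF` map onto LUT entries `$0C0..$0C3`, while bytecodes below `$C0` index the LUT directly.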

I'm compiling new FPGA images now for a v32 release. This is going to be the last one, barring a potential bug fix, before On Semi does the final synthesis.

I tested out the analog test chip very thoroughly and it looks really good. Just one problem needed fixing with the new PLL, but even it worked pretty well.

Hopefully, some of you will be ready to run your code through some tests with the final version this weekend.

Roy Eltham · 2018-03-30 14:16

It's Easter this weekend, and I'm out of town at Mom's. I expect others will be less available this weekend also.

jmg · 2018-03-30 19:40

cgracey wrote: »

I'm compiling new FPGA images now for a v32 release. This is going to be the last one, barring a potential bug fix, before On Semi does the final synthesis.

Fingers crossed...

cgracey wrote: »

I tested out the analog test chip very thoroughly and it looks really good. Just one problem needed fixing with the new PLL, but even it worked pretty well.

Do the various current source/sink modes and high-speed DACs all work 'as expected'?
Did you do any Icc/Vcc plots of the pin buffers to find the current adders for linear pin operation?

cgracey · 2018-03-31 21:19

The I/O pad seems to work perfectly. Voltages are within 1% of targets. Resistances and currents are within 2%-3%.

The current modes are constant inside 0.7V of the power and ground rails.

Only the PLL needed some attention, which it got.

Seairth · 2018-03-31 21:48

WTG!

cgracey · 2018-09-12 12:03

I've got everything into the Spin2 bytecode interpreter that I think is needed, except the bytecode snippet for starting a new Spin2 process. Once I get everything else working, it will become more apparent how to handle that.

I need to go through this and tidy it up a little bit, but I don't feel like anything needs changing. After that, I'll start working on the compiler to generate the bytecode from source code.

David Betz · 2018-09-12 12:12

Cool! Mind if I look through your code to understand better how to create a fast byte code interpreter for the P2? I'm working on my own for P1 that I'd like to run on P2 as well.

cgracey · 2018-09-12 12:16

That's what it's there for.

Cluso99 · 2018-09-12 12:16

WTG Chip
