Fast Bytecode Interpreter

evanh · 2017-09-02 03:02

Dave,
Intriguing, so the FIFO stops being used when the code path enters CogExec, the FIFO is then being used for a buffered block write, then automatically refills with HubExec code when the RET is hit.

There's probably a FIFO flush delay at this point but "That's so easy!"

Now for the block copy version ... it'll need to chop it up into chunks like Chip has.

cgracey · 2017-09-02 03:16

One nice thing about SETQ+RDLONG/WRLONG is that the FIFO is not used. Data is streamed directly between the cog and the hub at one clock per long. FIFO status and mode are unchanged.

evanh · 2017-09-02 03:41

So many ways in the Prop2.

I've been using Dave's comport code for my testing. It's simple bit-bashing code (presumably been like that since Prop2Hot days) but it saved me going through the trial and error effort myself. Had to hack it around a little to remove the GCC parts. I wanted all the printf() code too but that was just too hard to extract. So I settled for making my own itoa() that uses the putch() primitive.

cgracey · 2017-09-02 04:53

Dave Hein wrote: »
Chip, that looks great. However, I wonder if you could get a speed improvement by separating xxxxFILL from xxxxMOVE. Also, because the P2 does unaligned reads and writes the BYTE and WORD operations could be almost as fast as the LONG operations. It just requires performing 0 to 3 BYTE accesses at the beginning, and then doing LONG accesses after that. This is the code I used for memset() in p2gcc, which is basically the same as BYTEFILL() in Spin.
' This code resides in cog RAM
__LONGFILL
        wrfast  #0, r0
        rep     #1, r2
        wflong  r1
        ret
// This code resides in hub RAM
// ptr, val and num are in r0, r1 and r2
void memset(void *ptr, int val, int num)
{
    __asm__("       mov     r3, #3");
    __asm__("       and     r3, r2 wz");
    __asm__(" if_z  jmp     #label1");
    __asm__("label2 wrbyte  r1, r0");
    __asm__("       add     r0, #1");
    __asm__("       djnz    r3, #label2");
    __asm__("label1 setnib  r1, r1, #1");
    __asm__("       setbyte r1, r1, #1");
    __asm__("       setword r1, r1, #1");
    __asm__("       shr     r2, #2 wz");
    __asm__(" if_nz call    #\\__LONGFILL");
}
I could speed it up a bit by moving all the code to cog RAM, but p2gcc C code currently calls functions that only reside in hub RAM.

Dave, there was a lot of overlap between the move and fill functions. So, I melded them together. I didn't want one full-length SETQ+WRLONG fill, since interrupt latency would have been compromised.

If you don't care about interrupt latency, you can do one SETQ+WRLONG combo, using a constant for the WRLONG's D. It will fill at the rate of 1 long per clock.

Dave Hein · 2017-09-02 12:53

I'll have to change it to use SETQ+WRLONG. Since it doesn't use the FIFO I should be able to do this from hub RAM and I won't need to use cog RAM. I guess if I need to worry about interrupt latency in the future I could break the SETQ+WRLONG into to smaller chunks.

cgracey · 2017-09-07 04:10

Work on the interpreter is going slowly, but almost done. It is really challenging for me because there are so many subtle things that can be done to create optimal code, that it's enough to drive one nuts.

We almost need some kind of compiler that can optimize variable placement and coordinate it with SETQ+RDLONG activity to get the big speed improvements possible.

I've been thinking for years now that it would be really neat to have a tool where you can define some kind of outcome that you want, based on inputs, and the tool, using random instruction selection and simulation, comes up with really efficient solutions that would have been too difficult for a human to produce in as timely a way.

jmg · 2017-09-07 05:20

cgracey wrote: »

Work on the interpreter is going slowly, but almost done. It is really challenging for me because there are so many subtle things that can be done to create optimal code, that it's enough to drive one nuts.

We almost need some kind of compiler that can optimize variable placement and coordinate it with SETQ+RDLONG activity to get the big speed improvements possible.

I've been thinking for years now that it would be really neat to have a tool where you can define some kind of outcome that you want, based on inputs, and the tool, using random instruction selection and simulation, comes up with really efficient solutions that would have been too difficult for a human to produce in as timely a way.

So you mean sort of like a Place & Route in FPGA, only for Opcodes... ?
Could be interesting to try, but likely to quickly become exponentially slower, as the task increases.

Easier could be more-automated tools for your masked skip opcodes.
ie fairly simple ones that re-create masks as lines are added and removed.

Jeff Haas · 2017-09-07 06:51

Chip, that's interesting...a Monte Carlo approach to coding.

https://en.m.wikipedia.org/wiki/Monte_Carlo_method

cgracey · 2017-09-07 14:26

Yes, it seems almost like an FPGA compilation problem, but with a larger-grain and a more-random set of building blocks that cover a different set of functions. It's like chess, whereas FPGA's are more like Legos.

Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

cgracey · 2017-09-07 15:33

Jeff Haas wrote: »

Chip, that's interesting...a Monte Carlo approach to coding.

https://en.m.wikipedia.org/wiki/Monte_Carlo_method

That didn't occur to me, at first, but that would be the best way to do it. You could eliminate bad combos right away with that approach. A full parametric sweep would be impossible.

cgracey · 2017-09-07 19:58

Maybe the simplest way to make such a thing is to give the tool some code sequence that works, and have it try to find a shorter/faster sequence. That way, there's less abstraction needed on defining the goal. It just simulates the model while it tries to find a better one.

thej · 2017-09-07 20:23

cgracey wrote: »

Maybe the simplest way to make such a thing is to give the tool some code sequence that works, and have it try to find a shorter/faster sequence. That way, there's less abstraction needed on defining the goal. It just simulates the model while it tries to find a better one.

Sounds like your moving toward genetic algorithms :-)

cgracey · 2017-09-07 20:41

The fastest way to build this and the fastest way to run it, is on the prop2, itself. No need for a simulator, then:

1) generate new random input values
2) run 'model' routine with generated random input values, capture outputs
4) run 'test' routine with generated random input values, capture outputs
5) compare outputs and iterate 'test' routine if mismatch
6) repeat until n cycles with no mismatch

msrobots · 2017-09-07 22:43

isn't that exactly what neuronal networks do?

provide them with input-data and a goal output to train them?

nothing new here

Mike

thej · 2017-09-08 04:51

msrobots wrote: »

isn't that exactly what neuronal networks do?
Mike

No, thats exactly what Genetic Algorithms do.

"The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions."

HydraHacker · 2017-09-16 22:58

cgracey wrote: »

Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

Hi Chip,
I would like to show you a solution that I came up with to your above problem. I have tested it pretty thoroughly
using Cluso's debugger, so I hope that it helps.

InRange       cmps n,r1  wc,wz        
IF_NC_AND_NZ  cmps r2,n  wc,wz            
IF_NC         jmp #InRange_ret        
              cmps r1,n  wc,wz        
IF_NC_AND_NZ  cmps n,r2  wc,wz       
InRange_ret   ret

HydraHacker

cgracey · 2017-09-17 05:00

HydraHacker wrote: »
cgracey wrote: »

Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

Hi Chip,
I would like to show you a solution that I came up with to your above problem. I have tested it pretty thoroughly
using Cluso's debugger, so I hope that it helps.
InRange       cmps n,r1  wc,wz        
IF_NC_AND_NZ  cmps r2,n  wc,wz            
IF_NC         jmp #InRange_ret        
              cmps r1,n  wc,wz        
IF_NC_AND_NZ  cmps n,r2  wc,wz       
InRange_ret   ret
HydraHacker

Neat! I was wondering if something like this was possible - checking the range in two directions. It hurt my head so bad that I couldn't get on top of it. Looks like you got there, though.

Meanwhile, here is what I came up with (comments are first, to state the goal of the code):

'
'
' LOOKDOWN(target : ,,,value1..value2,,,)
'
'  if value1 <= value2
'    delta = value2 - value1
'    if value1 <= target <= value2
'      result = - value1 + index + target
'      branch to address
'    else index += delta + 1
'
'  if value1 > value2
'    delta = value1 - value2
'    if value2 <= target <= value1
'      result = value1 + index - target
'      branch to address
'    else index += delta + 1
'
'
'  x		value2		x
'  ptra[-1]	value1		t
'  ptra[-2]	index		u
'  ptra[-3]	target		v
'  ptra[-4]	address		w
'
'
lookdnr		setq	#3		'pop value1+index+target+address into t+u+v+w
		rdlong	w,--ptra[4]

		cmps	x,t	wc	'z = (value1 > value2)
		modcz	0,_c	wz

  if_nz		cmps	v,t	wc	'if nz, nc = (value1 <= target <= value2)
  if_nz_and_nc	cmps	x,v	wc

  if_z		cmps	v,x	wc	'if z, nc = (value2 <= target <= value1)
  if_z_and_nc	cmps	t,v	wc

  if_nc		negnz	x,t		'if target within range,
  if_nc		add	x,u		'..get result on stack top
  if_nc		sumz	x,v
  if_nc		jmp	#casex		'..and branch

  if_nz		sub	x,t		'else, get new index on stack top
  if_z		subr	x,t
		addx	x,u		'(c=1, results in +1)

  _ret_		add	ptra,#2*4	'..and unpop address+target

This code gets the direction of the compare into Z, so that it can still be used after C indicates if we are in range, or not.

HydraHacker · 2017-09-17 16:36

cgracey wrote: »
HydraHacker wrote: »
cgracey wrote: »

Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

Hi Chip,
I would like to show you a solution that I came up with to your above problem. I have tested it pretty thoroughly
using Cluso's debugger, so I hope that it helps.
InRange       cmps n,r1  wc,wz        
IF_NC_AND_NZ  cmps r2,n  wc,wz            
IF_NC         jmp #InRange_ret        
              cmps r1,n  wc,wz        
IF_NC_AND_NZ  cmps n,r2  wc,wz       
InRange_ret   ret
HydraHacker
Neat! I was wondering if something like this was possible - checking the range in two directions. It hurt my head so bad that I couldn't get on top of it. Looks like you got there, though.

Meanwhile, here is what I came up with (comments are first, to state the goal of the code):
'
'
' LOOKDOWN(target : ,,,value1..value2,,,)
'
'  if value1 <= value2
'    delta = value2 - value1
'    if value1 <= target <= value2
'      result = - value1 + index + target
'      branch to address
'    else index += delta + 1
'
'  if value1 > value2
'    delta = value1 - value2
'    if value2 <= target <= value1
'      result = value1 + index - target
'      branch to address
'    else index += delta + 1
'
'
'  x		value2		x
'  ptra[-1]	value1		t
'  ptra[-2]	index		u
'  ptra[-3]	target		v
'  ptra[-4]	address		w
'
'
lookdnr		setq	#3		'pop value1+index+target+address into t+u+v+w
		rdlong	w,--ptra[4]

		cmps	x,t	wc	'z = (value1 > value2)
		modcz	0,_c	wz

  if_nz		cmps	v,t	wc	'if nz, nc = (value1 <= target <= value2)
  if_nz_and_nc	cmps	x,v	wc

  if_z		cmps	v,x	wc	'if z, nc = (value2 <= target <= value1)
  if_z_and_nc	cmps	t,v	wc

  if_nc		negnz	x,t		'if target within range,
  if_nc		add	x,u		'..get result on stack top
  if_nc		sumz	x,v
  if_nc		jmp	#casex		'..and branch

  if_nz		sub	x,t		'else, get new index on stack top
  if_z		subr	x,t
		addx	x,u		'(c=1, results in +1)

  _ret_		add	ptra,#2*4	'..and unpop address+target
This code gets the direction of the compare into Z, so that it can still be used after C indicates if we are in range, or not.

Thanks chip for looking at my code, I really busted my brain! Trying to make a range check that works in two directions. I’ll add some comments soon to my code, I just have to figure out a good way to explain it. Chip your code really looks good and because you write such efficient code sometimes I have trouble understanding what you wrote :-)

HydraHacker

HydraHacker · 2017-10-08 17:40

Hi Chip,

I have documented my code that performs a range check in two directions.

dat

{
if n>=r1                r1<r2 direction
   if n<=r2
      return true       n is in range of r1 and r2
elseif n<=r1            r1>r2 direction
   if n>=r2
      return true       n is in range of r1 and r2
else
   return false         n is not in range of r1 and r2
}
    
InRange       cmps n,r1  wz,wc     ' checks for n>=r1   
IF_NZ_AND_NC  cmps r2,n  wz,wc     ' if n>=r1 then check if n<=r2       
IF_NC         jmp #InRange_ret     ' if n>=r1 and n<=r2 then exit    
              cmps r1,n  wz,wc     ' checks for n<=r1   
IF_NZ_AND_NC  cmps n,r2  wz,wc     ' if n<=r1 then check if n>=r2   
InRange_ret   ret

' defines what flags mean upon subroutine exit

' z=1&c=0 means that n is equal to r1 or n is equal to r2
' z=0&c=0 means that n is between r1 and r2, but does not include r1 or r2
' z=0&c=1 means that n is not in range!

cgracey · 2017-10-08 22:04

Hydrahacker,

Thanks for documenting that. Looks great! Stuff like this is hard to think about.

If that code is used as a subroutine, it could be quickened a little:

    
InRange       cmps n,r1  wz,wc     ' checks for n>=r1   
IF_NZ_AND_NC  cmps r2,n  wz,wc     ' if n>=r1 then check if n<=r2       
IF_NC         ret                  ' if n>=r1 and n<=r2 then exit    

              cmps r1,n  wz,wc     ' checks for n<=r1   
IF_NZ_AND_NC  cmps n,r2  wz,wc     ' if n<=r1 then check if n>=r2   
              ret

cgracey · 2018-02-01 09:37

I ran into a limitation while working on Spin2 with XBYTE...

XBYTE uses SETQ to establish the bytecode base and LSB/MSB indexing, usually at the start of bytecode execution:

_RET_   SETQ    #$000           'set bytecode base to $000 in LUT, stack = $1F8, do XBYTE

After that, _RET_+Instruction's are done at the end of each code snippet to execute the next bytecode.

What if you have an alternate set of bytecodes, though, that you want to execute through a portal bytecode? You could do a new _RET_+SETQ, but then you permanently change the base and MSB/LSB indexing for subsequent _RET_+instruction's.

There needs to be a way to do a temporary base-and-MSB/LSB-indexing change that only works on the next bytecode, so that alternate sets of bytecodes could be used to run lower-usage code, without any need for a restorative _RET_+SETQ instruction later.

SETQ2 will now serve this purpose:

_RET_   SETQ2   #$100           'set bytecode base to $100 in LUT for next bytecode only, stack = $1F8, do XBYTE

This will add 10 flops to each cog, but make alternate bytecode sets easy to engage without the need for post clean-up.

Roy Eltham · 2018-02-01 09:51

Chip,
Very cool, that will make emulation of some CPUs and VMs nicer too!

cgracey · 2018-02-01 09:54

Roy Eltham wrote: »

Chip,
Very cool, that will make emulation of some CPUs and VMs nicer too!

Man! I was just thinking that I wanted you to see this, because you would probably like it.

lonesock · 2018-02-01 18:25

cgracey wrote: »

Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

In P1 code:

              sub       lim2, val wz
        if_nz sub       lim1, val wz
              xor       lim1, lim2
              shl       lim1, #1 wc

The value is in range "if_c_or_z". Unfortunately it changes the values of your limt variables, and I'm guessing that's not allowed.

Jonathan

TonyB_ · 2018-02-02 22:28

Deleted.

Please see this post and later

TonyB_ · 2018-02-03 00:41

.

TonyB_ · 2018-02-03 15:37

.

cgracey · 2018-02-04 09:14

lonesock wrote: »
cgracey wrote: »

Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

In P1 code:
              sub       lim2, val wz
        if_nz sub       lim1, val wz
              xor       lim1, lim2
              shl       lim1, #1 wc
The value is in range "if_c_or_z". Unfortunately it changes the values of your limt variables, and I'm guessing that's not allowed.

Jonathan

Jonathan, this is hard to think about it. I will make the effort to ingest this soon.

cgracey · 2018-02-04 09:15

TonyB_ wrote: »
Further, refined thoughts on two options for a possible new instruction as introduced above. The existing instructions that would be replaced by the new one are as follows:

1. Minimal option
	MOV	temp,bytecode		'preserve bytecode for later
	AND	temp,#andmask		'andmask = %11 {1 of 4} or %111 {1 of 8}
	DECOD	temp			'temp = %0001...%1000 or %00000001...%10000000
	XOR	temp,#xormask		'xormask = %1111 or %11111111
	SHL	temp,#1			'temp = %1110_0...%0111_0 or %11111110_0...%01111111_0
	SKIPF	temp			'temp = skip pattern, unused high bits all zero
2. Ideal option
	MOV	temp,bytecode
	SHR	temp,#shift		'right shift desired group to low bits
	AND	temp,#andmask
	DECOD	temp
	XOR	temp,#xormask
	SHL	temp,#1
	SKIPF	temp
The ideal option assumes the existing shift logic can be used for the right shift, the decode logic afterwards is trivial. The left shift inserts a '0' as the first skip bit, so that one instruction is always executed before the four or eight skippable ones.

As the instruction after SKIPF can only be cancelled if unused, not skipped, the leading '0' avoids 3 of 4 or 7 of 8 options incurring a two-cycle penalty, it keeps the timing consistent and creates space for a preliminary jump instruction. The new skip instruction, SKIPX for want of a better name, could be placed one instruction early if a branch is not needed.

The minimal option would require a right shift beforehand if the lowest bytecode bits are not the desired ones and a D-only instruction could suffice, with an opcode bit selecting 1 of 4 / 1 of 8. The ideal option would need a D,S slot, where S[4:0] could select the right shift and S[8] 1 of 4 / 1 of 8.

As can be seen, SKIPX could replace up to six or seven instructions, a better ratio than some other instructions. It would be a direct binary-to-skip-fast instruction, which is currently missing.

TonyB_,

I'm not readily understanding your idea. I'm going to need to look carefully when I have more time.

Right now, I'm getting to leave in the morning to go to Treehouse in Colorado, so that we can get the layout changes done once and for all, hopefully.

TonyB_ · 2018-02-04 20:10

cgracey wrote: »

TonyB_,

I'm not readily understanding your idea. I'm going to need to look carefully when I have more time.

Right now, I'm getting to leave in the morning to go to Treehouse in Colorado, so that we can get the layout changes done once and for all, hopefully.

After more consideration, what I'm trying to do could be achieved with two or three existing instructions, not six or seven, therefore no real need for a new one.

There is a real issue here, though, related to your recent SETQ2 addition to XBYTE. I hope I can explain it better when you get back from your trip.

Fast Bytecode Interpreter

Comments