Fast Bytecode Interpreter


Comments

  • cgracey wrote: »
    lonesock wrote: »
    cgracey wrote: »
    Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
    In P1 code:
                  sub       lim2, val wz
            if_nz sub       lim1, val wz
                  xor       lim1, lim2
                  shl       lim1, #1 wc
    
    The value is in range "if_c_or_z". Unfortunately it changes the values of your limit variables, and I'm guessing that's not allowed.

    Jonathan

    Jonathan, this is hard to think about. I will make the effort to ingest this soon.
    My fault, I should have stated the logic behind it.

    I convert the limits to deltas from the tested value, and if the tested value is in the middle of the 2 limits, the delta values should have opposite signs (XOR of the highest (sign) bits should be 1). i.e. (lim1 - val) xor (lim2 - val) should be negative.

    There is also the test case where the limit equals the test value, so I use the Z flag to check for that case. Otherwise you can get a 0 delta, and a positive delta, which does not trip the "opposing signs" logic.

    thanks,
    Jonathan
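
The opposite-signs logic described above can be checked with a small Python model of the four P1 instructions. This is only a sketch: the function name `in_range_p1` and the explicit 32-bit masking are assumptions made for illustration, and the model assumes both differences fit in the signed 32-bit range (widely separated limits could wrap and flip a sign).

```python
MASK32 = 0xFFFFFFFF  # P1 registers are 32 bits wide


def in_range_p1(lim1: int, lim2: int, val: int) -> bool:
    """Hypothetical model of the four-instruction P1 range test.

    Returns True when the final flags satisfy "if_c_or_z".
    Like the original code, it works on copies that get overwritten.
    """
    lim1 &= MASK32
    lim2 &= MASK32
    val &= MASK32

    # sub lim2, val wz
    lim2 = (lim2 - val) & MASK32
    z = lim2 == 0

    # if_nz sub lim1, val wz   (skipped when val == original lim2)
    if not z:
        lim1 = (lim1 - val) & MASK32
        z = lim1 == 0

    # xor lim1, lim2   (sign bit is 1 when the deltas have opposite signs)
    lim1 ^= lim2

    # shl lim1, #1 wc  (C receives the old bit 31)
    c = bool(lim1 >> 31)

    return c or z


if __name__ == "__main__":
    for v in (5, 10, 15, 20, 25):
        print(v, in_range_p1(10, 20, v))
```

The limits can be given in either order, which is the point of comparing signs rather than doing two ordered compares.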
    Free time status: see my avatar [8^)
    F32 - fast & concise floating point: OBEX, Thread
    Unrelated to the prop: KISSlicer
  • lonesock wrote: »
    cgracey wrote: »
    lonesock wrote: »
    cgracey wrote: »
    Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
    In P1 code:
                  sub       lim2, val wz
            if_nz sub       lim1, val wz
                  xor       lim1, lim2
                  shl       lim1, #1 wc
    
    The value is in range "if_c_or_z". Unfortunately it changes the values of your limit variables, and I'm guessing that's not allowed.

    Jonathan

    Jonathan, this is hard to think about. I will make the effort to ingest this soon.
    My fault, I should have stated the logic behind it.

    I convert the limits to deltas from the tested value, and if the tested value is in the middle of the 2 limits, the delta values should have opposite signs (XOR of the highest (sign) bits should be 1). i.e. (lim1 - val) xor (lim2 - val) should be negative.

    There is also the test case where the limit equals the test value, so I use the Z flag to check for that case. Otherwise you can get a 0 delta, and a positive delta, which does not trip the "opposing signs" logic.

    thanks,
    Jonathan

    Okay. That makes sense. I would still have to study it to fully understand it.
  • I am not sure if this is the correct thread to ask this.

    With the Prop 1 I use WAITPEQ and WAITPNE to put a cog to sleep until one or more pins (as defined in the WAITP__ statement) are set to 1 or 0 by another cog or an external sensor. That wakes the cog, which then decodes the pattern of pins and branches to different routines.

    How would something like that be done in Spin2?

    Thanks
    Tom
  • twm47099 wrote: »
    I am not sure if this is the correct thread to ask this.

    With the Prop 1 I use WAITPEQ and WAITPNE to put a cog to sleep until one or more pins (as defined in the WAITP__ statement) are set to 1 or 0 by another cog or an external sensor. That wakes the cog, which then decodes the pattern of pins and branches to different routines.

    How would something like that be done in Spin2?

    Thanks
    Tom

    You would do a SETPAT instruction with D=mask, S=value, Z=1 for equal or 0 for not equal, and C = 0 for INA or 1 for INB. Then, you could do WAITPAT, POLLPAT WC, or JPAT/JNPAT to wait for the event, poll it, or jump by it.
  • TonyB_ Posts: 684
    edited February 17
    XBYTE issue:
    Chip, could you please look at this matter again?
    cgracey wrote: »
    I ran into a limitation while working on Spin2 with XBYTE...

    XBYTE uses SETQ to establish the bytecode base and LSB/MSB indexing, usually at the start of bytecode execution:
    _RET_   SETQ    #$000           'set bytecode base to $000 in LUT, stack = $1F8, do XBYTE
    

    After that, _RET_+Instruction's are done at the end of each code snippet to execute the next bytecode.

    What if you have an alternate set of bytecodes, though, that you want to execute through a portal bytecode? You could do a new _RET_+SETQ, but then you permanently change the base and MSB/LSB indexing for subsequent _RET_+instruction's.

    There needs to be a way to do a temporary base-and-MSB/LSB-indexing change that only works on the next bytecode, so that alternate sets of bytecodes could be used to run lower-usage code, without any need for a restorative _RET_+SETQ instruction later.

    SETQ2 will now serve this purpose:
    _RET_   SETQ2   #$100           'set bytecode base to $100 in LUT for next bytecode only, stack = $1F8, do XBYTE
    

    This will add 10 flops to each cog, but make alternate bytecode sets easy to engage without the need for post clean-up.

    Firstly, this is a welcome improvement for switching between alternate bytecodes.

    Secondly, having enough room in the LUT for two full sets of 256 bytecodes compared to only one is great, but the problem (and one I have right now) is that there is no space in the LUT for any code. Although 512 jump addresses and skip patterns can fit in the LUT, there is only two-thirds of the cog code space to deal with twice as many bytecodes.

    One solution is to have two-stage decoding for the alternate set. Let's suppose these alternate bytecodes are in the form %yyyyyzzz, where yyyyy and zzz are independent bit groups. A lot of space could be saved in half the LUT by decoding only yyyyy with XBYTE, which leaves zzz to be decoded later. Chip's example above has the stack at $1F8 for both main and alternate sets, but in this case they would be at $1F8 and $1FB:
    Stack   Bits    SETQ          LUT base      LUT index   LUT EXECF     Bytecode
                                  address                   address       set

    $1F8    8       %Axxxxxxxx    %A00000000    yyyyyzzz    %Ayyyyyzzz    main
    $1FB    5       %AAAAxxxx1    %AAAA00000    yyyyy       %AAAAyyyyy    alternate
    

    Only 32 entries are now required in the LUT for the alternate set, a big saving of 224 longs that could make the difference between an interpreter fitting in one cog or not.

    Q1.
    Will it be as easy to switch between these two sets when the stacks are different?
    It would be wrong to assume that the LUT base will always be the same for both.

    Q2.
    Is the following the quickest way to decode low-order bits zzz into skip patterns?
    ' bytecode = %yyyyyzzz at $1f6
    
    zzz_decode	shl	bytecode,#1		' bytecode = %yyyyyzzz0
    		setnib	skip_ins,bytecode,#2	' change nibble #2 of skipf opcode
    skip_ins	skipf	skip_0			' skipf D[2:0] = zzz	
    		...
    
    ' skip patterns for zzz, skip_0 aligned to 8 long boundary
    
    skip_0		long	...
    skip_1		long	...
    skip_2		long	...
    skip_3		long	...
    skip_4		long	...
    skip_5		long	...
    skip_6		long	...
    skip_7		long	...
    

    EDIT:
    Deleted two questions.
    Formerly known as TonyB
  • TonyB_ Posts: 684
    edited February 18
    The main issue here is that the new SETQ2 can change the LUT base address, but it cannot change the LUT index to a different number of bytecode bits because that is set by the return address.

    I think I have a solution. Instead of SETQ2 D simply replacing the original SETQ D, two operations occur for the next bytecode only:

    1. LUT base address = {SETQ[8:1],0} XOR {SETQ2[8:1],0}

    2. SETQ2[8:1] select LUT index width, SETQ2[0] selects right/left aligned, as follows:
    SETQ2       LUT index       Equivalent to temporary
                                return address on stack of

    100000000   bytecode[7:0]   $1F8

    x10000000   bytecode[6:0]   $1F9
    x10000001   bytecode[7:1]   $1F9

    xx1000000   bytecode[5:0]   $1FA
    xx1000001   bytecode[7:2]   $1FA

    xxx100000   bytecode[4:0]   $1FB
    xxx100001   bytecode[7:3]   $1FB

    xxxx10000   bytecode[3:0]   $1FC
    xxxx10001   bytecode[7:4]   $1FC

    xxxxx1000   bytecode[2:0]   $1FD
    xxxxx1001   bytecode[7:5]   $1FD

    xxxxxx100   bytecode[1:0]   $1FE
    xxxxxx101   bytecode[7:6]   $1FE

    xxxxxxx10   bytecode[0]     $1FF
    xxxxxxx11   bytecode[7]     $1FF
    

    SETQ2[1] = 1 is equivalent to $1FF on stack.
    SETQ2[2:1] = 10 is equivalent to $1FE on stack, etc.

    Thus SETQ2[8:1] are priority encoded to same 3-bit value as stack[2:0] would be.


    Example 1
    ' $1F8 on stack
    
    _RET_   SETQ    #$000		'LUT base = $000, index = bytecode[7:0], do XBYTE
    ...
    _RET_   SETQ2   #$100		'LUT base = $100, index = bytecode[7:0], do XBYTE once
    

    Example 2
    ' $1F8 on stack
    
    _RET_   SETQ    #%000000000	'LUT base = $000, index = bytecode[7:0], do XBYTE
    ...
    _RET_   SETQ2   #%111100001     'LUT base = $1E0, index = bytecode[7:3], do XBYTE once
    

    Example 3
    ' $1F8 on stack
    
    _RET_   SETQ    #%011100000	'LUT base = $000, index = bytecode[7:0], do XBYTE
    ...
    _RET_   SETQ2   #%111100001     'LUT base = $100, index = bytecode[7:3], do XBYTE once
    

    Example 4
    ' $1F8 on stack
    
    _RET_   SETQ    #%000100000	'LUT base = $000, index = bytecode[7:0], do XBYTE
    ...
    _RET_   SETQ2   #%100100001     'LUT base = $100, index = bytecode[7:3], do XBYTE once
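
The proposed decoding can be modelled in a few lines of Python, which also makes it easy to confirm the four examples. This is a sketch of the proposal as worded in this post, not of any shipped hardware: the names `decode_setq2` and `lut_address` are hypothetical, rejecting an all-zero SETQ2[8:1] is an assumption (the table does not cover that case), and combining base and index with OR is likewise an assumption.

```python
def decode_setq2(setq: int, setq2: int):
    """Hypothetical model of the proposed one-shot SETQ2 decoding.

    Returns (lut_base, index_width, msb_aligned).
    """
    # Rule 1: LUT base = {SETQ[8:1],0} XOR {SETQ2[8:1],0}
    lut_base = (setq & 0x1FE) ^ (setq2 & 0x1FE)

    # Rule 2: the lowest set bit of SETQ2[8:1] selects the index width
    # (bit 1 -> 1 bit, bit 2 -> 2 bits, ... bit 8 -> 8 bits).
    field = (setq2 >> 1) & 0xFF
    if field == 0:
        raise ValueError("SETQ2[8:1] is zero: case not covered by the table")
    index_width = (field & -field).bit_length()

    # SETQ2[0] selects right (0) or left (1) alignment within the bytecode.
    msb_aligned = bool(setq2 & 1)
    return lut_base, index_width, msb_aligned


def lut_address(setq: int, setq2: int, bytecode: int) -> int:
    """LUT address that the next bytecode would select under this model."""
    lut_base, width, msb_aligned = decode_setq2(setq, setq2)
    bytecode &= 0xFF
    if msb_aligned:
        index = bytecode >> (8 - width)        # e.g. bytecode[7:3] for width 5
    else:
        index = bytecode & ((1 << width) - 1)  # e.g. bytecode[4:0] for width 5
    return lut_base | index                    # assumption: OR, not add


if __name__ == "__main__":
    examples = [
        (0x000, 0x100),
        (0b000000000, 0b111100001),
        (0b011100000, 0b111100001),
        (0b000100000, 0b100100001),
    ]
    for setq, setq2 in examples:
        base, width, msb = decode_setq2(setq, setq2)
        print(f"base=${base:03X} width={width} msb_aligned={msb}")
```

Running the script reproduces the base addresses and index widths given in the comments of Examples 1 to 4.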
    
  • TonyB_ Posts: 684
    edited February 17
    I forgot CZ write enable: SETQ[9] XOR SETQ2[9] might be better than SETQ2[9].
  • TonyB_ Posts: 684
    edited March 12
    If the new 10-bit SETQ2 register were cleared before SETQ2 and after the one-time XBYTE, then SETQ[9:0] XOR SETQ2[9:0] = SETQ[9:0] and the XOR logic could apply all the time.

    EDIT:
    Before the recent debugging change, there were sufficient XBYTE status bits to debug a modified XBYTE as proposed above. I hope it is possible to partially revert the changes to make this possible again.
  • No response - that's not a good sign but I'm not giving up yet.

    I think the new SETQ2 is not as useful as it could be. Instead of being able to do this:
    ' $1F8 on stack
    
    _RET_   SETQ    #%000000000	'LUT base = $000, index = bytecode[7:0], do normal XBYTE
    ...
    _RET_   SETQ2   #%111100001     'LUT base = $1E0, index = bytecode[7:3], do special XBYTE once
    ...
    _RET_	XXXX			'do any non-branching instruction, then do normal XBYTE
    

    users would have to do this
    ' $1F8 on stack
    
    _RET_   SETQ    #%000000000	'LUT base = $000, index = bytecode[7:0], do normal XBYTE
    ...
    	PUSH	#$1FB
    _RET_   SETQ2   #%111100001     'LUT base = $1E0, index = bytecode[7:3], do special XBYTE once
    ...
    _RET_	POP	temp		'remove $1FB from stack, then do normal XBYTE
    

    and the whole point of SETQ2 is rather negated.
  • TonyB_,

    I've been working on the layout with Treehouse, and now I'm doing it MYSELF. Something new. They needed to redirect their layout people to one critical task, so they quit working on Prop2 for what would be a few weeks. I figured I could do the job, so I fired up our old Tanner EDA tools and now I'm on it. This saves about $8k/week. Next week, I'm taking the layout to OnSemi in Idaho for final DRC and LVS, using their Cadence tools. So, I have been in a black hole of sorts.

    I had seen your message, but I could not formulate a response, with a head full of layout. This layout work is very engrossing. It is kind of interesting that you need not know much about electronics to do the job. It's certainly an example of specialization. Yesterday, I finally went to bed after working 35 hours straight. I didn't want any details to spill out of my head before sleeping.

    So, everything is still moving forward. I've just not had much time for The Forum. I will address your question as soon as I get this layout stuff out of the way. Thanks for your input and ideas. They've been making things really nice.
  • Thanks for the update, Chip.
  • 35 hours straight does not sound good.
    I am just another Code Monkey.
    A determined coder can write COBOL programs in any language. -- Author unknown.
    Press any key to continue, any other key to quit

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this post are to be interpreted as described in RFC 2119.
  • Chip,
    Please take it easy! You have had your warning, so take heed as you may not be so lucky next time!!!
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • Well, I'm not sure 35 hours straight is really good...
    They say we're the sharpest just after waking up.
    I know that it is how it is for me.
    Sure, you can do more work without sleep, but I don't think for most of us that it is quality work..

    Still, people are different. There was that great movie "Real Genius" with a girl who was a genius and didn't sleep...
    Prop Info and Apps: http://www.rayslogic.com/
  • Rayman Posts: 8,717
    edited February 24
    Wow, I can post again! Thought I was banned for a while :)

    Anybody see anything different now?
  • Thanks for doing the "hard yards" for us, Chip.

    It's interesting how old tools still come out and make themselves useful. I used a 3 1/2" disk head cleaner during the week; it was the easiest way to move data from an old machine.
  • cgracey Posts: 9,503
    edited February 25
    After another 30-hour push, things seem to have finally come together. I'll be at OnSemi this week doing final design-rule checks and layout-versus-schematic checks.

    I like doing layout work. For the amount of money we spent on this layout and the long time that it took, I'm absolutely doing this myself next time. I can see how Beau used to be so into this. There are so many details to juggle, you can get completely lost in it. I wasn't even hungry.
  • Must be fun, if you're already thinking about next time

    Is it possible to do the layout work for the other 2/4 cog variants while this is relatively fresh in your mind? Or must the synthesis work be completed first?
  • Tubular wrote: »
    Must be fun, if you're already thinking about next time

    Is it possible to do the layout work for the other 2/4 cog variants while this is relatively fresh in your mind? Or must the synthesis work be completed first?

    No new layout is needed for those variants. It's the same pads, just in a different order, and fewer of them. The layout work is 99.9% in the pads, themselves.
  • cgracey wrote: »
    Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.

    lonesock wrote: »
    In P1 code:
                  sub       lim2, val wz
            if_nz sub       lim1, val wz
                  xor       lim1, lim2
                  shl       lim1, #1 wc

    The value is in range "if_c_or_z". Unfortunately it changes the values of your limit variables, and I'm guessing that's not allowed.

    Jonathan


    HydraHackers code below:
    InRange       cmps n,r1  wz,wc     ' checks for n>=r1   
    IF_NZ_AND_NC  cmps r2,n  wz,wc     ' if n>=r1 then check if n<=r2       
    IF_NC         jmp #InRange_ret     ' if n>=r1 and n<=r2 then exit    
                  cmps r1,n  wz,wc     ' checks for n<=r1   
    IF_NZ_AND_NC  cmps n,r2  wz,wc     ' if n<=r1 then check if n>=r2   
    InRange_ret   ret
    
    ' defines what flags mean upon subroutine exit
    
    ' z=1&c=0 means that n is equal to r1 or n is equal to r2
    ' z=0&c=0 means that n is between r1 and r2, but does not include r1 or r2
    ' z=0&c=1 means that n is not in range!
    

    Congratulations, Lonesock: your algorithm to tell whether a value is between two numbers is one instruction shorter than mine. The 'ret' instruction in my code is not needed if you inline the code, but a 'jmp' instruction is still needed so that execution either continues into the in-range code or branches past it.

    HydraHacker
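
For comparison, the flag behaviour documented in the comments of the InRange routine can be traced with a short Python model. This is a sketch under stated assumptions: `cmps` and `hydra_in_range` are hypothetical helper names, Python integers stand in for signed 32-bit registers, and CMPS D,S WZ,WC is modelled as Z = (D == S), C = (D < S, signed).

```python
def cmps(d: int, s: int):
    """Model of P1 'cmps D,S wz,wc': returns (z, c)."""
    return d == s, d < s


def hydra_in_range(n: int, r1: int, r2: int):
    """Hypothetical model of the InRange routine; returns the exit flags (z, c)."""
    z, c = cmps(n, r1)              # InRange       cmps n,r1  wz,wc
    if not z and not c:             # IF_NZ_AND_NC  cmps r2,n  wz,wc
        z, c = cmps(r2, n)
    if not c:                       # IF_NC         jmp #InRange_ret
        return z, c
    z, c = cmps(r1, n)              #               cmps r1,n  wz,wc
    if not z and not c:             # IF_NZ_AND_NC  cmps n,r2  wz,wc
        z, c = cmps(n, r2)
    return z, c                     # InRange_ret   ret


def describe(z: bool, c: bool) -> str:
    """Translate exit flags using the meanings listed in the post."""
    if c:
        return "not in range"
    return "equals a limit" if z else "strictly between"


if __name__ == "__main__":
    for n in (5, 10, 15, 20, 25):
        print(n, describe(*hydra_in_range(n, 10, 20)))
```

As with the shorter version, the two limits may be passed in either order.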
  • cgracey Posts: 9,503
    edited March 30
    TonyB_ had an idea for increasing the flexibility of the bytecode interpreter. We wound up exchanging some PMs and came up with a really nice way to just use SETQ/SETQ2 to define all modes in just 9 bits of data, so immediate values will always suffice.

    Now, instead of the return address on the stack needing to be from $1F8..$1FF, it just needs to be $1FF and SETQ(2) handles the base address AND sizing. We got rid of the 3, 2, 1-bit modes, since they weren't very useful and used up a lot of space:
    // setq/setq2 value on ret/ret_auto to $001FF sets xbyte mode:
    //
    //	ABBBB00xF	= 8-bit opcode, 0-bit data, upper sets of 16 can be collapsed into individual lut addresses
    //	AAxx001MF	= 7-bit opcode, 1-bit data
    //	AAAx101MF	= 6-bit opcode, 2-bit data
    //	AAAAx10MF	= 5-bit opcode, 3-bit data
    //	AAAAA11MF	= 4-bit opcode, 4-bit data
    //
    // A = base address in lut, msb-justified
    // B = collapse sets of 16 starting at ABBBB0000 if BBBB > 0
    // M = use msb's of bytecode for index, not lsb's
    // F = set c,z to index[1:0]
    

    For the 8-bit mode, you can now "collapse" sets of 16 identical LUT entries into just one LUT entry. I had four such cases in the Spin2 interpreter that read constants and handled local variables, and had identical LUT values. The code snippets they branched to would use the lower nibble of the bytecode deposited into PA. Now instead of 64 LUT entries spanning addresses $0C0..$0FF, there are just four entries from $0C0..$0C3. This saves 60 LUT locations and frees up space for alternate bytecodes which can reside at $0C4..$0FF, accessed after a 'SETQ2 #$000' which turns off the "collapse" mode for the next bytecode.
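
The mode table and the "collapse" behaviour can be sketched as a Python model. Treat this as an interpretation rather than a description of the hardware: the names `XbyteMode`, `decode_xbyte_mode` and `xbyte_lut_address` are hypothetical, and the collapse rule used here (a bytecode whose upper nibble is at or above BBBB maps to LUT index (BBBB << 4) + (upper nibble - BBBB)) is an assumption inferred from the $0C0..$0C3 example in the paragraph above.

```python
from typing import NamedTuple


class XbyteMode(NamedTuple):
    width: int      # number of bytecode bits used as the LUT index
    base: int       # LUT base address (the msb-justified A bits)
    msb: bool       # M: index taken from the bytecode's msb's
    flags: bool     # F: set c,z to index[1:0]
    collapse: int   # BBBB (8-bit mode only), 0 = no collapse


def decode_xbyte_mode(d: int) -> XbyteMode:
    """Decode a 9-bit SETQ/SETQ2 value according to the table in the post."""
    d &= 0x1FF
    flags = bool(d & 1)
    sel = (d >> 2) & 0b11
    if sel == 0b00:                        # ABBBB00xF: 8-bit opcode
        return XbyteMode(8, d & 0x100, False, flags, (d >> 4) & 0xF)
    if sel == 0b01:                        # AAxx001MF (7) or AAAx101MF (6)
        width = 6 if d & 0x10 else 7
    elif sel == 0b10:                      # AAAAx10MF: 5-bit opcode
        width = 5
    else:                                  # AAAAA11MF: 4-bit opcode
        width = 4
    base = d & ((0x1FF >> width) << width) # keep only the A bits
    return XbyteMode(width, base, bool(d & 2), flags, 0)


def xbyte_lut_address(d: int, bytecode: int) -> int:
    """LUT address selected for a bytecode under this model."""
    mode = decode_xbyte_mode(d)
    bytecode &= 0xFF
    if mode.width == 8:
        hi = bytecode >> 4
        if mode.collapse and hi >= mode.collapse:
            # Assumed collapse rule: one LUT entry per set of 16 bytecodes.
            index = (mode.collapse << 4) + (hi - mode.collapse)
        else:
            index = bytecode
    elif mode.msb:
        index = bytecode >> (8 - mode.width)
    else:
        index = bytecode & ((1 << mode.width) - 1)
    return mode.base | index


if __name__ == "__main__":
    for bc in (0x42, 0xC5, 0xD0, 0xE7, 0xFF):
        print(f"${bc:02X} -> ${xbyte_lut_address(0x0C0, bc):03X}")
```

With a mode value of $0C0 (A = 0, BBBB = $C), bytecodes $C0..$FF land on LUT entries $0C0..$0C3 in this model, matching the four entries described above, while a mode value of $000 leaves every bytecode on its own entry.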

    I'm compiling new FPGA images now for a v32 release. This is going to be the last one, barring a potential bug fix, before On Semi does the final synthesis.

    I tested out the analog test chip very thoroughly and it looks really good. Just one problem needed fixing with the new PLL, but even it worked pretty well.

    Hopefully, some of you will be ready to run your code through some tests with the final version this weekend.

  • It's Easter this weekend, I'm out of town at Mom's. I expect others will be less available this weekend also.
  • jmg Posts: 12,037
    cgracey wrote: »
    I'm compiling new FPGA images now for a v32 release. This is going to be the last one, barring a potential bug fix, before On Semi does the final synthesis.
    Fingers crossed...
    cgracey wrote: »
    I tested out the analog test chip very thoroughly and it looks really good. Just one problem needed fixing with the new PLL, but even it worked pretty well.
    Do the various current source/sink modes and high speed DACs all work 'as expected' ?
    Did you do any Icc/Vcc plots of the pin buffers to find the current adders for linear pin operation ?
  • jmg wrote: »
    cgracey wrote: »
    I'm compiling new FPGA images now for a v32 release. This is going to be the last one, barring a potential bug fix, before On Semi does the final synthesis.
    Fingers crossed...
    cgracey wrote: »
    I tested out the analog test chip very thoroughly and it looks really good. Just one problem needed fixing with the new PLL, but even it worked pretty well.
    Do the various current source/sink modes and high speed DACs all work 'as expected' ?
    Did you do any Icc/Vcc plots of the pin buffers to find the current adders for linear pin operation ?

    The I/O pad seems to work perfectly. Voltages are within 1% of targets. Resistances and currents are within 2%-3%.

    The current modes are constant inside 0.7V of the power and ground rails.

    Only the PLL needed some attention, which it got.
  • cgracey wrote: »
    jmg wrote: »
    cgracey wrote: »
    I'm compiling new FPGA images now for a v32 release. This is going to be the last one, barring a potential bug fix, before On Semi does the final synthesis.
    Fingers crossed...
    cgracey wrote: »
    I tested out the analog test chip very thoroughly and it looks really good. Just one problem needed fixing with the new PLL, but even it worked pretty well.
    Do the various current source/sink modes and high speed DACs all work 'as expected' ?
    Did you do any Icc/Vcc plots of the pin buffers to find the current adders for linear pin operation ?

    The I/O pad seems to work perfectly. Voltages are within 1% of targets. Resistances and currents are within 2%-3%.

    The current modes are constant inside 0.7V of the power and ground rails.

    Only the PLL needed some attention, which it got.

    WTG!
  • cgracey Posts: 9,503
    edited September 12
    I've got everything into the Spin2 bytecode interpreter that I think is needed, except the bytecode snippet for starting a new Spin2 process. Once I get everything else working, it will become more apparent how to handle that.

    I need to go through this and tidy it up a little bit, but I don't feel like anything needs changing. After that, I'll start working on the compiler to generate the bytecode from source code.

  • cgracey wrote: »
    I've got everything into the Spin2 bytecode interpreter that I think is needed, except the bytecode snippet for starting a new Spin2 process. Once I get everything else working, it will become more apparent how to handle that.

    I need to go through this and tidy it up a little bit, but I don't feel like anything needs changing. After that, I'll start working on the compiler to generate the bytecode from source code.
    Cool! Mind if I look through your code to understand better how to create a fast byte code interpreter for the P2? I'm working on my own for P1 that I'd like to run on P2 as well.
  • That's what it's there for.
  • WTG Chip :)