Fast Bytecode Interpreter

1192022242530

Comments

  • WOW! That's a fantastic achievement Chip.

    I know that the maths side will give a huge improvement too. So really looking forward to see how the remaining portions go too. ;)
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • jmgjmg Posts: 13,603
    cgracey wrote: »
    repeat
      outa++
    

    Curious what speeds these give :
    repeat
      outa++
      outb++
    
    repeat
      outa := outa+1
      outb := outb+1
    

  • jmgjmg Posts: 13,603
    potatohead wrote: »
    At that speed, P2 SPIN will be comparable to P1 PASM.
    P1 ASM would do the above in also 2 lines, for ~ 8x faster than P2 Spin.
    P2 ASM could be the above via REP + one line, (a somewhat special case), and be ~8x faster than P1 ASM.

  • JasonDorieJasonDorie Posts: 1,930
    edited 2017-04-26 - 23:36:28
    potatohead wrote: »
    At that speed, P2 SPIN will be comparable to P1 PASM.

    2 pin toggles per microsecond puts that two instruction loop at about 500ns per iteration, right?
    P1 PASM is 20 instructions per microsecond, so it's still an order of magnitude slower than ASM, but still an order of magnitude faster than P1 Spin. (I'm not complaining :) )

    (edit: yeah, what jmg said)
  • potatoheadpotatohead Posts: 9,733
    edited 2017-04-27 - 01:07:24
    At 160 Mhz, would it be 4 toggles?

    Any yes, which is why I wrote "comparable" For a lot of things, the performance would largely be there. Point being, people could reasonably attempt things in P2 SPIN that would definitely be PASM on P1.

    Add some well placed inline P2 PASM, and it could be golden. ;D
    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: https://forums.parallax.com/discussion/123709/commented-graphics-demo-spin<br>
  • cgraceycgracey Posts: 11,263
    edited 2017-04-27 - 00:25:01
    Note that the scope picture was taken at 2us/division.
  • And, I think I heard Spin2 will let you put ASM calls in cog, right?
    Prop Info and Apps: http://www.rayslogic.com/
  • Rayman wrote: »
    And, I think I heard Spin2 will let you put ASM calls in cog, right?

    Yes.
  • potatoheadpotatohead Posts: 9,733
    edited 2017-04-27 - 00:39:24
    This is gonna rock Chip!

    @all, interpreter numbers!
    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: https://forums.parallax.com/discussion/123709/commented-graphics-demo-spin<br>
  • I'm curious to see how ASM in SPIN2 cached to cog is going to be implemented. Seems like there could be a million ways to do it...
    Prop Info and Apps: http://www.rayslogic.com/
  • cgraceycgracey Posts: 11,263
    edited 2017-04-27 - 02:02:33
    I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

    I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
    %xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
    %xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
    %xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
    

    With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.
  • jmgjmg Posts: 13,603
    cgracey wrote: »
    I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

    I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
    %xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
    %xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
    %xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
    
    With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.
    Not sure I follow ?
    You have an 8 bit field to the left of the table examples, is this a new bytecode proposed, or a new P2 opcode, or 3 new opcodes ?
  • cgraceycgracey Posts: 11,263
    edited 2017-04-27 - 02:25:28
    Based on the next two LSBs of data in the FIFO, it does an RFBYTE, RFWORD, or RFLONG, and then sign-extends the result. Of course, this process doesn't yield 8/16/32 bits of data, but 7/14/30 bits. The point is, the data size is data-dependent.
  • jmgjmg Posts: 13,603
    cgracey wrote: »
    Based on the next two LSBs of data in the FIFO, it does an RFBYTE, RFWORD, or RFLONG, and then sign-extends the result. Of course, this process doesn't yield 8/16/32 bits of data, but 7/14/30 bits. The point is, the data size is data-dependent.

    Ah, ok, how does that mesh with the suggestions ersmith made here, to make porting the RISC-V and ZPU interpreters easier. ?
    http://forums.parallax.com/discussion/comment/1409433/#Comment_1409433

  • I am kind of following this with interest but I have a simple loop in Tachyon running from hub RAM with the P1 @80Mhz.
    2 PIN -1 FOR H L NEXT
    
    Now I know I'm not toggling but I already measure 833kHz. I am hoping that Spin2 could do better than what I am doing already on P1.

    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    --->CLICK THE LOGO for more links<---
    P2 +++++ TAQOZ INTRO & LINKS +++++ P2 SHORTFORM DATASHEET
    P1 +++++ Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
    Brisbane, Australia
  • I am kind of following this with interest but I have a simple loop in Tachyon running from hub RAM with the P1 @80Mhz.
    2 PIN -1 FOR H L NEXT
    
    Now I know I'm not toggling but I already measure 833kHz. I am hoping that Spin2 could do better than what I am doing already on P1.

    I suspect the Spin2 interpreter has more complex variable setup than Tachyon. Also, the Spin program loops for each toggle, so there is a branch.
  • Peter JakackiPeter Jakacki Posts: 8,414
    edited 2017-04-27 - 03:00:30
    Yes, but there is a branch with NEXT so Tachyon is continually reading those instructions from hub RAM between FOR NEXT, so 3 instructions including NEXT per loop in fact. -1 FOR is a simple way of setting up a very long repeat. I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?

    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    --->CLICK THE LOGO for more links<---
    P2 +++++ TAQOZ INTRO & LINKS +++++ P2 SHORTFORM DATASHEET
    P1 +++++ Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
    Brisbane, Australia
  • You loop after H and L. I loop after OUTA++. So, I'm looping every other instruction. That takes time with the FIFO.

    I don't know how I could make my bytecodes any leaner. What kinds of things are you thinking about?
  • jmgjmg Posts: 13,603
    ... I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?
    There is room for inline ASM ? :)
    A byte-code is byte sized here, so it's hard to make that leaner and meaner.
    In fact, once you fetch and interpret, there is a case for making the bytecode do more (rather than less), in order to have the wheel-spinning less of total CPU time.
    I think you effectively did that, with the move to a 16b bytecode ?

  • jmg wrote: »
    ... I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?
    There is room for inline ASM ? :)
    A byte-code is byte sized here, so it's hard to make that leaner and meaner.
    In fact, once you fetch and interpret, there is a case for making the bytecode do more (rather than less), in order to have the wheel-spinning less of total CPU time.
    I think you effectively did that, with the move to a 16b bytecode ?
    I can have my V3 bytecode do the same as what I did here, so that's not the problem. The advantages of wordcode over bytecode in regards to Tachyon are more apparent as the system grows whereas bytecode is definitely more memory efficient when the system is still small.

    I can see the cog memory being made available by Chip's compact bytecode is a great advantage too. If Spin1 bytecode interpreter had even just a little bit of cog memory available for user code it would have made a huge difference. Even Tachyon reserves about 20 something longs for special modules too.


    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    --->CLICK THE LOGO for more links<---
    P2 +++++ TAQOZ INTRO & LINKS +++++ P2 SHORTFORM DATASHEET
    P1 +++++ Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
    Brisbane, Australia
  • cgracey wrote: »
    I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

    I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
    %xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
    %xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
    %xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
    

    With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

    Can you show a snippet of the code you're trying to replace/shorten?
  • cgraceycgracey Posts: 11,263
    edited 2017-04-27 - 06:15:47
    The 18 bytecodes (3*6) for...
    '
    ' Branches - jmp, jz, jnz
    '
    bra_b	rfbyte	pa	'	b
    bra_w	rfword	pa	'	| w
    bra_l	rflong	pa	'	| | l
            test	x   wz	'	| | c d e f	a: branch fwd
            popa	x	'	| | c d e f	b: branch rev
    if_nz	ret     	'	| | c d | |	c: test, pop, branch fwd if z
    if_z	ret     	'	| | | | e f	d: test, pop, branch rev if z
            add	pb,pa	'	a | c | e |	e: test, pop, branch fwd if nz
            sub	pb,pa	'	| b | d | f	f: test, pop, branch rev if nz
    _ret_	rdfast	#0,pb	'	a b c d e f 
    

    Would become 3 bytecodes for...
    '
    ' Branches - jmp, jz, jnz
    '
    bra	rfdata	pa	'	a c e 
            test	x   wz	'	| c e	a: branch
            popa	x	'	| c e
    if_nz	ret	        '	| c |	c: test, pop, branch if z
    if_z	ret	        '	| | e
            add	pb,pa	'	a c e	e: test, pop, branch if nz
    _ret_	rdfast	#0,pb	'	a c e
    

    Much more than shortening code snippets, it reduces the number of bytecodes needed, saving both bytecode table entries and bytecode definitions, allowing more unique bytecodes.

    There would be really huge savings in the variable setups of the current interpreter. 54 bytecodes would be reduced down to 18.
  • jmgjmg Posts: 13,603
    edited 2017-04-27 - 06:04:19
    cgracey wrote: »
    The 18 bytecodes (3*6) for...
    Would become 3 bytecodes for...

    More than shorten code snippets, it reduces the number of bytecodes needed for each snippet. There would be really huge savings in the variable setups.
    Oh, yes, that sounds significant. It means room to add smarter bytecodes ?
    I think that also means other byte-code designs, can more easily support types of BYTE WORD LONG ?
    One pain in the trend to PCs is 'type laziness', where languages round up to larger types, because 'hey, everyone has 8GB now, right' ?

  • Right. It would allow conservation of bytecodes without performance reduction.
  • jmgjmg Posts: 13,603
    cgracey wrote: »
    There would be really huge savings in the variable setups of the current interpreter. 54 bytecodes would be reduced down to 18.
    How many does that leave spare from the 256 ?

  • cgracey wrote: »
    I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.

    I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
    %xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
    %xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
    %xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
    

    With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

    If the tag bit or bits were placed starting at MSB, the encoding scheme would be similar to UTF-8.

    This is proof that it's quite a general purpose mechanism --- definitely useful outside bytecode engines.

    The existence of UTF-8 also suggests that having MSB as the first tag bit would be better,
    unless the bytecode really needs the tag to be at the LSB end.
  • Of course, UTF-8 encodes unsigned integers (codepoints);
    there's no sign extension.

    The need for sign extension might change the perspective
    so having the tag bit(s) at the LSB end might be more "natural":
    fewer surprises when interpreting the values, the sign bit would be in same place.
  • cgracey wrote: »
    I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
    %xxxxxxx0 = RFBYTE --> %ssssssss_ssssssss_ssssssss_sxxxxxxx
    %xxxxxx01 = RFWORD --> %ssssssss_ssssssss_ssxxxxxx_xxxxxxxx
    %xxxxxx11 = RFLONG --> %ssxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx
    

    With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.

    Separate from the tag bits position (LSB or MSB end),
    I would suggest a different mnemonic for the instruction:

    RFVARS

    This kind of encoding was named "VARint" (VARiable-length INTeger) in several designs, SQLite being only one example.

    The trailing "S" stands for "Signed" or "Sign-extended".

    "VARS" looks like a plural noun so it might be confusing;
    "SVAR" could fix this.

    RFSVAR then?
  • An "RFUVAR" instruction ('U' = "Unsigned") without sign extension also makes sense.

    But the need for the signed variant is obvious in the context of bytecode engines, so "RFSVAR" should be the winner if only one fits.
  • Good grief. Somewhere around here I suggested that some fantasy CPU could suck up UTF-8 as it's native machine code. A great way to keep short instructions short and longer ones, longer as needed.

    I thought I was joking.

    I'm not sure the signed/unsigned thing is an issue. UTF-8 is all about encoding bits. What they actually mean is another question.




Sign In or Register to comment.