Fast Bytecode Interpreter

1235725

Comments

  • Re: timings

    We do have 16 COGS. As Chip says, in COG timing isn't hard. It's a lot like a P1, with some spiffy new hardware. Doing things like we did on P1 will continue to be a reasonable option.

    The difference is one can load up a COG, interrupts, etc... that will take a bit more thought. I don't see it being overly difficult.

    Streamer will help a lot. It's the general purpose device we all thought WAITVID would become in the P2.

    In HUB land, it going to vary a lot more. Personally, I see people putting critical stuff into a COG, P1 style.

    Inline will be hardware access and some speed boost. People can code for worst case and where that becomes an issue, into a COG it goes!

    And we all understood the HUB tradeoffs. Moving to a more conservative, design gets us HUBEXEC, and a lot of speed for big Programs.
    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin<br>
  • Heater.Heater. Posts: 19,354
    edited March 16 Vote Up0Vote Down
    I like ozpropdev's suggestion:
            if_nc_and_nz    add r0,r1
            ret_after       add r2,r3
    

    We should not be worrying about an extra tab stop. By the time I have labels and conditionals in the the instructions are already far in to the line.

    Of course actual tab characters should never exist in a source file :)
  • Chip, my apologies for derailing the main topic. Go back to tweaking the interpreter so that we can get the instruction set finished. Arguments over syntax can wait.
  • cgraceycgracey Posts: 7,744
    edited March 16 Vote Up0Vote Down
    Seairth wrote: »
    Chip, my apologies for derailing the main topic. Go back to tweaking the interpreter so that we can get the instruction set finished. Arguments over syntax can wait.

    Hey, no problem. I really didn't even notice.

    I was working out the variable-related bytecodes last night and they are probably going to be 10x faster than Spin on Prop1. A lot is getting done now with very few instructions. Plus, the snippets are more specialized, as opposed to general-case, so efficiency is way higher.
  • dMajodMajo Posts: 717
    edited March 18 Vote Up0Vote Down
    ozpropdev wrote: »
    Maybe two aliases for the %0000 condition like
    	else_ret	djnz	count,#loop
    'or
    	ret_after	add	r0,r1
    
    cgracey wrote: »
    I really like RET_AFTER. Is there any way to get that under 8 characters, though, so that it can sit at the tab stop just prior to the instruction? Then, it aligns with most IF_x prefixes.
    I really like ozpropdev's suggestions (double alias), smart and clean. If 7 chars are the limit then:
    RETNEXT
    RETELSE

    Propeller Object Exchange (last Publications / Updates) --- Oldbitcollector's guest map
    JustForMe
  • Was just looking at Lua...
    Looks like cross between C and Spin. The description uses the words bytecode and interpreter...
    Compiles to 100kB.

    Is this something that could work for P2?
    Prop Info and Apps: http://www.rayslogic.com/
  • Yes, Rayman, a lua runtime could be made for P2. I think a Python one is possible also.

    Also, I am looking forward to trying to get OpenSpin compiled/running on P2 using propgcc when it's up and running.
  • Lua claims that 100kB has the complete library.

    Says that something like Python would have to be stripped of much of the library to fit in small systems...
    Prop Info and Apps: http://www.rayslogic.com/
  • Python does contain a lot of built in "library" but the core language runtime and basic library stuff should be doable.
  • jmgjmg Posts: 10,247
    Rayman wrote: »
    Lua claims that 100kB has the complete library.

    Says that something like Python would have to be stripped of much of the library to fit in small systems...

    Both would be good for P2, and even a smaller Python is OK.
    As P2 continues to gestate, other offerings advance, recent news was the WiFi Pi Zero W for $10, and
    this news from NXP:

    http://www.nxp.com/products/microcontrollers-and-processors/arm-processors/kinetis-cortex-m-mcus/k-series-performance-m4/k2x-usb/kinetis-k28-150-mhz-2x-usb-core-voltage-bypass-2mb-flash-1mb-sram-mcus-based-on-arm-cortex-m4:K28_150#overviewExpand

    That part is a M2 core, 150MHz with 1MBytes of SRAM and 2MB of Flash, and QuadSPI XIP support ( & HS USB).

    The 1MB of SRAM moves it ahead of P2's target memory, and the 150MHz speed is similar to P2 target (which I suspect will drop before release)
    Price wise, the K28 is likely to be more than P2, but is a possible companion device.

    Both PiZW and K28 can run 'larger' language implementations, should users need that.
    It will be important to allow easy movement of code across such platform boundaries.
    Spin is rather too niche, but can spawn other efforts.

    In other news, I see the highest end Multi-Core Infineon MCU parts, now included an 8-bit MCU in the corner,

    I'm looking forward to see numbers that P2 Byte-code fetch can reach, when fed from Streamer in XIP applications.
    That should be part of this byte-code core testing.

  • Yep, MicroPython runs on pretty small devices:
    https://micropython.org/

  • Seems to me a ton of library code could be in external storage too. A few COGS dedicated to paging and a compiler able to break all that out in a useful way could perform well.

    Too diverse of library use would grind away, but a lot if more focused uses may do just fine.

    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin<br>
  • I think most of that library code is not needed for the kinds of things people do with micro-controllers. Rather like we can write C++ code for the Propeller but we don't expect the luxury of the gigantic C++ standard library to be available.

    This is the approach taken with MicroPython and the tiny Javascript engines.

  • I realized that when popping or pushing multiple values, SETQ+RDLONG/WRLONG can be used.

    The bytecode dispatcher jumps right to these and they return to the bytecode loop. This is going to be 10x faster than Spin on the current Prop1:
    '
    '
    ' Dual-result procedures
    '
    op_polcar	setq	#2-1		'POLCAR(r,t : x,y)
    		rdlong	x,ptra[--2]
    		qrotate	x,y
    		jmp	#op_dual_result
    
    op_carpol	setq	#2-1		'CARPOL(x,y : r,t)
    		rdlong	x,ptra[--2]
    		qvector	x,y
    		jmp	#op_dual_result
    
    op_rotcar	setq	#3-1		'ROTCAR(x,y,t : x,y)
    		rdlong	x,ptra[--3]
    		setq	y
    		qrotate	x,z
    
    op_dual_result	getqx	y		'push results
    		getqy	x
    		setq	#2-1
    	_ret_	wrlong	x,ptra[2++]
    
  • cgracey wrote: »
    I realized that when popping or pushing multiple values, SETQ+RDLONG/WRLONG can be used.

    The bytecode dispatcher jumps right to these and they return to the bytecode loop. This is going to be 10x faster than Spin on the current Prop1:
    '
    '
    ' Dual-result procedures
    '
    op_polcar	setq	#2-1		'POLCAR(r,t : x,y)
    		rdlong	x,ptra[--2]
    		qrotate	x,y
    		jmp	#op_dual_result
    
    op_carpol	setq	#2-1		'CARPOL(x,y : r,t)
    		rdlong	x,ptra[--2]
    		qvector	x,y
    		jmp	#op_dual_result
    
    op_rotcar	setq	#3-1		'ROTCAR(x,y,t : x,y)
    		rdlong	x,ptra[--3]
    		setq	y
    		qrotate	x,z
    
    op_dual_result	getqx	y		'push results
    		getqy	x
    		setq	#2-1
    	_ret_	wrlong	x,ptra[2++]
    
    That's quite cool. I just finished reading through your P2 document and there are some amazing capabilities there. It will be a challenge to find a way to use them from the GCC code generator.

  • cgraceycgracey Posts: 7,744
    edited March 19 Vote Up0Vote Down
    I made another change to the Verilog, but it's very minor.

    The SETCZ instruction has always gotten D[1:0] into {C,Z}. Now, SETCZ rotates D right by two bits if D is a register (not #D).

    This automatic rotation allows for new C and Z values with each SETCZ. It comes in very handy for the memory-read/write/modify code in the bytecode interpreter. Note that 'p' is the bytecode:
    '
    ' Read memory - no index
    '
    rdmem		setcz	p	wc,wz	'get offset
    	if_00	rfbyte	m
    	if_01	rfword	m
    	if_10	rflong	m
    
    		setcz	p	wc,wz	'add base
    	if_00	add	m,pbase
    	if_01	add	m,vbase
    	if_10	add	m,dbase
    
    		setcz	p	wc,wz	'read memory
    	if_00	rdbyte	x,m
    	if_01	rdword	x,m
    	if_10	rdlong	x,m
    
    	_ret_	pusha	x
    

    Here is the read with indexing:
    '
    ' Read memory - with index
    '
    rdmemi		setcz	p	wc,wz	'get offset
    	if_00	rfbyte	m
    	if_01	rfword	m
    	if_10	rflong	m
    
    		setcz	p	wc,wz	'add base
    	if_00	add	m,pbase
    	if_01	add	m,vbase
    	if_10	add	m,dbase
    
    		setcz	p	wc,wz	'handle index
    		popa	x
    	if_01	shl	x,#1
    	if_10	shl	x,#2
    		add	m,x
    
    	if_00	rdbyte	x,m		'read memory
    	if_01	rdword	x,m
    	if_10	rdlong	x,m
    
    	_ret_	pusha	x
    
  • cgracey wrote: »
    I made another change to the Verilog, but it's very minor.

    The SETCZ instruction has always gotten D[1:0] into {C,Z}. Now, SETCZ rotates D right by two bits if D is a register (not #D).
    Neat!
    Reminds me of the old POPZC instruction in P2-Hot. :)


    Melbourne, Australia
  • cgracey wrote: »
    I made another change to the Verilog, but it's very minor.

    The SETCZ instruction has always gotten D[1:0] into {C,Z}. Now, SETCZ rotates D right by two bits if D is a register (not #D).

    Clever! So, to be clear, it also means that the D[1:0] is rotated to D[31:30]?

    It might be worthwhile to make this two separate instructions (same encoding, though):

    SETCZ #
    RORCZ D

    This makes it a bit clearer when reading code. Also, that aligns it with "ROR D, #1 WC" (which itself might benefit from an alias RORC).
  • Seairth wrote: »
    cgracey wrote: »
    I made another change to the Verilog, but it's very minor.

    The SETCZ instruction has always gotten D[1:0] into {C,Z}. Now, SETCZ rotates D right by two bits if D is a register (not #D).

    Clever! So, to be clear, it also means that the D[1:0] is rotated to D[31:30]?

    It might be worthwhile to make this two separate instructions (same encoding, though):

    SETCZ #
    RORCZ D

    This makes it a bit clearer when reading code. Also, that aligns it with "ROR D, #1 WC" (which itself might benefit from an alias RORC).

    Yes, D[1:0] are rotated into the top two bits.
  • cgraceycgracey Posts: 7,744
    edited March 21 Vote Up0Vote Down
    Code density is going way up!

    Last night I was thinking about a challenge I have in implementing the interpreter. I've been tempted to write out separate routines for all the different memory variable accesses, since they could be fast and concise that way.

    For example, reading a long variable with an 8-bit offset from dbase would look like this:
    		rfbyte	m
    		add	m,dbase
    		rdlong	x,m
    	_ret_	pusha	x
    

    That's pretty simple, but all the different routines needed would be huge.

    A general-case routine is several times slower, as it must handle every possibility. That would really slow the interpreter down, but save memory.

    I was thinking about how it may be possible to selectively SKIP certain instructions in a long pattern of instructions, in order to get what you want out of the sequence.

    Look at this code. It contains every instruction needed to execute all the hub read/write operations. To do a specific operation, you'd want to skip most of these instructions and execute only a few select ones:
    '
    ' Read/write hub memory
    '
    rw_mem		rfbyte	m		'one of these (offset)		3 x
    		rfword	m
    		rflong	m
    
    		add	m,pbase		'one of these (base)		3 x
    		add	m,vbase
    		add	m,dbase
    
    		popa	x		'maybe this (index)		2 x (on/off)
    		shl	x,#1		'...and maybe this
    		shl	x,#2		'...or maybe this
    		add	m,x		'...and this
    
    		rdbyte	x,m		'one of these (read)		6 x
    		rdword	x,m
    		rdlong	x,m
    	_ret_	pusha	x		'...and this
    
    		popa	x		'or this (write)
    	_ret_	wrbyte	x,m		'...and one of these
    	_ret_	wrword	x,m
    	_ret_	wrlong	x,m
    

    There are 108 permutations possible. That would be a lot of separate small, fast routines, or one big slow one that does it all.

    Well, we can now get ultimate compactness AND highest speed:
    		skip	##%001111110111100	'execute jmp/rfbyte/add/rdlong/pusha
    		jmp	#rw_mem
    
    ...then this executes, pieced together from the rw_mem code...
    
    		rfbyte	m
    		add	m,dbase
    		rdlong	x,m
    	_ret_	pusha	x		'only the four instructions we wanted, zero time overhead!
    

    SKIP uses bits in D, LSB first, to selectively cancel instructions in the order they occur in memory, starting with the next instruction. It works through branches, too. And it doesn't just cancel instructions. In cog-exec mode, it actually increments the PC by 1..8, not just by 1, in order to get to the next instruction, saving time. It works in hub-exec mode by just cancelling instructions, always stepping by 4, to not reset the FIFO.

    This is going to allow for really compact and fast code. This will shrink the interpreter down to almost nothing and it's going to be as fast as can be. For each bytecode, I'll look up a branch address and a SKIP pattern, and do a SKIP+JMP to execute the code and pattern needed by the bytecode. It will be very efficient.

    I'm thinking with _RET_ and this new SKIP, it's probably time for an FPGA update.

    When I started working on this, I first had a 32-bit encoder to look ahead at the SKIP pattern, so that it could increment the PC by 1..33. That encoder was big and slow, and then a big shifter was needed. Not practical. By making it only look eight SKIP bits ahead, the encoder and shifter were 1/4 the size and fast enough to work in time. It was very little logic to implement. If you do wind up skipping 32 instructions, it takes a total of 8 clocks: 1st instruction is cancelled, then the 9th, 17th, and 25th are cancelled, with PC incrementing by 8, 8, 8, then 8, dropping you off at the end of the sequence. Interrupts are disabled during SKIP. The SKIP pattern can be as short as you want, as leading 0's are implied. Once the 1's are exhausted, it's over.
  • That's really neat Chip!
    Small and fast, fabulous! :)
    Melbourne, Australia
  • It looks clever but also a source of virtually unreadable code. Is it really faster than the one-routine-per-instruction implementation if you put those routines in hub memory? How about paring down the instruction set so you don't need so many cases?
  • David Betz wrote: »
    It looks clever but also a source of virtually unreadable code. Is it really faster than the one-routine-per-instruction implementation if you put those routines in hub memory? How about paring down the instruction set so you don't need so many cases?

    You don't want those routines in the hub because the FIFO will always be slowing down the RDxxxx/WRxxxx instructions. This allows for smallest code, actually just the bare set of instructions needed to build any variant routine, and then a mask for the SKIP instruction.
  • cgracey wrote: »
    David Betz wrote: »
    It looks clever but also a source of virtually unreadable code. Is it really faster than the one-routine-per-instruction implementation if you put those routines in hub memory? How about paring down the instruction set so you don't need so many cases?

    You don't want those routines in the hub because the FIFO will always be slowing down the RDxxxx/WRxxxx instructions. This allows for smallest code, actually just the bare set of instructions needed to build any variant routine, and then a mask for the SKIP instruction.
    I still think it might be better to just rearchitect the byte code VM instruction set to fit into COG/LUT RAM. Are you attempting to keep the byte codes the same as the ones used by the P1 VM?
  • cgracey wrote: »
    I'm thinking with _RET_ and this new SKIP, it's probably time for an FPGA update.
    FPGA's warmed up and ready to exercise V17. :cool:
    Melbourne, Australia
  • RaymanRayman Posts: 8,241
    edited March 21 Vote Up0Vote Down
    This is going to allow for some really obfuscated assembly code...

    What happens if there's a second skip when one is already in effect?
    Prop Info and Apps: http://www.rayslogic.com/
  • WORTH IT
    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin<br>
  • I still think it will generate lots of ugly code that is hard to understand and bug prone.
  • Rayman wrote: »
    This is going to allow for some really obfuscated assembly code...

    What happens if there's a second skip when one is already in effect?

    That depends on whether the first SKIP skipped the second SKIP. :)
  • David Betz,
    I think that ship already sailed with the self modifying and all the ALT* instructions. This is a neat code density trick, and with a little smarts an editor could show the executed snippets in a tooltip or some form of expanded view mode. Heck, we might even be able to let you edit the "expanded" form and have it auto-generate the condensed form.

    Maybe I am biased, since I have to deal with templates, lambdas, and what not in C++ all the time, and those are equally hard (if not far worse in some cases) to parse and debug. I agree that it will make PASM2 a bit harder to read and debug when it has them, but I think the benefits far outweigh that cost.
Sign In or Register to comment.