We do have 16 COGS. As Chip says, in-COG timing isn't hard. It's a lot like a P1, with some spiffy new hardware. Doing things like we did on P1 will continue to be a reasonable option.
The difference is that one can load up a COG with interrupts, etc... That will take a bit more thought, but I don't see it being overly difficult.
The Streamer will help a lot. It's the general-purpose device we all thought WAITVID would become in the P2.
In HUB land, it's going to vary a lot more. Personally, I see people putting critical stuff into a COG, P1 style.
Inline will be for hardware access and some speed boost. People can code for the worst case, and where that becomes an issue, into a COG it goes!
And we all understood the HUB tradeoffs. Moving to a more conservative design gets us HUBEXEC, and a lot of speed for big programs.
Chip, my apologies for derailing the main topic. Go back to tweaking the interpreter so that we can get the instruction set finished. Arguments over syntax can wait.
Hey, no problem. I really didn't even notice.
I was working out the variable-related bytecodes last night and they are probably going to be 10x faster than Spin on Prop1. A lot is getting done now with very few instructions. Plus, the snippets are more specialized, as opposed to general-case, so efficiency is way higher.
I really like RET_AFTER. Is there any way to get that under 8 characters, though, so that it can sit at the tab stop just prior to the instruction? Then, it aligns with most IF_x prefixes.
I really like ozpropdev's suggestions (double alias), smart and clean. If 7 chars is the limit, then:
RETNEXT
RETELSE
Says that something like Python would have to be stripped of much of the library to fit in small systems...
Both would be good for P2, and even a smaller Python is OK.
As P2 continues to gestate, other offerings advance. Recent news was the WiFi Pi Zero W for $10, and this news from NXP:
http://www.nxp.com/products/microcontrollers-and-processors/arm-processors/kinetis-cortex-m-mcus/k-series-performance-m4/k2x-usb/kinetis-k28-150-mhz-2x-usb-core-voltage-bypass-2mb-flash-1mb-sram-mcus-based-on-arm-cortex-m4:K28_150#overviewExpand
That part is an M4 core at 150MHz, with 1MB of SRAM and 2MB of Flash, and QuadSPI XIP support (& HS USB).
The 1MB of SRAM moves it ahead of P2's target memory, and the 150MHz speed is similar to the P2 target (which I suspect will drop before release).
Price-wise, the K28 is likely to cost more than a P2, but it is a possible companion device.
Both PiZW and K28 can run 'larger' language implementations, should users need that.
It will be important to allow easy movement of code across such platform boundaries.
Spin is rather too niche, but can spawn other efforts.
In other news, I see the highest-end multi-core Infineon MCU parts now include an 8-bit MCU in the corner.
I'm looking forward to seeing the numbers that P2 byte-code fetch can reach when fed from the Streamer in XIP applications.
That should be part of this byte-code core testing.
Seems to me a ton of library code could be in external storage too. A few COGS dedicated to paging and a compiler able to break all that out in a useful way could perform well.
Too diverse a library use would grind away, but a lot of more focused uses may do just fine.
I think most of that library code is not needed for the kinds of things people do with micro-controllers. Rather like we can write C++ code for the Propeller but we don't expect the luxury of the gigantic C++ standard library to be available.
This is the approach taken with MicroPython (https://micropython.org/) and the tiny Javascript engines.
That's quite cool. I just finished reading through your P2 document and there are some amazing capabilities there. It will be a challenge to find a way to use them from the GCC code generator.
I made another change to the Verilog, but it's very minor.
The SETCZ instruction has always gotten D[1:0] into {C,Z}. Now, SETCZ rotates D right by two bits if D is a register (not #D).
This automatic rotation allows for new C and Z values with each SETCZ. It comes in very handy for the memory-read/write/modify code in the bytecode interpreter. Note that 'p' is the bytecode:
'
' Read memory - no index
'
rdmem           setcz   p       wc,wz   'get offset
        if_00   rfbyte  m
        if_01   rfword  m
        if_10   rflong  m
                setcz   p       wc,wz   'add base
        if_00   add     m,pbase
        if_01   add     m,vbase
        if_10   add     m,dbase
                setcz   p       wc,wz   'read memory
        if_00   rdbyte  x,m
        if_01   rdword  x,m
        if_10   rdlong  x,m
        _ret_   pusha   x
Here is the read with indexing:
'
' Read memory - with index
'
rdmemi          setcz   p       wc,wz   'get offset
        if_00   rfbyte  m
        if_01   rfword  m
        if_10   rflong  m
                setcz   p       wc,wz   'add base
        if_00   add     m,pbase
        if_01   add     m,vbase
        if_10   add     m,dbase
                setcz   p       wc,wz   'handle index
                popa    x
        if_01   shl     x,#1
        if_10   shl     x,#2
                add     m,x
        if_00   rdbyte  x,m     'read memory
        if_01   rdword  x,m
        if_10   rdlong  x,m
        _ret_   pusha   x
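For anyone following along, here is a rough Python model of the SETCZ behaviour described above, just to show how one register can feed all three SETCZ steps in the listings. The two-bit field layout of 'p' used here is only an illustration, not the actual bytecode encoding.

    def setcz(d):
        # Model of SETCZ D (register form) as described above:
        # C = D[1], Z = D[0], then D is rotated right by two bits.
        c = (d >> 1) & 1
        z = d & 1
        d = ((d >> 2) | (d << 30)) & 0xFFFFFFFF    # 32-bit rotate right by 2
        return c, z, d

    # Illustrative bytecode: bits[1:0] = offset size, bits[3:2] = base, bits[5:4] = access.
    p = 0b10_01_00                                 # byte offset, vbase, long read
    for step in ("get offset", "add base", "read memory"):
        c, z, p = setcz(p)
        print(f"{step}: C={c} Z={z}")              # each SETCZ exposes a fresh two-bit field

Running it gives C=0 Z=0, then C=0 Z=1, then C=1 Z=0, which is the if_00 / if_01 / if_10 selection the rdmem code above relies on.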
Last night I was thinking about a challenge I have in implementing the interpreter. I've been tempted to write out separate routines for all the different memory variable accesses, since they could be fast and concise that way.
For example, reading a long variable with an 8-bit offset from dbase would look like this:
                rfbyte  m       'fetch the 8-bit offset from the bytecode stream
                add     m,dbase 'add dbase to form the hub address
                rdlong  x,m     'read the long variable
        _ret_   pusha   x       'push the value and return
That's pretty simple, but all the different routines needed would be huge.
A general-case routine is several times slower, as it must handle every possibility. That would really slow the interpreter down, but save memory.
I was thinking about how it may be possible to selectively SKIP certain instructions in a long pattern of instructions, in order to get what you want out of the sequence.
Look at this code. It contains every instruction needed to execute all the hub read/write operations. To do a specific operation, you'd want to skip most of these instructions and execute only a few select ones:
'
' Read/write hub memory
'
rw_mem          rfbyte  m       'one of these (offset)  3 x
                rfword  m
                rflong  m
                add     m,pbase 'one of these (base)    3 x
                add     m,vbase
                add     m,dbase
                popa    x       'maybe this (index)     2 x (on/off)
                shl     x,#1    '...and maybe this
                shl     x,#2    '...or maybe this
                add     m,x     '...and this
                rdbyte  x,m     'one of these (read)    6 x
                rdword  x,m
                rdlong  x,m
        _ret_   pusha   x       '...and this
                popa    x       'or this (write)
        _ret_   wrbyte  x,m     '...and one of these
        _ret_   wrword  x,m
        _ret_   wrlong  x,m
There are 108 permutations possible. That would be a lot of separate small, fast routines, or one big slow one that does it all.
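Just to sanity-check the 108 figure against the choices the comments in the listing call out, a quick bit of Python arithmetic (nothing here beyond the counts already given):

    offset_sizes = 3        # rfbyte / rfword / rflong
    bases        = 3        # pbase / vbase / dbase
    indexing     = 2        # indexed or not
    accesses     = 3 + 3    # read byte/word/long, or write byte/word/long

    print(offset_sizes * bases * indexing * accesses)    # 3 * 3 * 2 * 6 = 108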
Well, we can now get ultimate compactness AND highest speed:
                skip    ##%001111110111100      'execute jmp/rfbyte/add/rdlong/pusha
                jmp     #rw_mem

...then this executes, pieced together from the rw_mem code...

                rfbyte  m
                add     m,dbase
                rdlong  x,m
        _ret_   pusha   x       'only the four instructions we wanted, zero time overhead!
SKIP uses bits in D, LSB first, to selectively cancel instructions in the order they occur in memory, starting with the next instruction. It works through branches, too. And it doesn't just cancel instructions. In cog-exec mode, it actually increments the PC by 1..8, not just by 1, in order to get to the next instruction, saving time. It works in hub-exec mode by just cancelling instructions, always stepping by 4, to not reset the FIFO.
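To take some of the magic out of the ##%001111110111100 constant above, here is a rough Python sketch that rebuilds it from the rw_mem listing. It assumes only what is stated above: bit 0 of the pattern lines up with the instruction right after SKIP (the JMP), a 1 cancels an instruction, a 0 lets it execute, and bits past the last 1 are implied zeros. The instruction strings are just hand-copied labels for this sketch.

    # Instructions in memory order, starting with the JMP that immediately follows SKIP.
    sequence = [
        "jmp #rw_mem",
        "rfbyte m", "rfword m", "rflong m",
        "add m,pbase", "add m,vbase", "add m,dbase",
        "popa x (index)", "shl x,#1", "shl x,#2", "add m,x",
        "rdbyte x,m", "rdword x,m", "rdlong x,m",
        "_ret_ pusha x",
        "popa x (write)", "_ret_ wrbyte x,m", "_ret_ wrword x,m", "_ret_ wrlong x,m",
    ]

    # The variant wanted here: read a long, 8-bit offset, dbase, no index.
    keep = {"jmp #rw_mem", "rfbyte m", "add m,dbase", "rdlong x,m", "_ret_ pusha x"}

    last = max(i for i, ins in enumerate(sequence) if ins in keep)   # the _ret_ ends the variant
    mask = 0
    for bit, ins in enumerate(sequence[:last + 1]):
        if ins not in keep:
            mask |= 1 << bit                     # 1 = cancel this instruction

    print(f"skip    ##%{mask:0{last + 1}b}")     # -> skip    ##%001111110111100

Swapping the entries in keep gives the pattern for any of the other variants.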
This is going to allow for really compact and fast code. This will shrink the interpreter down to almost nothing and it's going to be as fast as can be. For each bytecode, I'll look up a branch address and a SKIP pattern, and do a SKIP+JMP to execute the code and pattern needed by the bytecode. It will be very efficient.
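As a sketch of the shape only (the bytecode value and the table contents here are invented for illustration, not taken from the interpreter), the per-bytecode lookup described above could be modelled like this:

    # Hypothetical table: bytecode -> (branch target, SKIP pattern).
    DISPATCH = {
        0x4A: ("rw_mem", 0b001111110111100),    # e.g. read long, 8-bit offset, dbase, no index
        # ... one entry per bytecode ...
    }

    def dispatch(bytecode):
        # Look up the entry, then issue the SKIP pattern and jump to the snippet.
        target, pattern = DISPATCH[bytecode]
        print(f"        skip    ##%{pattern:b}")
        print(f"        jmp     #{target}")

    dispatch(0x4A)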
I'm thinking with _RET_ and this new SKIP, it's probably time for an FPGA update.
When I started working on this, I first had a 32-bit encoder to look ahead at the SKIP pattern, so that it could increment the PC by 1..33. That encoder was big and slow, and then a big shifter was needed. Not practical. By making it only look eight SKIP bits ahead, the encoder and shifter were 1/4 the size and fast enough to work in time. It was very little logic to implement. If you do wind up skipping 32 instructions, it takes a total of 8 clocks: 1st instruction is cancelled, then the 9th, 17th, and 25th are cancelled, with PC incrementing by 8, 8, 8, then 8, dropping you off at the end of the sequence. Interrupts are disabled during SKIP. The SKIP pattern can be as short as you want, as leading 0's are implied. Once the 1's are exhausted, it's over.
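Here is a rough Python trace of that stepping behaviour, as I read the description above (a model of the description, not of the actual Verilog): each fetched instruction is cancelled if its own skip bit is set, and the PC then advances by one plus up to seven further skipped slots, so short runs cost nothing while a run of 32 produces exactly the four cancels described.

    def skip_trace(pattern, count):
        # pattern bit k corresponds to the k-th instruction after SKIP (bit 0 = first).
        # Returns (index, 'run' or 'cancel', pc_increment) for each instruction reached.
        trace, i = [], 0
        while i < count:
            cancelled = bool((pattern >> i) & 1)
            step = 1
            while step < 8 and (pattern >> (i + step)) & 1:
                step += 1                        # absorb up to 7 following skipped slots
            trace.append((i, "cancel" if cancelled else "run", step))
            i += step
        return trace

    # The rw_mem example above: five instructions run, nothing is ever cancelled.
    print(skip_trace(0b001111110111100, 15))
    # 32 consecutive skips: the 1st, 9th, 17th and 25th are cancelled, with the PC stepping by 8.
    print(skip_trace((1 << 32) - 1, 33))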
It looks clever, but also like a source of virtually unreadable code. Is it really faster than the one-routine-per-instruction implementation if you put those routines in hub memory? How about paring down the instruction set so you don't need so many cases?
You don't want those routines in the hub because the FIFO will always be slowing down the RDxxxx/WRxxxx instructions. This allows for the smallest code: just the bare set of instructions needed to build any variant routine, plus a mask for the SKIP instruction.
I still think it might be better to just rearchitect the byte code VM instruction set to fit into COG/LUT RAM. Are you attempting to keep the byte codes the same as the ones used by the P1 VM?
David Betz,
I think that ship already sailed with self-modifying code and all the ALT* instructions. This is a neat code-density trick, and with a little smarts an editor could show the executed snippets in a tooltip or some form of expanded view mode. Heck, we might even be able to let you edit the "expanded" form and have it auto-generate the condensed form.
Maybe I am biased, since I have to deal with templates, lambdas, and what not in C++ all the time, and those are equally hard (if not far worse in some cases) to parse and debug. I agree that it will make PASM2 a bit harder to read and debug when it has them, but I think the benefits far outweigh that cost.
We should not be worrying about an extra tab stop. By the time I have labels and conditionals in, the instructions are already far into the line.
Of course, actual tab characters should never exist in a source file.
Looks like a cross between C and Spin. The description uses the words bytecode and interpreter...
Compiles to 100kB.
Is this something that could work for P2?
Also, I am looking forward to trying to get OpenSpin compiled/running on P2 using propgcc when it's up and running.
The bytecode dispatcher jumps right to these snippets and they return to the bytecode loop. This is going to be 10x faster than Spin on the current Prop1.
Reminds me of the old POPZC instruction in P2-Hot.
Clever! So, to be clear, it also means that D[1:0] are rotated to D[31:30]?
It might be worthwhile to make this two separate instructions (same encoding, though):
SETCZ #
RORCZ D
This makes it a bit clearer when reading code. Also, that aligns it with "ROR D, #1 WC" (which itself might benefit from an alias RORC).
Yes, D[1:0] are rotated into the top two bits.
Small and fast, fabulous!
What happens if there's a second skip when one is already in effect?
That depends on whether the first SKIP skipped the second SKIP.