Turbo Bytecode Blowout - Interpret hub bytes in 8+ clocks!

jmg · 2017-04-04 22:40

JasonDorie wrote: »

Doesn't CIL exclude argument types on its instructions? I looked into it once as a possibility for a runtime scripting language, and that was one of the major pain points because it means every add/sub/mil/div/etc needs to type-check and decide what routines to call.

Chip's EXEC may be able to support some of that ?
If Spin2 is going to support native float, it will have somewhat similar issues ?

Here are some bytecodes - looks like int8,int16 morph into int32 ( I'm not sure about u4 -> i4 ? )
Maybe a default can be <= u4.i4.r4, and an extension can include i8,u8,r8

0x90	ldelem.i1	Load the element with type int8 at index onto the top of the stack as an int32.	Object model instruction
0x92	ldelem.i2	Load the element with type int16 at index onto the top of the stack as an int32.	Object model instruction
0x94	ldelem.i4	Load the element with type int32 at index onto the top of the stack as an int32.	Object model instruction
0x96	ldelem.i8	Load the element with type int64 at index onto the top of the stack as an int64.	Object model instruction
0x98	ldelem.r4	Load the element with type float32 at index onto the top of the stack as an F	Object model instruction
0x99	ldelem.r8	Load the element with type float64 at index onto the top of the stack as an F.
0x91	ldelem.u1	Load the element with type unsigned int8 at index onto the top of the stack as an int32.	Object model instruction
0x93	ldelem.u2	Load the element with type unsigned int16 at index onto the top of the stack as an int32.	Object model instruction
0x95	ldelem.u4	Load the element with type unsigned int32 at index onto the top of the stack as an int32.	Object model instruction
0x96	ldelem.u8	Load the element with type unsigned int64 at index onto the top of the stack as an int64 (alias for ldelem.i8).	Object model instruction

JasonDorie · 2017-04-04 23:14

Most interpreters (or native chips) use instructions that imply type. Like fadd and add, for example. Since the CIL doesn't do that, you have to inspect the types of the arguments on the stack to decide what type of op to perform, whether conversions are required, and so on. It's really messy and it surprised me that it was done that way. Most operations use ldarg.# to load function arguments.

It's possible, for example, to do:

ldarg.0
ldarg.1
div

The only way to know what type of output you need to generate is to check the source types. They might be int32, int64, or float. And unless you know what type you're storing the result into, you'd probably have to do the math as float and only do the conversion on store.

You could conceivably only support a subset of the instructions, but then the idea of having access to a lot of existing code / compilers goes out the window, which is why I dropped the idea of using it more or less immediately.

potatohead · 2017-04-04 23:16

I would rather have the speed and simplicity over native floating point.

Actually, adding fixed point would be preferable to floating point, but only if doing so would maintain the current speed and simplicity aspects of SPIN.

jmg · 2017-04-04 23:42

potatohead wrote: »

I would rather have the speed and simplicity over native floating point.

See the post here, of a way to get simplicity, and floating point.

http://forums.parallax.com/discussion/comment/1406456/#Comment_1406456

The bytecode engine does not change here, just the front end IQ.

David Betz · 2017-04-05 03:31

msrobots wrote: »

It is really cool how fast this now is. I personally was drawn to the Propeller thru the work of MPARC and his self hosted Spin. It was like being back in the 80' running my small Atari 128 XE. And doing that with the PE-Kit on a breadboard was amazing.

I used PropGCC also, but never really programmed in C professional. After the Mainframe-time with COBOL I went quite abruptly into C# so I really would like to see a CIL bytecode interpreter, but sadly I am not (yet?) able to write one.

David and Eric seem to be a little frustrated, on one side because of the constant nagging about C here in the forums, but also because of constant op-code changes. What I think is needed (besides financial support from Parallax) is some cheering on to get those two highly qualified persons to start working on it again.

C on the P1 was a stepping stone to have C running on the P2 from the beginning, and even if I have a hard time with C by myself, supporting C on the P2 is a main goal for the money making education support Parallax does.

Dave mentioned a couple of times that the CMM memory model could benefit from some decode instruction to expand CMM-Byte-Codes to PASM instructions. I just loosely followed the PropGCC discussions at that time, but CMM on the P1 is usually faster as Spin on the P1 and has a comparable small memory footprint.

Since I never wrote a compiler by myself, and am not versatile in C, I have just to guess how much work is needed to support C on the P2.

But slowly I get the feeling that Chip is finalizing the op-code changes, so it might be REALLY important to explain to him what changes would be helpful to support the code generator of GCC. One thing I remember was the LINK register, and that is done. But besides the CMM decode instruction there might be other things able to be streamlined in Verilog to support GCC code generation better.

So writing SPIN2 showed some new needed instructions/shortcuts to really speed up things. For that reason I think it is quite important, needed and necessary to work on GCC before any Verilog code freeze.

There might be a equivalent speed gain possible for GCC, and nobody knows it now because nobody is working on it.

just my 2 cents.

MIke

Instruction encoding changes aren't really a problem for me anymore now that I just parse Chip's instruction spreadsheet to generate an opcode table for gas.

David Betz · 2017-04-05 03:33

jmg wrote: »

msrobots wrote: »

But slowly I get the feeling that Chip is finalizing the op-code changes, so it might be REALLY important to explain to him what changes would be helpful to support the code generator of GCC. One thing I remember was the LINK register, and that is done. But besides the CMM decode instruction there might be other things able to be streamlined in Verilog to support GCC code generation better.

So writing SPIN2 showed some new needed instructions/shortcuts to really speed up things. For that reason I think it is quite important, needed and necessary to work on GCC before any Verilog code freeze.
There might be a equivalent speed gain possible for GCC, and nobody knows it now because nobody is working on it.

Note that Spin2 must use a byte-code engine, whilst GCC's main mode I expect will be native opcodes for COG/HUBEXEC.
Of course that does not exclude also doing some other code generation for C.

Alternatively, if P2 is able to run CIL bytecodes, it will pick up a shipload of languages right there.

Why *must* Spin2 use byte codes? Eric already has a Spin compiler for P2 that generates PASM.

jmg · 2017-04-05 03:42

David Betz wrote: »

Why *must* Spin2 use byte codes? Eric already has a Spin compiler for P2 that generates PASM.

OK, to clarify, Chip's Version of Spin 2 will be using ByteCodes (hence this thread).
Once Spin2 is defined and stable, maybe others will be able to do a corresponding/compatible Spin2 to PASM.

JasonDorie · 2017-04-05 05:26

David Betz wrote: »

Why *must* Spin2 use byte codes? Eric already has a Spin compiler for P2 that generates PASM.

Because Chip has stated that it will, and he's bound by what he knows. It ends up being significantly more compact, which is a nice side effect though.

Rayman · 2017-04-05 10:40

Although 512kB seems nearly infinite compared to P1's 32 kB, I'm sure some will find a way to fill that up. Then, we'll start looking to bytecode options for sure....

XGA screen buffer at just 4 bpp take up 384 kB (if I did the math right). Then, you just have 128 kB for code...

Seairth · 2017-04-05 10:44

JasonDorie wrote: »

David Betz wrote: »

Why *must* Spin2 use byte codes? Eric already has a Spin compiler for P2 that generates PASM.

Because Chip has stated that it will, and he's bound by what he knows. It ends up being significantly more compact, which is a nice side effect though.

Right, it's software-based thumb mode.

David Betz · 2017-04-05 12:35

Rayman wrote: »

Although 512kB seems nearly infinite compared to P1's 32 kB, I'm sure some will find a way to fill that up. Then, we'll start looking to bytecode options for sure....

XGA screen buffer at just 4 bpp take up 384 kB (if I did the math right). Then, you just have 128 kB for code...

I wasn't arguing with a byte code version of Spin. I was just surprised when it was suggested that Spin must use byte code.

cgracey · 2017-04-05 18:36

All yesterday, I was down at Parallax, so I didn't get anything done on Prop2, though I figured out some stuff while driving.

This morning I made the bytecode executor do two more things within the existing 6-clock framework:

1) write the bytecode to PA ($1F6)
2) write the hub pointer to PB ($1F7)

#1 was necessary, so that the code snippets can see their bytecode. #2 was a luxury, as it only saves a GETPTR instruction, but it could be had for only a few gates, so I added it. Adding these two writes required no new mux's, just several gates of steering logic.

Seairth · 2017-04-05 19:04

cgracey wrote: »

All yesterday, I was down at Parallax, so I didn't get anything done on Prop2, though I figured out some stuff while driving.

This morning I made the bytecode executor do two more things within the existing 6-clock framework:

1) write the bytecode to PA ($1F6)
2) write the hub pointer to PB ($1F7)

#1 was necessary, so that the code snippets can see their bytecode. #2 was a luxury, as it only saves a GETPTR instruction, but it could be had for only a few gates, so I added it. Adding these two writes required no new mux's, just several gates of steering logic.

One minor suggestion: pack both into PB. That way, PA can still be used as PA, if needed.

cgracey · 2017-04-05 20:29

Seairth wrote: »

cgracey wrote: »

All yesterday, I was down at Parallax, so I didn't get anything done on Prop2, though I figured out some stuff while driving.

This morning I made the bytecode executor do two more things within the existing 6-clock framework:

1) write the bytecode to PA ($1F6)
2) write the hub pointer to PB ($1F7)

#1 was necessary, so that the code snippets can see their bytecode. #2 was a luxury, as it only saves a GETPTR instruction, but it could be had for only a few gates, so I added it. Adding these two writes required no new mux's, just several gates of steering logic.

One minor suggestion: pack both into PB. That way, PA can still be used as PA, if needed.

Writing those values to PA and PB do not preclude their use for any other purpose. Does that allay your concern?

jmg · 2017-04-05 20:36

Seairth wrote: »

One minor suggestion: pack both into PB. That way, PA can still be used as PA, if needed.

Does that then limit the size of hub pointer ? - could matter for off-chip cases ?

Seairth · 2017-04-05 20:37

cgracey wrote: »

Seairth wrote: »

cgracey wrote: »

All yesterday, I was down at Parallax, so I didn't get anything done on Prop2, though I figured out some stuff while driving.

This morning I made the bytecode executor do two more things within the existing 6-clock framework:

1) write the bytecode to PA ($1F6)
2) write the hub pointer to PB ($1F7)

#1 was necessary, so that the code snippets can see their bytecode. #2 was a luxury, as it only saves a GETPTR instruction, but it could be had for only a few gates, so I added it. Adding these two writes required no new mux's, just several gates of steering logic.

One minor suggestion: pack both into PB. That way, PA can still be used as PA, if needed.

Writing those values to PA and PB do not preclude their use for any other purpose. Does that allay your concern?

They don't preclude their use between XBYTE calls, but you can't leave data in them across XBYTE calls. I know this isn't an issue for the spin interpreter, but figured the option would be nice. As I said, though, it's a minor suggestion. I'm not attached to the idea one way or the other.

Edit: as I suggested earlier, an RFLAST instruction would be more versatile.

jmg · 2017-04-05 20:42

Rayman wrote: »

Although 512kB seems nearly infinite compared to P1's 32 kB, I'm sure some will find a way to fill that up. Then, we'll start looking to bytecode options for sure....

XGA screen buffer at just 4 bpp take up 384 kB (if I did the math right). Then, you just have 128 kB for code...

Even that 512kB is currently merely a 'target', but I think this is not locked to a binary-multiple ?
Is it just set to a multiple of 16 ?

If luck runs our way, and P&R gives spare room, values like 576k, 640k, 768k may fit ?

(I see NXP now have MCUs with 1MByte SRAM)

Seairth · 2017-04-05 20:44

jmg wrote: »

Seairth wrote: »

One minor suggestion: pack both into PB. That way, PA can still be used as PA, if needed.

Does that then limit the size of hub pointer ? - could matter for off-chip cases ?

I don't see how, unless the Propeller adds hardware-level off-chip support. And at 24 bits, the addresses portion of the field is already 4 bits greater than the maximum allowable hub size. Still, I'm not attached to the idea. Just putting it out there...

David Betz · 2017-04-05 20:45

cgracey wrote: »

Seairth wrote: »

cgracey wrote: »

All yesterday, I was down at Parallax, so I didn't get anything done on Prop2, though I figured out some stuff while driving.

This morning I made the bytecode executor do two more things within the existing 6-clock framework:

1) write the bytecode to PA ($1F6)
2) write the hub pointer to PB ($1F7)

#1 was necessary, so that the code snippets can see their bytecode. #2 was a luxury, as it only saves a GETPTR instruction, but it could be had for only a few gates, so I added it. Adding these two writes required no new mux's, just several gates of steering logic.

One minor suggestion: pack both into PB. That way, PA can still be used as PA, if needed.

Writing those values to PA and PB do not preclude their use for any other purpose. Does that allay your concern?

I thought the Spin2 interpreter used one of those as a hub stack pointer.

cgracey · 2017-04-05 21:08

Here are the cog registers:

Cog Registers

addr	read		write		name
----------------------------------------------------------

000-1EF	RAM		RAM		user

1F0	RAM		RAM		user/IJMP3
1F1	RAM		RAM		user/IRET3
1F2	RAM		RAM		user/IJMP2
1F3	RAM		RAM		user/IRET2
1F4	RAM		RAM		user/IJMP1
1F5	RAM		RAM		user/IRET1
1F6	RAM		RAM		user/PA
1F7	RAM		RAM		user/PB

1F8	PTRA		RAM+PTRA	PTRA
1F9	PTRB		RAM+PTRB	PTRB
1FA	RAM		RAM+DIRA	DIRA
1FB	RAM		RAM+DIRB	DIRB
1FC	RAM		RAM+OUTA	OUTA
1FD	RAM		RAM+OUTB	OUTB
1FE	INA/RAM*	RAM*		INA/IJMP0*
1FF	INB/RAM*	RAM*		INB/IRET0*

* if enabled, debug interrupt vector

PA and PB are separate registers from PTRA and PTRB. PA and PB are just RAM registers which have special use under CALLPA, CALLPB, and LOC instructions. PTRA and PTRB can be used as automatic pointers to hub memory for the RDxxxx/WRxxxx instructions.

Is there still any concern about writing the bytecode to PA and the hub pointer to PB on a bytecode-execute? I don't think there's any conflict here. Also, it would take extra instructions to separate the byte and hub pointer if they were written to one register, together. It's very nice to have them separated, ready to use.

David Betz · 2017-04-05 21:12

Ah, okay. I didn't realize PTRA and PA were different registers.

K2 · 2017-04-05 22:04

Seairth wrote: »

Right, it's software-based thumb mode.

An apt analogy for those among us stooled to the rogue, if I may refer to ARM in such a manner.

Seairth · 2017-04-05 22:19

cgracey wrote: »

Is there still any concern about writing the bytecode to PA and the hub pointer to PB on a bytecode-execute? I don't think there's any conflict here. Also, it would take extra instructions to separate the byte and hub pointer if they were written to one register, together. It's very nice to have them separated, ready to use.

I wasn't concerned in the first place. More"thinking out loud" than anything else.

Leave it as it is.

cgracey · 2017-04-06 01:52

I've got the XBYTE thing wrapped up, I believe.

When doing a RET/_RET_ to $001F8..$001FF, XBYTE happens. That range of 8 addresses actually selects the number of LSBs from the hidden RFBYTE that will be used as an index into the LUT table to get the EXECF long:

$1F8 = RFBYTE result [7:0] --> 256 LUT entries
$1F9 = RFBYTE result [6:0] --> 128 LUT entries, 1 unused MSB
$1FA = RFBYTE result [5:0] --> 64 LUT entries, 2 unused MSBs
$1FB = RFBYTE result [4:0] --> 32 LUT entries, 3 unused MSBs
$1FC = RFBYTE result [3:0] --> 16 LUT entries, 4 unused MSBs
$1FD = RFBYTE result [2:0] --> 8 LUT entries, 5 unused MSBs
$1FE = RFBYTE result [1:0] --> 4 LUT entries, 6 unused MSBs
$1FF = RFBYTE result [0] --> 2 LUT entries, 7 unused MSBs

'_RET_ SETQ {#}D' is used to set the base of the LUT table that the RFBYTE LSBs will be used as an index for. This should be used to kick off XBYTE with the (initial) LUT base, but any snippet may end with '_RET SETQ {#}D' to change the LUT base.

Being able to set size, as Roy suggested, enables different configurations. Say you want each byte to have a 4-bit opcode field and a 4-bit operand field (PUSH #$1FC to init and SHR PA,#4 to get operand within bytecode routines). With changeable LUT base address, dynamic interpreters can be made.

Here is the prior example, but using $1F8 to set full 8-bit mode and then locating the LUT base to $100, so that the bytecode EXECF longs take the last half of the LUT:

'
' ** XBYTE Demo **
' Automatically executes bytecodes via RET/_RET_ to $1F8..$1FF.
' Overhead is 6 clocks, including _RET_ at end of bytecode routines.
'
dat		org

		setq2	#$FF		'load bytecode table into lut $100..$1FF
		rdlong	$100,#bytetable

		rdfast	#0,#bytecodes	'init fifo read at start of bytecodes

		push	#$1F8		'push $1F8 for xbyte with 8-bit lut index
	_ret_	setq	#$100		'start xbyte with lut base = $100, no stack pop
'
' Bytecode routines
'
r0	_ret_	drvn	#0		'toggle pin 0

r1	_ret_	drvn	#1		'toggle pin 1

r2	_ret_	drvn	#2		'toggle pin 2

r3	_ret_	drvn	#3		'toggle pin 3

r4		rfbyte	y		'get byte offset  |
		rfword	y		'get word offset  | one of these three
		rflong	y		'get long offset  |
		add	pb,y		'add offset  | one of these two
		sub	pb,y		'sub offset  |
	_ret_	rdfast	#0,pb		'init fifo read at new address
'
' Variables
'
x		res	1
y		res	1

		orgh
'
' Bytecodes that form program
'
bytecodes	byte	0		'toggle pin 0
		byte	1		'toggle pin 1
		byte	2		'toggle pin 2
		byte	3		'toggle pin 3
		byte	7, $-bytecodes	'reverse byte branch, loop to bytecodes
'
' Bytecode EXECF table gets moved into lut
'
bytetable	long	r0			'#0	toggle pin 0
		long	r1			'#1	toggle pin 1
		long	r2			'#2	toggle pin 2
		long	r3			'#3	toggle pin 3
		long	r4 | %0_10_110 << 10	'#4	forward byte branch
		long	r4 | %0_10_101 << 10	'#5	forward word branch
		long	r4 | %0_10_011 << 10	'#6	forward long branch
		long	r4 | %0_01_110 << 10	'#7	reverse byte branch
		long	r4 | %0_01_101 << 10	'#8	reverse word branch
		long	r4 | %0_01_011 << 10	'#9	reverse long branch

I'm going to get a new v18 release out with these changes and the SKIPF/EXECF change that adapts to hub exec.

Roy Eltham · 2017-04-06 02:30

Chip,
That is excellent! Better than I had asked for earlier with the mask.

This will make building emulators and bytecode engines really easy!

Another thing that I believe this will allow for is fast compression/decompression of data/streams/etc. Imagine a decompress that just runs code snippets via byte codes to produce the uncompressed result!

Seairth · 2017-04-06 02:39

Roy Eltham wrote: »

Another thing that I believe this will allow for is fast compression/decompression of data/streams/etc. Imagine a decompress that just runs code snippets via byte codes to produce the uncompressed result!

Great idea!

David Betz · 2017-04-06 02:45

Very nice. Can't wait to try using it to implement my ebasic VM.

Conga · 2017-04-06 11:31

cgracey wrote: »

I've got the XBYTE thing wrapped up, I believe.

When doing a RET/_RET_ to $001F8..$001FF, XBYTE happens. That range of 8 addresses actually selects the number of LSBs from the hidden RFBYTE that will be used as an index into the LUT table to get the EXECF long:

$1F8 = RFBYTE result [7:0] --> 256 LUT entries
$1F9 = RFBYTE result [6:0] --> 128 LUT entries, 1 unused MSB
$1FA = RFBYTE result [5:0] --> 64 LUT entries, 2 unused MSBs
$1FB = RFBYTE result [4:0] --> 32 LUT entries, 3 unused MSBs
$1FC = RFBYTE result [3:0] --> 16 LUT entries, 4 unused MSBs
$1FD = RFBYTE result [2:0] --> 8 LUT entries, 5 unused MSBs
$1FE = RFBYTE result [1:0] --> 4 LUT entries, 6 unused MSBs
$1FF = RFBYTE result [0] --> 2 LUT entries, 7 unused MSBs

'_RET_ SETQ {#}D' is used to set the base of the LUT table that the RFBYTE LSBs will be used as an index for. This should be used to kick off XBYTE with the (initial) LUT base, but any snippet may end with '_RET SETQ {#}D' to change the LUT base.

Fantastic!

What is the result of an attempt to execute directly a register in the 1F8--1FF range?

By "directly" I mean jumping to 1F8--1FF,
as opposed to "falling into it" by RETurning via internal call stack.

What is the result of an attempt to return via hub stack?
(RETA / RETB when the value of the hub long pointed by PTRA / PTRB happens to be in the 1F8--1FF range)

ozpropdev · 2017-04-06 13:41

Conga wrote: »

What is the result of an attempt to execute directly a register in the 1F8--1FF range?

By "directly" I mean jumping to 1F8--1FF,
as opposed to "falling into it" by RETurning via internal call stack.

What is the result of an attempt to return via hub stack?
(RETA / RETB when the value of the hub long pointed by PTRA / PTRB happens to be in the 1F8--1FF range)

Currently with V17 if registers $1F8 to $1FD contain a valid instruction and are jumped to or returned to from hub stack they execute like any other register <$1F0.
I suspect V18 will be the same.

Conga · 2017-04-07 02:36

ozpropdev wrote: »

Conga wrote: »

What is the result of an attempt to execute directly a register in the 1F8--1FF range?

By "directly" I mean jumping to 1F8--1FF,
as opposed to "falling into it" by RETurning via internal call stack.

What is the result of an attempt to return via hub stack?
(RETA / RETB when the value of the hub long pointed by PTRA / PTRB happens to be in the 1F8--1FF range)

Currently with V17 if registers $1F8 to $1FD contain a valid instruction and are jumped to or returned to from hub stack they execute like any other register <$1F0.
I suspect V18 will be the same.

Thanks for checking this!

First: *Should* these locations be executed when are jumped to or returned to from hub stack?

Second: What do you mean by "if registers [...] contain a valid instruction"?
Is there something that reports *invalid* instructions?
Does the P2 cog logic cover this?

I was not aware of an "illegal instruction" flag or similar mechanism.
I do think it's important for extensibility:
what you don't forbid/reject from the beginning can *not* be reinterpreted later,
people will rely on it doing nothing, or some field (like DDDDDDDDD) being ignored.
I commented before on this.

Turbo Bytecode Blowout - Interpret hub bytes in 8+ clocks!

Comments