Turbo Bytecode Blowout - Interpret hub bytes in 8+ clocks!

cgracey · 2017-04-07 02:54

Conga wrote: »

ozpropdev wrote: »

Conga wrote: »

What is the result of an attempt to execute directly a register in the 1F8--1FF range?

By "directly" I mean jumping to 1F8--1FF,
as opposed to "falling into it" by RETurning via internal call stack.

What is the result of an attempt to return via hub stack?
(RETA / RETB when the value of the hub long pointed by PTRA / PTRB happens to be in the 1F8--1FF range)

Currently with V17 if registers $1F8 to $1FD contain a valid instruction and are jumped to or returned to from hub stack they execute like any other register <$1F0.
I suspect V18 will be the same.

Thanks for checking this!

First: *Should* these locations be executed when they are jumped to or returned to from hub stack?

Second: What do you mean by "if registers [...] contain a valid instruction"?
Is there something that reports *invalid* instructions?
Does the P2 cog logic cover this?

I was not aware of an "illegal instruction" flag or similar mechanism.
I do think it's important for extensibility:
what you don't forbid/reject from the beginning can *not* be reinterpreted later,
people will rely on it doing nothing, or some field (like DDDDDDDDD) being ignored.
I commented before on this.

Conga, sorry I didn't respond earlier. My experience is that software bugs can cause problems in thousands of different ways. Trying to trap a few obvious things, like the PC being obviously where it shouldn't be, is likely to cover only 1/10,000th of possible problems. So, I don't see much value in adding logic to detect and report these kinds of things.

In systems with protected memory, there are many obvious errors that can be detected, but in our case, there are only a few things that would obviously be a problem, and they represent a very small fraction of likely bugs.

On the other hand, I can see the value of flagging bad behaviors that may cause hardware-compatibility problems in future versions of the chip.

msrobots · 2017-04-07 03:37

I was thinking for a while that an exception/interrupt fired if memory access is out of bounds of the available HUB memory could lead to some nice way to serve external memory.

Now I think that any external memory anyways has to be loaded in chunks, somewhere into HUB or COGs since there is no way to run XIP in external memory like Flash or Hyper-Ram. So the programmer/compiler has to check the boundaries anyway, and an interrupt/exception is not needed there.

"Illegal instruction" if a byte-code or real instruction does not exist makes even less sense. With byte codes it might happen, but for real instructions there are not really many op-codes left undefined.

So about any given long is some valid instruction, useful at that situation or not.

But this opens up a question for me: What exactly does the P2 (or the P1, for that matter) do if it hits a long that is not a defined (decodable) instruction while executing? Just skip it?

I remember you could get some interesting side effects on Z80 and 6502 using 'not defined' instructions.

I am still smiling about potatohead's idea to use the new random function to provide the values for the skip instruction. One could go further and let one Cog write the LUT of another one with random op-codes and let it execute them.

Joke aside, what exactly happens when instructions are executed not making any sense to the COG?

Enjoy!

Mike

cgracey · 2017-04-07 03:48

Mike, undefined instructions only exist in the same block as CLKSET, having only D operands. They won't do anything, though they could clear C or Z if those instruction bits were set.

msrobots · 2017-04-07 04:11

Thanks Chip, I wasn't even sure if there were any undefined ones left.

Your byte-code engine support will for sure help for all the emulators Heater will have to (re-)write now. Sometimes the good parts just fall in place, as soon as one sees it.

And I am really glad that you talk to the PropGCC guys now, to get them going also. This is really exciting to see software developing and being able to fine-adjust some things before going into production.

Like the P1 the P2 you are building is just beautiful.

Enjoy!

Mike

evanh · 2017-04-07 13:03

msrobots wrote: »

Now I think that any external memory anyways has to be loaded in chunks, somewhere into HUB or COGs since there is no way to run XIP in external memory like Flash or Hyper-Ram.

Note to anyone pondering XIP: Even if there was hardware support, XIP over a single external bus would be hobbled and awkward on a Prop simply because only one Cog at a time could be effective. Internal RAM bandwidth is very high.

Software chunky solutions like overlays are sufficient.

Conga · 2017-04-07 15:10

cgracey wrote: »

Trying to trap a few obvious things, like the PC being obviously where it shouldn't be, is likely to cover only 1/10,000th of possible problems. So, I don't see much value in adding logic to detect and report these kinds of things.

Thanks Chip,

Regarding Cog RAM locations to not execute ($1F8 to $1FF), you convinced me.
There's little value in preventing execution there.

jmg · 2017-04-07 20:28

evanh wrote: »

msrobots wrote: »

Now I think that any external memory anyways has to be loaded in chunks, somewhere into HUB or COGs since there is no way to run XIP in external memory like Flash or Hyper-Ram.

Note to anyone pondering XIP: Even if there was hardware support, XIP over a single external bus would be hobbled and awkward on a Prop simply because only one Cog at a time could be effective. Internal RAM bandwidth is very high.

Software chunky solutions like overlays are sufficient.

Err, yes, a speed reduction is rather self-evident.

The P2 already has a Speed/Location/Size continuum, XIP is another locus on that.

COG code is always fastest, but has the lowest code ceiling.
HUB code slower, but with a higher code ceiling.
Byte-Codes are slower again, but allow more in the same memory.
Code in External memory pretty much removes any code ceiling, but is slower again.

User projects will decide what mix of all of this they use.

Skipping any check of how the P2 operates with off-chip memory seems a good way to ensure the outcome is "hobbled and awkward".

I've seen recent announcements of MCUs with XIP hardware that has Bus managers and 32k Cache all bundled.

Of course, P2 does not need to push to those limits, but talking to external memory should not be ignored/dismissed as a pure software problem 'to be solved later'.

Bill Henning · 2017-04-08 00:21

Chip,

This XBYTE is absolutely brilliant!

It makes emulating other instruction sets MUCH easier, especially with saving the original byte code and its address.

Re/ the masking - I know why you did it, easier to do the lookup - however most byte codes I've played with would have unused LSB's, not MSB's, ditto for most 8 bit instruction sets.

cgracey wrote: »
I've got the XBYTE thing wrapped up, I believe.

When doing a RET/_RET_ to $001F8..$001FF, XBYTE happens. That range of 8 addresses actually selects the number of LSBs from the hidden RFBYTE that will be used as an index into the LUT table to get the EXECF long:

$1F8 = RFBYTE result [7:0] --> 256 LUT entries
$1F9 = RFBYTE result [6:0] --> 128 LUT entries, 1 unused MSB
$1FA = RFBYTE result [5:0] --> 64 LUT entries, 2 unused MSBs
$1FB = RFBYTE result [4:0] --> 32 LUT entries, 3 unused MSBs
$1FC = RFBYTE result [3:0] --> 16 LUT entries, 4 unused MSBs
$1FD = RFBYTE result [2:0] --> 8 LUT entries, 5 unused MSBs
$1FE = RFBYTE result [1:0] --> 4 LUT entries, 6 unused MSBs
$1FF = RFBYTE result [0] --> 2 LUT entries, 7 unused MSBs

'_RET_ SETQ {#}D' is used to set the base of the LUT table that the RFBYTE LSBs will be used as an index for. This should be used to kick off XBYTE with the (initial) LUT base, but any snippet may end with '_RET_ SETQ {#}D' to change the LUT base.

Being able to set size, as Roy suggested, enables different configurations. Say you want each byte to have a 4-bit opcode field and a 4-bit operand field (PUSH #$1FC to init and SHR PA,#4 to get operand within bytecode routines). With changeable LUT base address, dynamic interpreters can be made.

Here is the prior example, but using $1F8 to set full 8-bit mode and then locating the LUT base to $100, so that the bytecode EXECF longs take the last half of the LUT:
'
' ** XBYTE Demo **
' Automatically executes bytecodes via RET/_RET_ to $1F8..$1FF.
' Overhead is 6 clocks, including _RET_ at end of bytecode routines.
'
dat		org

		setq2	#$FF		'load bytecode table into lut $100..$1FF
		rdlong	$100,#bytetable

		rdfast	#0,#bytecodes	'init fifo read at start of bytecodes

		push	#$1F8		'push $1F8 for xbyte with 8-bit lut index
	_ret_	setq	#$100		'start xbyte with lut base = $100, no stack pop
'
' Bytecode routines
'
r0	_ret_	drvn	#0		'toggle pin 0

r1	_ret_	drvn	#1		'toggle pin 1

r2	_ret_	drvn	#2		'toggle pin 2

r3	_ret_	drvn	#3		'toggle pin 3

r4		rfbyte	y		'get byte offset  |
		rfword	y		'get word offset  | one of these three
		rflong	y		'get long offset  |
		add	pb,y		'add offset  | one of these two
		sub	pb,y		'sub offset  |
	_ret_	rdfast	#0,pb		'init fifo read at new address
'
' Variables
'
x		res	1
y		res	1

		orgh
'
' Bytecodes that form program
'
bytecodes	byte	0		'toggle pin 0
		byte	1		'toggle pin 1
		byte	2		'toggle pin 2
		byte	3		'toggle pin 3
		byte	7, $-bytecodes	'reverse byte branch, loop to bytecodes
'
' Bytecode EXECF table gets moved into lut
'
bytetable	long	r0			'#0	toggle pin 0
		long	r1			'#1	toggle pin 1
		long	r2			'#2	toggle pin 2
		long	r3			'#3	toggle pin 3
		long	r4 | %0_10_110 << 10	'#4	forward byte branch
		long	r4 | %0_10_101 << 10	'#5	forward word branch
		long	r4 | %0_10_011 << 10	'#6	forward long branch
		long	r4 | %0_01_110 << 10	'#7	reverse byte branch
		long	r4 | %0_01_101 << 10	'#8	reverse word branch
		long	r4 | %0_01_011 << 10	'#9	reverse long branch
I'm going to get a new v18 release out with these changes and the SKIPF/EXECF change that adapts to hub exec.
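The LSB-indexed lookup described in the quoted post can be modeled in a few lines of Python. This is only a behavioral sketch of the table above, not the hardware: the function names are made up for illustration, and combining the base and index by addition with wrap-around inside the 512-long LUT is an assumption.

```python
# Behavioral model of the LSB-indexed XBYTE lookup from the quoted post.
# Names are illustrative only; they are not hardware signals or assembler symbols.

def xbyte_index_bits(ret_addr: int) -> int:
    """Number of bytecode LSBs used as the LUT index for a given return address."""
    if not 0x1F8 <= ret_addr <= 0x1FF:
        raise ValueError("XBYTE only triggers on returns to $1F8..$1FF")
    return 8 - (ret_addr - 0x1F8)      # $1F8 -> 8 bits ... $1FF -> 1 bit

def xbyte_lut_address(ret_addr: int, lut_base: int, bytecode: int) -> int:
    """LUT address of the EXECF long: table base plus the selected LSBs."""
    bits = xbyte_index_bits(ret_addr)
    index = bytecode & ((1 << bits) - 1)
    return (lut_base + index) & 0x1FF  # assumption: base + index, wrapping in the LUT

# 4-bit opcode / 4-bit operand split (PUSH #$1FC, then SHR PA,#4 in the routine)
byte = 0xA3
opcode_slot = xbyte_lut_address(0x1FC, 0x100, byte)  # low nibble selects the entry
operand = byte >> 4                                  # high nibble is the operand
```

With `$1FC` as the return address, the low nibble of `$A3` selects LUT entry `$103` and the high nibble `$A` is left over as the operand.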

Roy Eltham · 2017-04-08 00:42

Chip,
Bill is right, it would be better in most cases if the unused bits (data) in the bytecodes were the LSBs, not the MSBs.

The 6502 is an oddball where its "opcode" is defined by the 2 lowest bits and the 3 highest bits; the 3 middle bits are the addressing mode. However, many of its instructions would be best handled by having the 3 MSBs be the bytecode and the lower 5 bits be the data/unused.

Other bytecode setups I have seen often have the lowest bits being some form of index or offset, so they would prefer the MSBs being the code and the LSBs being the data.

jmg · 2017-04-08 01:10

Roy Eltham wrote: »

Other bytecode setups I have seen often have the lowest bits being some form of index or offset, so they would prefer the MSBs being the code and the LSBs being the data.

CIL and Java have the small constants and groups, all adjacent coded, so that means LSBs set the index.
However, pushing the bytecode itself to higher bits, makes the tables more sparse, and larger ?

cgracey · 2017-04-08 01:40

Ok. So, it would be good to use the MSB's and not the LSB's as the index. I'll make that change before doing all the compiles.

cgracey · 2017-04-08 01:47

cgracey wrote: »

Ok. So, it would be good to use the MSB's and not the LSB's as the index. I'll make that change before doing all the compiles.

Done. Compile underway.
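For comparison with the earlier LSB table, here is a small Python sketch of what MSB-based indexing would look like. The thread does not spell out the new mapping, so this assumes each return address keeps the same index width as before and that the index is simply the top bits shifted down; the function name is hypothetical.

```python
# Sketch of MSB-based XBYTE indexing. Assumptions (not confirmed in the thread):
# $1F8..$1FF keep the same index widths (8 down to 1 bits), and the index is the
# top bits of the bytecode shifted down, leaving the low bits as data.

def xbyte_split_msb(ret_addr: int, bytecode: int) -> tuple[int, int]:
    """Return (lut_index, leftover_data) for a bytecode under MSB indexing."""
    if not 0x1F8 <= ret_addr <= 0x1FF:
        raise ValueError("XBYTE only triggers on returns to $1F8..$1FF")
    index_bits = 8 - (ret_addr - 0x1F8)
    data_bits = 8 - index_bits
    index = bytecode >> data_bits
    data = bytecode & ((1 << data_bits) - 1)
    return index, data

# 5-bit opcode in the MSBs, 3-bit operand in the LSBs
index, data = xbyte_split_msb(0x1FB, 0b10110_011)
```

Under these assumptions the bytecode `%10110_011` returned through `$1FB` selects table entry 22 and leaves 3 as the embedded operand.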

cgracey · 2017-04-08 01:49

jmg wrote: »

Roy Eltham wrote: »

Other bytecode setups I have seen often have the lowest bits being some form of index or offset, so they would prefer the MSBs being the code and the LSBs being the data.

CIL and Java have the small constants and groups, all adjacent coded, so that means LSBs set the index.
However, pushing the bytecode itself to higher bits, makes the tables more sparse, and larger ?

If you don't care about trying to compress the bytecode table, you can always just use the whole byte as the index.

jmg · 2017-04-08 01:57

cgracey wrote: »

If you don't care about trying to compress the bytecode table, you can always just use the whole byte as the index.

True enough, there is no law saying some cannot point to the same routine, so you take savings in the destination area.
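A minimal Python sketch of that idea: index with the whole byte and let several table entries share one routine. The handler names and bytecode ranges below are invented for illustration and do not come from any real bytecode set.

```python
# Full-byte indexing with aliased entries: the table always has 256 slots,
# but many slots name the same routine, so the routine area stays small.
# Handler names and code ranges are hypothetical.

def build_table(groups: dict, default: str) -> list:
    """groups maps a handler name to the bytecode values that should reach it."""
    table = [default] * 256
    for handler, codes in groups.items():
        for code in codes:
            table[code] = handler
    return table

table = build_table(
    {"push_small": range(0x10, 0x18),   # eight adjacent codes, one routine
     "branch": [0x20, 0x21]},
    default="nop",
)
distinct_routines = sorted(set(table))
```

The shared routine can still recover which alias was used from the saved bytecode, for example by masking its low bits.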

msrobots · 2017-04-08 02:53

The ** XBYTE Demo ** showed how small an interpreter can be. Very nice example.

Even I can understand that just by reading. Since Spin2 will not be in ROM we will have variations of it pretty soon.

I was thinking about the user side. If using Spin2, your first program needs the interpreter built in. Therefore all Spin2 programs will include their own byte code engine.

And soon also slightly different ones, since we all will play with that option.

I don't remember who exactly came up with that idea, but Roy said a 'modular' compiler, just putting in the byte-codes that are used, would not be a big problem from his standpoint.

Because 512k sounds big now, compared to 32k, but it isn't.

We all fought with reloading cog images from eeprom to be able to use the HUB occupied by the cog-image for buffers or what not.

Now that Spin2 and C/C++ are in flux, could we maybe avoid the problems between C/C++ and Spin by planning from the beginning to support some interoperability, so that, for example, it is not so painful to use some PASM from a Spin driver in C by converting it into a blob or gas?

Even co-existence of some Spin2Cog and some C program running in whatever mode on the same chip but in different cogs is doable and I think a worthy goal to explore.

and

$1F8 = RFBYTE result [7:0] --> 256 LUT entries
$1F9 = RFBYTE result [7:x] --> 128 LUT entries, 1 unused LSB
$1FA = RFBYTE result [7:x] --> 64 LUT entries, 2 unused LSBs
$1FB = RFBYTE result [7:x] --> 32 LUT entries, 3 unused LSBs
$1FC = RFBYTE result [7:x] --> 16 LUT entries, 4 unused LSBs

$1FD = RFBYTE result [6:0] --> 128 LUT entries, 1 unused MSB
$1FE = RFBYTE result [5:0] --> 64 LUT entries, 2 unused MSBs
$1FF = RFBYTE result [4:0] --> 32 LUT entries, 3 unused MSBs

could give us both ends of the byte, and who needs 8 or fewer byte codes?

Enjoy!

Mike
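The mixed scheme proposed in this post can be written out as a bit-field table. The Python sketch below is only a model of the proposal: the `x` in `[7:x]` is filled in from the stated entry counts (128 entries implies `[7:1]`, and so on), which is an inference, and the names are hypothetical.

```python
# Model of the proposed mixed layout: $1F8..$1FC index with the MSBs,
# $1FD..$1FF index with the LSBs. The low bit of each [7:x] field is
# inferred from the entry counts given in the post.

PROPOSED_FIELDS = {
    0x1F8: (7, 0),  # 256 entries
    0x1F9: (7, 1),  # 128 entries, 1 unused LSB
    0x1FA: (7, 2),  #  64 entries, 2 unused LSBs
    0x1FB: (7, 3),  #  32 entries, 3 unused LSBs
    0x1FC: (7, 4),  #  16 entries, 4 unused LSBs
    0x1FD: (6, 0),  # 128 entries, 1 unused MSB
    0x1FE: (5, 0),  #  64 entries, 2 unused MSBs
    0x1FF: (4, 0),  #  32 entries, 3 unused MSBs
}

def table_entries(ret_addr: int) -> int:
    hi, lo = PROPOSED_FIELDS[ret_addr]
    return 1 << (hi - lo + 1)

def proposed_index(ret_addr: int, bytecode: int) -> int:
    """Extract bits [hi:lo] of the bytecode as the LUT index."""
    hi, lo = PROPOSED_FIELDS[ret_addr]
    return (bytecode >> lo) & ((1 << (hi - lo + 1)) - 1)
```

For the byte `$A3`, `$1FC` would index with the high nibble (`$A`) while `$1FF` would index with the low five bits (`$03`).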

Roy Eltham · 2017-04-08 03:37

msrobots,
I certainly see uses for 5-7 unused LSBs. If it was possible to have a setting to decide which way they are arranged (MSBs or LSBs for unused) that would be ideal, but I dunno if it can fit and be clean.

evanh · 2017-04-08 04:31

jmg wrote: »

I've seen recent announcements of MCUs with XIP hardware that has Bus managers and 32k Cache all bundled.

Of course, P2 does not need to push to those limits, but talking to external memory should not be ignored/dismissed as a pure software problem 'to be solved later'.

That's exactly what the Prop2 would need, if done in hardware, to manage concurrent execution. It's a big ask.

potatohead · 2017-04-08 04:48

Roy Eltham wrote: »

Chip,
Bill is right, it would be better in most cases if the unused bits (data) in the bytecodes were the LSBs, not the MSBs.

The 6502 is an oddball where its "opcode" is defined by the 2 lowest bits and the 3 highest bits; the 3 middle bits are the addressing mode. However, many of its instructions would be best handled by having the 3 MSBs be the bytecode and the lower 5 bits be the data/unused.

Other bytecode setups I have seen often have the lowest bits being some form of index or offset, so they would prefer the MSBs being the code and the LSBs being the data.

I third this. Great suggestion.

potatohead · 2017-04-08 04:49

cgracey wrote: »

cgracey wrote: »

Ok. So, it would be good to use the MSB's and not the LSB's as the index. I'll make that change before doing all the compiles.

Done. Compile underway.

Catching up... cool. NVM my earlier comment.

potatohead · 2017-04-08 04:51

evanh wrote: »

jmg wrote: »

I've seen recent announcements of MCUs with XIP hardware that has Bus managers and 32k Cache all bundled.

Of course, P2 does not need to push to those limits, but talking to external memory should not be ignored/dismissed as a pure software problem 'to be solved later'.

That's exactly what the Prop2 would need, if done in hardware, to manage concurrent execution. It's a big ask.

Can't we use the events to do this with a data fetch and queue COG or two? Not as fast as hardware, but a lot should be possible.

jmg · 2017-04-08 04:59

potatohead wrote: »

Can't we use the events to do this with a data fetch and queue COG or two? Not as fast as hardware, but a lot should be possible.

Yes, I envision a mix of HW (eg Streamer, (DDR ?) ) plus COG to manage external memory.
Just how compact this code can be is TBD, but it may be small enough to make a dedicated COG optional (e.g. for when speed is highest priority).
I think Chip has said the streamer can manage N-byte or nibble bursts, so appears you could configure one burst for Header.Address+dummy and a following burst for data IO.

The point I was making is, this needs to be tested sooner, rather than treated as a software only problem 'for later'.

For an example of what HW problems might be lurking, see this thread
http://forums.parallax.com/discussion/comment/1405989/#Comment_1405989
To me this shows how same-edge sampling can have issues by the time it travels thru routing and IO cells.

msrobots · 2017-04-08 05:16

I hope with a shared LUT one cog could produce code for the second cog to run.

For sure interesting to try that.

Enjoy!

Mike

Roy Eltham · 2017-04-08 05:55

I imagine some interesting things with one cog decompressing a bitstream that feeds another cog running either a larger byte/wordcode or even hubexec PASM.
So, hubexec PASM speed/flexibility, but with compactness of bytecode.

cgracey · 2017-04-08 06:00

msrobots wrote: »

...
$1F8 = RFBYTE result [7:0] --> 256 LUT entries
$1F9 = RFBYTE result [7:x] --> 128 LUT entries, 1 unused LSB
$1FA = RFBYTE result [7:x] --> 64 LUT entries, 2 unused LSBs
$1FB = RFBYTE result [7:x] --> 32 LUT entries, 3 unused LSBs
$1FC = RFBYTE result [7:x] --> 16 LUT entries, 4 unused LSBs

$1FD = RFBYTE result [6:0] --> 128 LUT entries, 1 unused MSB
$1FE = RFBYTE result [5:0] --> 64 LUT entries, 2 unused MSBs
$1FF = RFBYTE result [4:0] --> 32 LUT entries, 3 unused MSBs

could give us both ends of the byte, and who needs 8 or fewer byte codes?

What about a case where you have 4 bytecodes, but you can change the table base around, like a state machine. For 4 bytecodes, you could have 64 different states. You could even have 3 dimensions of 4 states (4 x 4 x 4 = 64 sets of a 4-bytecode language).

Or imagine a 2-bytecode language where you have seven bits of state (7 bits = 128 sets of a 2-bytecode language). A snippet could toggle bit7..bit1 of the SETQ value to enter a new state, thereby selecting new 0/1 table entries.
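The state-machine idea can be sketched in Python as plain table arithmetic. This assumes the LUT slot is the table base plus the bytecode index, with each state's table placed back to back; the function names are illustrative only.

```python
# Small-language state machine: the SETQ base selects which table is active,
# and the bytecode index selects the entry inside it.
# Assumption: slot = base + index, with tables packed contiguously.

def lut_slot(state: int, index: int, entries: int) -> int:
    """LUT slot for bytecode `index` while in `state`, with `entries` per table."""
    if not 0 <= index < entries:
        raise ValueError("index out of range for this language size")
    return state * entries + index

def pack_state(a: int, b: int, c: int) -> int:
    """Three independent 4-way dimensions packed into one 6-bit state (4 x 4 x 4)."""
    return (a << 4) | (b << 2) | c

# 4 bytecodes x 64 states and 2 bytecodes x 128 states both fill 256 slots
four_code_slots = {lut_slot(s, i, 4) for s in range(64) for i in range(4)}
two_code_slots = {lut_slot(s, i, 2) for s in range(128) for i in range(2)}
```

In the 2-bytecode case the state occupies bits 7..1 of the slot number, which matches the description of toggling bit7..bit1 of the SETQ value.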

cgracey · 2017-04-08 06:10

I just tested out the MSB-based indexing for XBYTE and it works fine. Also, I checked out the AUGS+RDxxxx/WRxxxx and it works, too. I've got the big Prop123_A9 compile done, so now I'll start the BeMicro_A9 compile. Once that's going, I need to work out some syntax for AUGS+RDxxxx/WRxxxx.

cgracey · 2017-04-08 06:22

One thing I forgot to mention:

When you do a 'RDFAST #blocks,#address', the 'blocks' value selects how many sets of 64 bytes (16 longs, really) will be read before automatically looping back to the first byte. Your first byte needs to be long-aligned for this to work, but it will seamlessly wrap around when the last byte is read. This means that if you wrote a bytecode program which ran from beginning to end, automatically looping, with consistently-timed snippets, you would have deterministic timing.
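The wrap-around can be modeled as simple modular arithmetic. The Python sketch below follows the description above; treating a block count of 0 as "no wrapping" is an assumption based on how the demo uses `RDFAST #0`, and the function name is hypothetical.

```python
# Model of RDFAST block wrapping: after blocks * 64 bytes the FIFO loops
# back to the first byte. Assumption: blocks == 0 means no wrapping.

def fifo_address(start: int, blocks: int, n: int) -> int:
    """Hub address of the n-th byte (0-based) delivered after RDFAST #blocks,#start."""
    if blocks == 0:
        return start + n                 # assumed: plain linear read
    if start % 4:
        raise ValueError("start must be long-aligned for wrapping to work")
    return start + n % (blocks * 64)

# A 2-block (128-byte) bytecode loop starting at $1000
addresses = [fifo_address(0x1000, 2, n) for n in (0, 127, 128, 130)]
```

If every snippet takes a fixed number of clocks, the time for one pass over the looped region is just the sum of the per-bytecode costs, which is what makes the timing deterministic.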

evanh · 2017-04-08 07:44

potatohead wrote: »

Can't we use the events to do this with a data fetch and queue COG or two? Not as fast as hardware, but a lot should be possible.

Ya, that's why I've been saying no need to ask for XIP. Doing it in hardware without a large cache per Cog would be terrible at best. And obviously, with the Prop, a large set of caches means the end of HubRAM, which is not the same objective any longer.

ozpropdev · 2017-04-08 07:54

cgracey wrote: »

Once that's going, I need to work out some syntax for AUGS+RDxxxx/WRxxxx.

Current format is

	rdlong	myreg,ptra[imm_offset]

change to

	rdlong	myreg,ptra[#imm_offset]	'offset is always an immediate value

This makes the AUGx syntax the same for all cases.

	rdlong	myreg,ptra[##imm_offset]   '20 bit offset

cgracey · 2017-04-08 08:04

ozpropdev wrote: »
cgracey wrote: »

Once that's going, I need to work out some syntax for AUGS+RDxxxx/WRxxxx.

Current format is
	rdlong	myreg,ptra[imm_offset]
change to
	rdlong	myreg,ptra[#imm_offset]	'offset is always an immediate value
This makes the AUGx syntax the same for all cases.
	rdlong	myreg,ptra[##imm_offset]   '20 bit offset

I was thinking the exact same thing for the AUGS version, but leaving the 5-bit index as is, without needing a "#". Remember that there's a behavior difference, too, with the normal 5-bit index getting scaled, while the AUGS 20-bit index is not scaled. Therefore, I think that just ## for AUGS would be appropriate, as it would signal not just a bigger index, but something different, as well.
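The behavior difference between the two forms comes down to whether the index is scaled. The Python sketch below assumes the short index is scaled by the access size (1, 2 or 4 bytes), which the post implies but does not state outright; the names are illustrative.

```python
# Scaled short index versus unscaled AUGS (##) index.
# Assumption: the short index is multiplied by the access size in bytes.

ACCESS_SIZE = {"byte": 1, "word": 2, "long": 4}

def effective_offset(index: int, access: str, augs: bool = False) -> int:
    """Byte offset applied to PTRA/PTRB for a given index form."""
    if augs:
        return index                     # ptra[##index]: used as-is, in bytes
    return index * ACCESS_SIZE[access]   # ptra[index]: scaled by access size

short_form = effective_offset(3, "long")             # 3 longs ahead
augs_form = effective_offset(3, "long", augs=True)   # 3 bytes ahead
```

The same numeric index therefore lands on different addresses in the two forms, which supports using a distinct `##` spelling for the AUGS version.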

cgracey · 2017-04-08 10:13

I got the assembler working with the new syntax for AUGS+RDxxxx/WRxxxx-PTRA/PTRB.

Here is a test program that jumps PTRA/PTRB around by $1000:

dat	org

	not	dira

	rdbyte	outa,++ptra[##$1000]
	rdbyte	outa,++ptra[##$1000]
	rdbyte	outa,++ptra[##$1000]
	rdbyte	outa,ptra[##$1000]
	rdbyte	outa,ptra
	rdbyte	outa,--ptra[##$1000]
	rdbyte	outa,ptra++[##$1000]
	rdbyte	outa,ptra--[##$1000]
	rdbyte	outa,ptra--[##$1000]
	rdbyte	outa,ptra--[##$1000]
	wrlong	##$7FFF_FFFE,ptrb[##$4000]
	rdlong	outa,ptrb[##$4000]

	rdbyte	outa,##$1000
	rdbyte	outa,##$2000
	rdbyte	outa,##$3000
	rdlong	outa,##$4000

	jmp	#$


	orgh	$1000
	byte	$01

	orgh	$2000
	byte	$02

	orgh	$3000
	byte	$04

	orgh	$4000
	byte	$08
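To follow the PTRA part of the test program by hand, the pointer updates can be traced in Python. This is a sketch under stated assumptions: PTRA starts at 0, pre-modify forms update the pointer before the access, post-modify forms update it afterwards, and the plain indexed form leaves the pointer unchanged. The mode names are invented for the model.

```python
# Trace of the PTRA addressing modes used in the test program.
# Assumptions: PTRA starts at 0; "pre" forms modify then access,
# "post" forms access then modify, "idx" accesses ptr + offset only.

def trace(ops, ptr=0):
    """Return (list of accessed addresses, final pointer value)."""
    addresses = []
    for mode, offset in ops:
        if mode == "pre+":
            ptr += offset
            addr = ptr
        elif mode == "pre-":
            ptr -= offset
            addr = ptr
        elif mode == "post+":
            addr = ptr
            ptr += offset
        elif mode == "post-":
            addr = ptr
            ptr -= offset
        elif mode == "idx":
            addr = ptr + offset
        else:                      # "none": plain ptra
            addr = ptr
        addresses.append(addr)
    return addresses, ptr

# The ten RDBYTE lines that use PTRA, in program order
program = [("pre+", 0x1000)] * 3 + [("idx", 0x1000), ("none", 0),
           ("pre-", 0x1000), ("post+", 0x1000)] + [("post-", 0x1000)] * 3
accessed, final_ptr = trace(program)
```

Under these assumptions the reads touch `$1000`, `$2000`, `$3000`, `$4000`, `$3000`, `$2000`, `$2000`, `$3000`, `$2000`, `$1000`, and PTRA ends back at 0.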
