
Turbo Bytecode Blowout - Interpret hub bytes in 8+ clocks!

cgracey Posts: 14,133
edited 2017-04-06 03:20 in Propeller 2
I had an idea over the weekend. No, there is not some new instruction. It's more devious than that. And it's already implemented and tested.

We can now interpret bytecodes from hub with only 6 clocks of overhead, including the _RET_ at the end of the bytecode routine. This means that minimal bytecode routines, which consist of only one 2-clock instruction with a _RET_, can execute at the pace of 8 clocks per bytecode, with bytecodes being fetched from hub and the bytecode routines located in the cog and LUT.

Here's how it works:

If a RET/_RET_ occurs when the stack holds $001F8 or $001F9 (that's PTRA or PTRB, which you'd never want to actually RETurn to), the stack is not popped and a special sequence begins that does a hidden RFBYTE to fetch the next bytecode, a hidden RDLUT to look up its table entry, and a hidden EXECF to branch to its routine with a SKIPF pattern initiated.

Here is the sequence, starting from the last clock of the last instruction in the current bytecode routine (or the instruction that kicks off the process):
clock	phase	hidden		description
-------------------------------------------------------------------------------------------------------
1	go	RFBYTE		last clock of instruction which is executing a RET/_RET_ to $1F8/$1F9

2	get	RDLUT		1st clock of 1st cancelled instruction
3	go	LUT --> D	2nd clock of 1st cancelled instruction
4	get	EXECF (D = LUT)	1st clock of 2nd cancelled instruction
5	go	EXECF branch	2nd clock of 2nd cancelled instruction
6	get	flush pipe	1st clock of 3rd cancelled instruction
7	go	flush pipe	2nd clock of 3rd cancelled instruction

8	get			1st clock of 1st instruction of bytecode routine, clock 1 next if _RET_


If you look at this program for 30 seconds, you'll see how it's working:
'
' ** XBYTE Demo **
' Automatically executes bytecodes via RET/_RET_ to $1F8.
' Overhead is 6 clocks, including _RET_ at end of routines.
'
dat		org

		setq2	#$FF		'load bytecode table into lut $000..$0FF
		rdlong	0,#bytetable

		rdfast	#0,#bytecodes	'init fifo read at start of bytecodes

	_ret_	push	#$1F8		'start xbyte, no stack pop if $1F8/$1F9
'
' Bytecode routines
'
r0	_ret_	drvn	#0		'toggle pin 0

r1	_ret_	drvn	#1		'toggle pin 1

r2	_ret_	drvn	#2		'toggle pin 2

r3	_ret_	drvn	#3		'toggle pin 3

r4		getptr	x		'get current rdfast address
		rfbyte	y		'get byte offset  |
		rfword	y		'get word offset  | one of these three
		rflong	y		'get long offset  |
		add	x,y		'add offset  | one of these two
		sub	x,y		'sub offset  |
	_ret_	rdfast	#0,x		'init fifo read at new address
'
' Variables
'
x		res	1
y		res	1

		orgh
'
' Bytecodes that form program
'
bytecodes	byte	0		'toggle pin 0
		byte	1		'toggle pin 1
		byte	2		'toggle pin 2
		byte	3		'toggle pin 3
		byte	7, $-bytecodes	'reverse byte branch, loop to bytecodes
'
' Bytecode EXECF table gets moved into lut
'
bytetable	long	r0			'#0	toggle pin 0
		long	r1			'#1	toggle pin 1
		long	r2			'#2	toggle pin 2
		long	r3			'#3	toggle pin 3
		long	r4 | %0_10_110_0 << 10	'#4	forward byte branch
		long	r4 | %0_10_101_0 << 10	'#5	forward word branch
		long	r4 | %0_10_011_0 << 10	'#6	forward long branch
		long	r4 | %0_01_110_0 << 10	'#7	reverse byte branch
		long	r4 | %0_01_101_0 << 10	'#8	reverse word branch
		long	r4 | %0_01_011_0 << 10	'#9	reverse long branch
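'
' (Annotation, not part of Chip's post: each entry above is a routine address OR'd
'  with a skip pattern shifted left 10 bits. As I read the table, the skip bits apply
'  LSB-first, one per instruction of r4, with 1 = skip, so each branch bytecode runs
'  getptr, one of the rfbyte/rfword/rflong reads, add or sub, and finally the rdfast.)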

And here is a scope picture of it executing. You can see that those pin toggles are occurring every 100ns, or 8 clocks at 80MHz. The branch takes longer, since it must reinitialize the FIFO read:


[Attached image: XBYTE.jpg]


If you use $1F8 as the RETurn address, the first half of the LUT will be used as the EXECF table. If you use $1F9, the second half of the LUT will be used. This means you can shift gears any time by doing either a '_RET_ PUSH #$1F8' or a '_RET_ PUSH #$1F9' to switch between two different tables.
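For example, here is a minimal, untested sketch of driving both halves (tableA, tableB and gear_up are illustrative names, not from Chip's post; it assumes SETQ2+RDLONG loads the LUT starting at the LUT address given in the RDLONG's D field, as the demo above does, and it follows the demo's immediate-address style):

		setq2	#$FF		'load table A into lut $000..$0FF
		rdlong	0,#tableA
		setq2	#$FF		'load table B into lut $100..$1FF
		rdlong	$100,#tableB

	_ret_	push	#$1F8		'start xbyte using table A
'
' a bytecode routine that shifts gears: the next bytecode dispatches via table B
'
gear_up	_ret_	push	#$1F9		'same mechanism, upper half of lut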

This XBYTE mechanism is going to make Spin2 really fast.

Comments

  • Chip,
    You are on a roll, with these clever things to make bytecode execution super fast. It's almost to the point of having a bytecode engine in hardware, except better because it's fully customizable to whatever bytecode you want.
  • cgracey Posts: 14,133
    Roy Eltham wrote: »
    Chip,
    You are on a roll, with these clever things to make bytecode execution super fast. It's almost to the point of having a bytecode engine in hardware, except better because it's fully customizable to whatever bytecode you want.

    Yes, this is going to make it really easy to build bytecode interpreters. It automates the bare-bones behavior and you just write snippets to give bytes their personalities. For all those guys that like to make Z-80 and 6502 emulators...
  • Heater. Posts: 21,230
    You have totally lost me there Chip.

    But, what?!

    "...those pin toggles are occurring every 100ns, or 8 clocks at 80MHz."

    The RISC V core I'm playing with takes 4 clocks to execute just one of its native code instructions. Loads and stores are a bit more.



  • cgracey Posts: 14,133
    Heater. wrote: »
    You have totally lost me there Chip.

    But, what?!

    "...those pin toggles are occurring every 100ns, or 8 clocks at 80MHz."

    The RISC V core I'm playing with takes 4 clocks to execute just one of its native code instructions. Loads and stores are a bit more.



    This XBYTE mechanism can read a byte from hub, lookup that byte's routine address and skip pattern in the LUT, and start executing it, all with a 6-clock overhead. For routines which are just a 2-clock instruction with a _RET_, the whole affair takes 8 clocks per byte. You get to write the routines for each byte. Very simple.
  • jmg Posts: 15,144
    cgracey wrote: »
    This XBYTE mechanism is going to make Spin2 really fast.
    cgracey wrote: »
    This XBYTE mechanism can read a byte from hub, lookup that byte's routine address and skip pattern in the LUT, and start executing it, all with a 6-clock overhead. For routines which are just a 2-clock instruction with a _RET_, the whole affair takes 8 clocks per byte.
    ..
    this is going to make it really easy to build bytecode interpreters.
    Very nifty, impressive speed....

    Did you look at CIL bytecodes using this engine yet ?
    https://en.wikipedia.org/wiki/List_of_CIL_instructions
    cgracey wrote: »
    I had an idea over the weekend. No, there is not some new instruction. It's more devious than that. And it's already implemented and tested.
    Was this a hardware change, or pure software optimizing ?
    cgracey wrote: »
    You can see that those pin toggles are occurring every 100ns, or 8 clocks at 80MHz. The branch takes longer, since it must reinitialize the FIFO read:
    That FIFO phase looks to be ~375 ns, or approx. 30 SysCLKs?

    HyperRAM or even QuadSPI.DDR could deliver high byte-code feeding speeds, in a straight line.
    80MHz DDR would be able to do up to 160MBytes/sec, burst.
  • It's brilliant.

    I can see how nice it is to be able to quickly define a meaning for an "I/O list" (really an "I/O program"), i.e. a custom mini-language.

    But to compensate for the level of "magic" in this, maybe it would be useful to block actual attempts to execute *any* address in the $1F8..$1FF range.

    And in general, to trap (halt cog?) any attempt to execute illegal instructions.
    Example: RETA with DDDDDDDDD != 000000000.

    This could be important for extensibility: what you don't forbid/reject from the beginning cannot be reinterpreted later, because people will come to rely on it doing nothing, or, in the above example, on the DDDDDDDDD value being ignored.

    The Linux kernel developers learnt the lesson the hard way: the unused/reserved flags could not be used/re-purposed later if they were not rejected from the beginning.
    They had to define new system calls, even if the old syscall would have been perfectly fine (had all needed parameters, no addition necessary).
    User programs started to rely on having the kernel ignore "junk" values in the unused/reserved flags,
    so it became unsafe to give a meaning to an unused flag since the new meaning/effect might be triggered by unsuspecting user programs.

    Rejecting illegal instructions seems important for any attempt to discuss IEC60730
    (I don't know what it actually requires, but intuitively detecting and rejecting illegal instructions should be part of it).
  • This sounds very interesting. It's like being able to code microcode for a bytecode interpreter.

    Is there a way for the executing routine to access the original byte that was automatically fetched? Bytecodes sometimes have data encoded in them, and it would be more space efficient to read that data right out of the bytecode rather than having to encode different versions of the instruction for each variant of data.

    Also, loath as I am to suggest anything that would delay the chip, it seems like it would be prudent to implement at least one bytecode interpreter whose instruction set was not designed with this mechanism in mind, just to see what (if any) gotchas arise. The JVM or CIL are probably way too complicated for that, but perhaps you could try out the ZPU bytecode? @Heater had that running on P1, and it has a GCC port (so it would be a quick way to get C up on the P2).
  • ersmith wrote: »
    Is there a way for the executing routine to access the original byte that was automatically fetched? Bytecodes sometimes have data encoded in them, and it would be more space efficient to read that data right out of the bytecode rather than having to encode different versions of the instruction for each variant of data.

    You could conceivably embed data in the bytecode stream itself. Since he's using RFxxx, you could simply call RFBYTE (etc.). In the case where the embedded data is optional, no problem: you have two different byte codes that jump to the same routine, but with two different skip tables (one that calls RFBYTE, and the other that doesn't).

    Actually, I think he's doing exactly that in the "r4" routine above.
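    A rough, untested sketch of that idea in the style of Chip's demo (push_k, x, the PTRA-based stack, and the PUSH0/PUSHI names are all illustrative): two table entries share one routine, and the skip pattern decides whether the operand is read from the bytecode stream or is an implied zero.

    push_k	mov	x,#0		'skipped by the PUSHI entry
    		rfbyte	x		'skipped by the PUSH0 entry
    	_ret_	wrlong	x,ptra++	'push x onto a hub stack (convention only)

    'skip bits are LSB-first, one per instruction, 1 = skip (as in Chip's table above)
    		long	push_k | %0_1_0 << 10	'PUSH0: skip the rfbyte
    		long	push_k | %0_0_1 << 10	'PUSHI: skip the mov x,#0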
  • Heater. Posts: 21,230
    eric,

    What? You want to wake up Zog again?

  • Seairth wrote: »
    You could conceivably embed data in the bytecode stream itself. Since he's using RFxxx, you could simply call RFBYTE (etc.). In the case where the embedded data is optional, no problem: you have two different byte codes that jump to the same routine, but with two different skip tables (one that calls RFBYTE, and the other that doesn't).

    Sure, if you define the bytecode format yourself you could do that. But if you're trying to emulate an existing CPU or virtual machine you don't get to define the bytecodes. Moreover, for some really common operations (like pushing small constants) you can get much better code density by putting the data into the bytecode.

  • Heater. wrote: »
    What? You want to wake up Zog again?

    Why not? It's a pretty simple bytecode, seems like a good fit for this.

    Did you ever get the little-endian version of zpu-gcc working? Byte swapping seems like it could be a significant bottleneck in a P2 ZPU interpreter.

    Eric
  • Heater. Posts: 21,230
    Yep, the ZPU instruction set is about as small as one can sensibly get.
    Why not?
    I might give it a look. Thing is I already started day dreaming about a RISC V emulator for the P2. That has a pretty small instruction set. Although it has some funky instruction formats to deal with. What better way to get acquainted with the P2 instruction set than write an emulator? Also the RISC-V is always going to have well supported and up to date GCC and LLVM compilers.

    Actually I don't recall anything about ZPU endianness now.
  • Heater. wrote: »
    Yep, the ZPU instruction set is about as small as one can sensibly get.
    Why not?
    I might give it a look.
    Oh, I wasn't trying to put that on you. But I definitely think *some* other bytecode should be interpreted on P2 if Chip wants to validate that the new XBYTE/SKIP mechanism is suitably general. It would be very easy to specify the hardware for Spin2 and overlook some essential feature that would make it useful for other bytecodes. And as you say, the ZPU bytecode is about as simple as one can get! But a UCSD P-System bytecode interpreter, or a 6502 interpreter, would do just as well. Anything that wasn't designed originally for this mechanism.
    Thing is I already started day dreaming about a RISC V emulator for the P2. That has a pretty small instruction set. Although it has some funky instruction formats to deal with. What better way to get acquainted with the P2 instruction set than write an emulator?
    The RISC-V ISA is nice, but the instruction layout seems to be optimized for hardware decode rather than software. Might be a bit tricky. Of course, that's never stopped you :).
    Actually I don't recall anything about ZPU endianness now.

    The ZPU is big-endian by default, and so in ZOG you byte-swapped all the longs and then had to apply some hackery to the instruction fetch to get the correct bytes. Near the end of the ZOG thread I think you said you had a little-endian version of the compiler that would avoid that unpleasantness, but then the thread kind of drifted off to sleep and it never got posted (I guess that was when PropGCC was becoming a useful thing). I've googled a bit but the only mention I see of a little-endian zpu gcc is a comment in the gcc commit log that little endian was being disabled because it required a library rebuild.

    Eric
  • Rayman Posts: 13,851
    edited 2017-04-04 15:36
    Would it be better to add a new instruction rather than hijack "_ret_ push #$1F8"?

    Or, maybe just a new alias name for that...
  • Heater. Posts: 21,230
    The field layouts of the RISC-V instructions are a bit messy.

    But, can't we use the new funky SKIP mechanism to selectively execute the required codes that extract the fields from the different instruction types? Thus saving a lot of decisions and jumps.

    I have not looked at the details, just speculating...
  • ersmith wrote: »
    But I definitely think *some* other bytecode should be interpreted on P2 if Chip wants to validate that the new XBYTE/SKIP mechanism is suitably general.
    I'm not sure a simple JVM implementation would be all that hard. We did one for the MPE pretty quickly.

  • Rayman Posts: 13,851
    I just looked up the Pac-Man Z-80 CPU speed and it is 3.072 MHz.

    Is this new feature going to help with real time emulation of that?
  • Rayman wrote: »
    Would it be better to add a new instruction rather than hijack "_ret_ push #$1F8"?

    If I understand correctly, it's *any* RETurn to one of the special addresses #$1F8 or #$1F9 that triggers the "magic".

    "_ret_ push #$1F8" just primes the pump, so to speak.

    This is why I think it would be useful to block actual attempts to execute *any* address (register, more correctly?) in the $1F8..$1FF range.
    These may contain a code address, but not an instruction.

    The idea is to detect obvious "junk" values in the internal call stack.
  • Yes. Chip has a 10 MHz loop going at 80 MHz
  • Jmg, it's a hardware change. We get it in the next FPGA update.

    A 6502 should fly using this.

  • cgracey Posts: 14,133
    I'll have it save the bytecode to PA or PB ($1F6 or $1F7).

    Making this into an instruction would actually slow it down. I realized that to cut the cycles down, it needed to overlap the pipeline-cancellation cycles. It's as fast as I can make it now.
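    Assuming the fetched byte really does end up in PA as described, a bytecode routine could pull operand bits straight out of the byte it was dispatched from. A hypothetical, untested sketch (pin_hi and x are made-up names) for a family of sixteen bytecodes whose low nibble selects the routine and whose high nibble carries a pin number; all sixteen LUT entries would point at this same routine:

    pin_hi	getnib	x,pa,#1		'take bits 7..4 of the fetched bytecode from PA
    	_ret_	drvh	x		'drive that pin high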
  • If there were an "RFLAST D" instruction (and a 32-bit internal register that held the last RFxxx value), he could just call that. Heck, that might even have some uses elsewhere.
  • Keep in mind that it's only 8 clocks for very simple things that only need one PASM instruction. If your bytecode needs to do more, then it's more clocks. I'm guessing that Z80 or 6502 emulation will have some that are just one instruction, but some that are several, so it probably averages out to something like 10-12 cycles. However, if the final P2 runs at 160 MHz, then that's still plenty fast for "faster than the real chips" emulation of Z80s and 6502s.

    Chip, an interesting variant of this would be a case where you could get it to read the bytecode, and mask it before doing the LUT lookup. So that, for example, the lower 4 bits of the bytecode did the lookup, but then the upper 4 bits could be read as data in the routine (from the PA/PB registers). Then you don't have to fill the LUT with a bunch of the same values to get the same result. Perhaps the mask could be something preloaded with a setq or similar?
  • jmg Posts: 15,144
    Conga wrote: »
    Rejecting illegal instructions seems important for any attempt to discuss IEC60730
    (I don't know what it actually requires, but intuitively detecting and rejecting illegal instructions should be part of it).
    Yes, IEC60730 is becoming more important, and P2 certainly should be 'checked against' that before release.
    P2 does have two oscillators, but I think no hardware watchdog & no oscillator fail detection.

    It may require a complete COG, dedicated to COP tasks, to give some IEC60730 tick-boxes.
  • Agreed. Most will be a couple instructions, often a few. Update flags, etc...

    Of the two, 6502 should be the simplest and fastest. Z80 has a few pretty complicated instructions. Well, maybe. There is decimal mode on 6502, which may take some thinking.

    When we get this, I may do part of a 6502. I think it will run at over 2 MHz, and that's roughly a 4 MHz Z80.

  • jmg Posts: 15,144
    Roy Eltham wrote: »
    Chip, an interesting variant of this would be a case where you could get it to read the bytecode, and mask it before doing the LUT lookup. So that, for example, the lower 4 bits of the bytecode did the lookup, but then the upper 4 bits could be read as data in the routine (from the PA/PB registers). Then you don't have to fill the LUT with a bunch of the same values to get the same result. Perhaps the mask could be something preloaded with a setq or similar?
    That could help some bytecode designs.

    Below are the CIL ones with constant groups; these do not seem to fit a simple mask step, but they are sequential, so they look to need an add/subtract somewhere along with the mask? (See the sketch after the list.)
    0x02	ldarg.0	Load argument 0 onto the stack.	Base instruction
    0x03	ldarg.1	Load argument 1 onto the stack.	Base instruction
    0x04	ldarg.2	Load argument 2 onto the stack.	Base instruction
    0x05	ldarg.3	Load argument 3 onto the stack.	Base instruction
    0x0E	ldarg.s <uint8 (num)>	Load argument numbered num onto the stack, short form.	Base instruction
    ..
    0x16	ldc.i4.0	Push 0 onto the stack as int32.	Base instruction
    0x17	ldc.i4.1	Push 1 onto the stack as int32.	Base instruction
    0x18	ldc.i4.2	Push 2 onto the stack as int32.	Base instruction
    0x19	ldc.i4.3	Push 3 onto the stack as int32.	Base instruction
    0x1A	ldc.i4.4	Push 4 onto the stack as int32.	Base instruction
    0x1B	ldc.i4.5	Push 5 onto the stack as int32.	Base instruction
    0x1C	ldc.i4.6	Push 6 onto the stack as int32.	Base instruction
    0x1D	ldc.i4.7	Push 7 onto the stack as int32.	Base instruction
    0x1E	ldc.i4.8	Push 8 onto the stack as int32.	Base instruction
    0x15	ldc.i4.m1	Push -1 onto the stack as int32.	Base instruction
    0x06	ldloc.0	Load local variable 0 onto stack.	Base instruction
    0x07	ldloc.1	Load local variable 1 onto stack.	Base instruction
    0x08	ldloc.2	Load local variable 2 onto stack.	Base instruction
    0x09	ldloc.3	Load local variable 3 onto stack.	Base instruction
    0x0A	stloc.0	Pop a value from stack into local variable 0.	Base instruction
    0x0B	stloc.1	Pop a value from stack into local variable 1.	Base instruction
    0x0C	stloc.2	Pop a value from stack into local variable 2.	Base instruction
    0x0D	stloc.3	Pop a value from stack into local variable 3.	Base instruction
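    As sketched below for the ldc.i4 group (untested, and assuming the fetched bytecode is latched into PA as Chip mentions above), one way is to point every entry of a sequential group at a single routine and recover the constant with a subtract; ldc_i4, x and the PTRA-based stack are illustrative only:

    'table entries for $15..$1E (ldc.i4.m1 .. ldc.i4.8) all point here
    ldc_i4	mov	x,pa		'PA assumed to hold the fetched bytecode
    		sub	x,#$16		'ldc.i4.0..8 give 0..8; $15 (ldc.i4.m1) wraps to -1
    	_ret_	wrlong	x,ptra++	'push as int32 onto a hub stack (convention only)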
    
    

  • jmg Posts: 15,144
    David Betz wrote: »
    I'm not sure a simple JVM implementation would be all that hard. We did one for the MPE pretty quickly.
    It certainly is a good idea to test other bytecodes, and most bytecodes will be limited to <= 256 choices, so the core engine is not large. ( > 256 are managed with escapes)

    The function calls needed to get full working systems are a different story, but just the core engines could be tested?
    As mentioned above, the final timing of bytecodes will vary upwards from the base of 8 clocks.
  • It is really cool how fast this now is. I personally was drawn to the Propeller through the work of MPARC and his self-hosted Spin. It was like being back in the '80s running my small Atari 128 XE. And doing that with the PE-Kit on a breadboard was amazing.

    I used PropGCC also, but never really programmed in C professionally. After my mainframe time with COBOL I went quite abruptly into C#, so I really would like to see a CIL bytecode interpreter, but sadly I am not (yet?) able to write one.

    David and Eric seem to be a little frustrated, on one side because of the constant nagging about C here in the forums, but also because of the constant op-code changes. What I think is needed (besides financial support from Parallax) is some cheering on to get those two highly qualified people working on it again.

    C on the P1 was a stepping stone to having C running on the P2 from the beginning, and even if I have a hard time with C myself, supporting C on the P2 is a main goal for the money-making education support Parallax provides.

    Dave mentioned a couple of times that the CMM memory model could benefit from some decode instruction to expand CMM bytecodes to PASM instructions. I just loosely followed the PropGCC discussions at that time, but CMM on the P1 is usually faster than Spin on the P1 and has a comparably small memory footprint.

    Since I have never written a compiler myself, and am not versed in C, I can only guess how much work is needed to support C on the P2.

    But slowly I get the feeling that Chip is finalizing the op-code changes, so it might be REALLY important to explain to him what changes would be helpful to support GCC's code generator. One thing I remember was the LINK register, and that is done. But besides the CMM decode instruction there might be other things that could be streamlined in Verilog to better support GCC code generation.

    So writing SPIN2 showed some newly needed instructions/shortcuts to really speed things up. For that reason I think it is quite important and necessary to work on GCC before any Verilog code freeze.

    There might be an equivalent speed gain possible for GCC, and nobody knows it now because nobody is working on it.

    just my 2 cents.

    Mike
  • jmg Posts: 15,144
    msrobots wrote: »
    But slowly I get the feeling that Chip is finalizing the op-code changes, so it might be REALLY important to explain to him what changes would be helpful to support GCC's code generator. One thing I remember was the LINK register, and that is done. But besides the CMM decode instruction there might be other things that could be streamlined in Verilog to better support GCC code generation.

    So writing SPIN2 showed some newly needed instructions/shortcuts to really speed things up. For that reason I think it is quite important and necessary to work on GCC before any Verilog code freeze.
    There might be an equivalent speed gain possible for GCC, and nobody knows it now because nobody is working on it.
    Note that Spin2 must use a bytecode engine, whilst I expect GCC's main mode will be native opcodes for COG/HUBEXEC.
    Of course that does not exclude also doing some other code generation for C.

    Alternatively, if P2 is able to run CIL bytecodes, it will pick up a shipload of languages right there.

  • JasonDorie Posts: 1,930
    edited 2017-04-04 22:00
    Doesn't CIL omit argument types on its instructions? I looked into it once as a possibility for a runtime scripting language, and that was one of the major pain points, because it means every add/sub/mul/div/etc. needs to type-check and decide what routines to call. You keep saying you want CIL support, but the instruction set is huge, and it doesn't look like it would be very fast. It would almost certainly require interpreter-level support for exceptions and type reflection, neither of which is trivial. This feels like something better left for after the chip is "finished".