Fast Spin execution: a Verilog coding challenge
nutson
Posts: 242
The killer app that would make a P1V board a serious proposal would be the ability to execute (a subset of) Spin at much higher speed than the standard Spin interpreter.
Comments
I am willing to award a prize to the first two implementations of a Verilog Fast Spin execution engine or a Spin decoding peripheral for the P1V (on DE0-Nano, DE2-115 or BeMicro CV) that significantly speeds up Spin execution (say, greater than 1M I/O operations per second).
The prize would be a choice from some electronic goodies I am ready to say goodbye to (some 1C12 FPGA boards, a 2C35 board, a 3C40 board and some other stuff).
What is the current 'best Spin'? I think the ROM one was improved?
There are two branches to faster byte codes
a) Improve the byte-code feeding pathway - the fetch and prepare for decode stuff.
That mostly involves HW data management, and not many new opcodes.
Anyone got code for the current-best byte-code fetcher ?
b) Next, the most common routines called by the bytecode decoder could be shrunk by adding new opcodes.
The biggest gains are likely from a)
How much space ? Can all of that go into verilog ?
Spin can be compiled to LMM using spin2cpp. It actually has a mode where it converts to C/C++ and then runs the C/C++ compiler for you automatically to produce a binary file just like the normal Spin compiler does, except that it'll contain LMM instead of spin bytecodes. With spin2cpp and PropGCC you can even compile Spin to COG code.
Eric
Its main improvement was firstly to use a hub-based lookup table to decode the Spin bytecode. That table was 256 longs, each containing 3 x 9-bit instruction vectors with 5 bits of flags.
Each bytecode could then be broken down into up to 3 subroutines (vectors).
This allowed me to unravel the interpreter (it was intertwined by Chip to make it fit). A huge improvement was then obtained with the maths/logic instructions.
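The table-driven decode described above can be sketched in a few lines. The bit layout below (three 9-bit vectors in the low 27 bits, flags in the top 5) is my assumption for illustration, not necessarily how the original interpreter packed its longs:

```python
# Sketch of the hub lookup-table decode: each of the 256 bytecodes maps to
# one 32-bit long holding three 9-bit subroutine vectors plus 5 flag bits.
# The exact bit layout here is an assumption, not the original's.

def pack_entry(vec1, vec2, vec3, flags):
    """Pack three 9-bit cog addresses and 5 flag bits into one long."""
    assert all(0 <= v < 512 for v in (vec1, vec2, vec3)) and 0 <= flags < 32
    return vec1 | (vec2 << 9) | (vec3 << 18) | (flags << 27)

def unpack_entry(entry):
    """Recover the three vectors and the flags from a table long."""
    return (entry & 0x1FF,
            (entry >> 9) & 0x1FF,
            (entry >> 18) & 0x1FF,
            (entry >> 27) & 0x1F)

# A decode step reads table[bytecode] and then calls up to three
# subroutines in turn.  The bytecode and vectors here are hypothetical:
table = [0] * 256
table[0x64] = pack_entry(0x0A0, 0x0B5, 0, 0b00001)
print(unpack_entry(table[0x64]))  # -> (160, 181, 0, 1)
```

One hub read per bytecode replaces the intertwined decode logic, which is what makes the unravelling possible.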
I still had space to further unravel some code, or add new features. That is where I was up to.
FYI I also had a zero footprint debugger that could trace both spin bytecodes and/or pasm instructions.
IIRC both of these are posted in the obex.
Incredible improvements could be made to this interpreter by...
1. Placing the vector table into extended cog ram
2. Placing the hub stack into extended cog ram
Both of these would be quite simple to implement in Verilog.
I also added a cached byte read that boosted overall performance to ~10% above the standard Spin interpreter.
I thought I posted that code too, but apparently not. I'll hunt through my old Verilog stuff and see if I can find it.
What size is each of these, and can you quantify 'incredible'?
It seems removing 10 opcodes and adding 1 new P1V opcode gained a somewhat modest ~5% (which suggests the average code path is some ~200 opcodes?), and I see other claims of 5% more and 25%, but even with all those we are at ~64% of the old times. Useful, but not a leap.
Sure, I think that was the angle the OP was looking for, but the LUT-added cost would not be small.
Maybe a P1V can give a choice of two types of COG, one that runs PASM and another that is dedicated to ByteCode(Spin et al) - that avoids bloating a PASM COG, but gives much faster Spin choices ?
Of course you then have to convert Spin->CMM, but that's a solved problem (have I mentioned spin2cpp before )?
Now a hub access for the vector table vs extended cog RAM would be 4 + 0-16 clocks vs 4 clocks, and this is for every bytecode.
The next is that virtually every bytecode starts with a POP and ends with a PUSH, both being hub accesses. If these were implemented as a cog stack, those again would be significant gains, and for almost every bytecode. With the aid of push & pop instructions, the add/subtract would also see a reduction of 4 clocks overhead x2 per bytecode.
These would result in quite substantial gains, because they are for almost every bytecode. The biggest gains are to reduce the bytecode overhead.
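The per-bytecode arithmetic above can be made concrete. The numbers below follow the post's "4 + 0-16 vs 4 clocks" framing, with an average hub-sync wait picked for illustration; they are back-of-envelope figures, not measurements:

```python
# Rough per-bytecode overhead, following the post's clock figures:
# a hub access costs 4 + 0-16 clocks depending on hub sync, a cog
# access costs a flat 4.  The average wait of 8 is an assumption.

HUB_AVG = 4 + 8   # assumed average hub access (4 + mid-range sync wait)
COG = 4           # cog (extended cog RAM) access

# Overhead present on nearly every bytecode: one vector-table read,
# one POP and one PUSH.
hub_based = 3 * HUB_AVG   # table and stack both in hub
cog_based = 3 * COG       # table and stack moved into extended cog RAM

print(f"hub-based overhead: ~{hub_based} clocks/bytecode")
print(f"cog-based overhead: ~{cog_based} clocks/bytecode")
print(f"saving: ~{hub_based - cog_based} clocks on almost every bytecode")
```

Because the saving applies to the fixed overhead of nearly every bytecode rather than to particular operations, it compounds across the whole instruction stream.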
Then, apart from new instructions to perform some bytecode instructions, the biggest gains would be in further unravelling of some of the bytecodes. But these would only be when these bytecodes are executed. I have no real understanding of how the bytecodes are used, and hence no idea which are the most common bytecodes to target code reduction.
However, because it is a stack-based interpreter, you are never going to get anywhere near pasm execution. Even just a simple pop/execute/push would be 3 times slower than pasm. Add to that the fact that bytecodes are fetched from hub, and you can more than double/triple that, meaning you are virtually up to 10 times slower.
Obviously, I have generalised a lot. But the fact remains that the interpreter is going to be substantially slower than pasm code.
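The pop/execute/push pattern behind that 3x figure looks like this in miniature. The bytecode values and the two ALU ops are invented for illustration; the point is that one native add becomes three data movements:

```python
# Minimal stack-machine sketch of the pop/execute/push cycle the post
# describes.  Bytecode numbering is hypothetical; purely illustrative.

def run(bytecodes):
    stack = []
    ops = {
        0x01: lambda a, b: a + b,   # hypothetical ADD bytecode
        0x02: lambda a, b: a - b,   # hypothetical SUB bytecode
    }
    pc = 0
    while pc < len(bytecodes):
        bc = bytecodes[pc]
        pc += 1
        if bc == 0x00:              # PUSH immediate: next byte is the value
            stack.append(bytecodes[pc])
            pc += 1
        else:
            b = stack.pop()         # pop operand...
            a = stack.pop()         # ...pop operand...
            stack.append(ops[bc](a, b))  # ...execute and push the result
    return stack[-1]

print(run([0x00, 7, 0x00, 5, 0x01]))  # 7 + 5 -> 12
```

A native add is a single instruction; here it is two pops, an execute and a push, before even counting the hub fetch of the bytecode itself.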
Looking at the Spin bytecodes, about half are for accessing data and the stack, a quarter for arithmetic and processing, and another quarter for special functions, some of which could be implemented in a Spin VM, with the rest needing ASM coding.
My idea of fast Spin execution would be to replace one or both counters, freeing three RW ports, with a Spin peripheral. ASM code would set the pointers (pcurr, dcurr etc.) and then start the engine, which independently accesses main memory to fetch and execute the simple bytecodes. The ASM code monitors the engine, which stops when a complicated bytecode is encountered; the ASM executes that code and restarts the engine.
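The engine/supervisor split proposed above can be modelled as a small cooperative loop. Which bytecodes count as "simple" is invented here for illustration; the real split would be decided by the Verilog engine's capabilities:

```python
# Model of the proposed split: a hardware "Spin engine" free-runs through
# simple bytecodes and halts on a complicated one, so the supervising ASM
# code can handle it and restart the engine.  The SIMPLE set is invented.

SIMPLE = {0x00, 0x01}   # bytecodes the engine handles by itself (assumed)

def engine_run(code, pc):
    """Advance until a non-simple bytecode; return where we stopped."""
    while pc < len(code) and code[pc] in SIMPLE:
        pc += 1          # (a real engine would execute, not just skip)
    return pc

def supervisor(code):
    pc, handled = 0, []
    while pc < len(code):
        pc = engine_run(code, pc)     # engine free-runs on simple codes
        if pc < len(code):
            handled.append(code[pc])  # ASM handles the complex bytecode
            pc += 1                   # then restarts the engine
    return handled

print(supervisor([0x00, 0x01, 0xF0, 0x00, 0xF1]))  # -> [240, 241]
```

The attraction is that the common simple bytecodes never pay the interpreter's dispatch cost at all; only the rare complex ones fall back to the cog.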
The main problem in P1V development is that it uses dual-port memory, which could easily run at 160 MHz, as single-port memory running at 40 MHz. What a waste!! Solving this is the key to higher performance.
?? Err, won't removing the counters break any Spin code that uses the counters?
Do you have links and code-sizes of those ?
The debug angle would seem a glaring blindspot in Parallax Prop support - what did that need as a link ?
Seems this is where a small local MCU would help. It should also allow much faster download/run times.
The stack depends on the app. IIRC bytes are pushed, so maybe 1KB would suffice???
The zero footprint debugger uses spin and some LMM support. Cannot recall the hub requirement but the LMM stub to run it lives in shadow cog ram at $1F0-1F3.
How does that communicate with the host, and does it need any off-chip storage (aka stack) ?
How many bytes for the loader-useful-portion of this, and what speed can it communicate at ?
Just musing over ways to boost the 'interactive' side of Prop use, and I have a small MCU with blocks capable of 3-4 MBd links, and 8~12MBd into SPI memory..
So one idea was a two-step loader, targeting Props with crystals (= vast majority): first a minimal, compact load of a faster loader, then the baud rate flips from 115200 to >1 MBd and the 3-bit load flips to an 8-bit load, and the real code loads.
I figure gains of 25~70x in data path speeds are possible.
Makes sense that the fast path is also used for debug, & the flow is very similar to a debug flow.
I was thinking of including the stub-loader, in the local MCU firmware, to also allow faster SPI links, but it could be sent from the PC host. Not sure yet on putting in a SPI page, or inside MCU - hence the minimal-size questions.
To keep code size down, I'm thinking the 3b stub-coding will be done on Host (but that does bump the stub-storage needed locally)