Fast Spin execution: a Verilog coding challenge
nutson
Posts: 242
The killer app that would make a P1V board a serious proposal would be the ability to execute (a subset of) Spin at much higher speed than the standard Spin interpreter.
Comments
I am willing to award a prize to the first two implementations of a Verilog Fast Spin execution engine or a Spin decoding peripheral for the P1V (on DE0-Nano, DE2-115 or BeMicro CV) that significantly speeds up Spin execution (say, greater than 1M I/O operations per second).
The prize would be a choice from some electronic goodies I am ready to say goodbye to (some 1C12 FPGA boards, a 2C35 board, a 3C40 board and some other stuff).
What is the current 'best Spin'? I think the ROM one was improved?
There are two branches to faster byte codes
a) Improve the byte-code feeding pathway - the fetch and prepare for decode stuff.
That mostly involves HW data management, and not many new opcodes.
Anyone got code for the current-best byte-code fetcher ?
b) Next, the most common routines called by the bytecode decoder could be shrunk by adding new opcodes.
The biggest gains are likely from a)
How much space ? Can all of that go into verilog ?
Spin can be compiled to LMM using spin2cpp. It actually has a mode where it converts to C/C++ and then runs the C/C++ compiler for you automatically to produce a binary file just like the normal Spin compiler does, except that it'll contain LMM instead of spin bytecodes. With spin2cpp and PropGCC you can even compile Spin to COG code.
Eric
Its main improvement was firstly to use a hub-based lookup table to decode the Spin bytecode. That table was 256 longs, each containing 3 x 9-bit instruction vectors with 5 bits of flags.
Each bytecode could then be broken down into up to 3 subroutines (vectors).
This allowed me to unravel the interpreter (it was intertwined by Chip to make it fit). A huge improvement was then obtained with the maths/logic instructions.
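The table-driven decode described above can be sketched in a few lines. The bit layout below (three 9-bit vectors in the low 27 bits, flags in the top 5) is my assumption for illustration, not necessarily how the original interpreter packed its longs:

```python
# Sketch of the hub lookup-table decode: each of the 256 bytecodes maps to
# one 32-bit long holding three 9-bit subroutine vectors plus 5 flag bits.
# The exact bit layout here is an assumption, not the original's.

def pack_entry(vec1, vec2, vec3, flags):
    """Pack three 9-bit cog addresses and 5 flag bits into one long."""
    assert all(0 <= v < 512 for v in (vec1, vec2, vec3)) and 0 <= flags < 32
    return vec1 | (vec2 << 9) | (vec3 << 18) | (flags << 27)

def unpack_entry(entry):
    """Recover the three vectors and the flags from a table long."""
    return (entry & 0x1FF,
            (entry >> 9) & 0x1FF,
            (entry >> 18) & 0x1FF,
            (entry >> 27) & 0x1F)

# A decode step reads table[bytecode] and then calls up to three
# subroutines in turn.  The bytecode and vectors here are hypothetical:
table = [0] * 256
table[0x64] = pack_entry(0x0A0, 0x0B5, 0, 0b00001)
print(unpack_entry(table[0x64]))  # -> (160, 181, 0, 1)
```

One hub read per bytecode replaces the intertwined decode logic, which is what makes the unravelling possible.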
I still had space to further unravel some code, or add new features. That is where I was up to.
FYI I also had a zero footprint debugger that could trace both spin bytecodes and/or pasm instructions.
IIRC both of these are posted in the obex.
Incredible improvements could be made to this interpreter by...
1. Placing the vector table into extended cog ram
2. Placing the hub stack into extended cog ram
Both of these would be quite simple to implement in Verilog.
I also added a cached byte read that boosted overall performance to ~10% above the standard Spin interpreter.
I thought I posted that code too, but apparently not. I'll hunt through my old Verilog stuff and see if I can find it.
What size is each of these, and can you quantify 'incredible'?
It seems removing 10 opcodes and adding 1 new P1V opcode gained a somewhat modest ~5% (which suggests the average code path is some ~200 opcodes?), and I see other claims of 5% more and 25%, but even with all those we are at ~64% of the old times. Useful, but not a leap.
Sure, I think that was the angle the OP was looking for, but the LUT-added cost would not be small.
Maybe a P1V can give a choice of two types of COG, one that runs PASM and another that is dedicated to ByteCode(Spin et al) - that avoids bloating a PASM COG, but gives much faster Spin choices ?
Of course you then have to convert Spin->CMM, but that's a solved problem (have I mentioned spin2cpp before )?
Now a hub access for the vector table vs extended cog RAM would be 4 + 0-16 clocks vs 4 clocks, and this is for every bytecode.
The next is that virtually every bytecode starts with a POP and ends with a PUSH, both being hub accesses. If these were implemented as a cog stack, those again would be significant gains, and for almost every bytecode. With the aid of push & pop instructions, the add/subtract would also see a reduction of 4 clocks overhead x2 per bytecode.
These would result in quite substantial gains, because they are for almost every bytecode. The biggest gains are to reduce the bytecode overhead.
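The per-bytecode arithmetic above can be made concrete. The numbers below follow the post's "4 + 0-16 vs 4 clocks" framing, with an average hub-sync wait picked for illustration; they are back-of-envelope figures, not measurements:

```python
# Rough per-bytecode overhead, following the post's clock figures:
# a hub access costs 4 + 0-16 clocks depending on hub sync, a cog
# access costs a flat 4.  The average wait of 8 is an assumption.

HUB_AVG = 4 + 8   # assumed average hub access (4 + mid-range sync wait)
COG = 4           # cog (extended cog RAM) access

# Overhead present on nearly every bytecode: one vector-table read,
# one POP and one PUSH.
hub_based = 3 * HUB_AVG   # table and stack both in hub
cog_based = 3 * COG       # table and stack moved into extended cog RAM

print(f"hub-based overhead: ~{hub_based} clocks/bytecode")
print(f"cog-based overhead: ~{cog_based} clocks/bytecode")
print(f"saving: ~{hub_based - cog_based} clocks on almost every bytecode")
```

Because the saving applies to the fixed overhead of nearly every bytecode rather than to particular operations, it compounds across the whole instruction stream.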
Then, apart from new instructions to perform some bytecode instructions, the biggest gains would be in further unravelling of some of the bytecodes. But these would only be when these bytecodes are executed. I have no real understanding of how the bytecodes are used, and hence no idea which are the most common bytecodes to target code reduction.
However, because it is a stack-based interpreter, you are never going to get anywhere near pasm execution. Even just a simple pop/execute/push would be 3 times slower than pasm. Add to that the fact that bytecodes are fetched from hub, and you can more than double/triple that, meaning you are virtually up to 10 times slower.
Obviously, I have generalised a lot. But the fact remains that the interpreter is going to be substantially slower than pasm code.
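The pop/execute/push pattern behind that 3x figure looks like this in miniature. The bytecode values and the two ALU ops are invented for illustration; the point is that one native add becomes three data movements:

```python
# Minimal stack-machine sketch of the pop/execute/push cycle the post
# describes.  Bytecode numbering is hypothetical; purely illustrative.

def run(bytecodes):
    stack = []
    ops = {
        0x01: lambda a, b: a + b,   # hypothetical ADD bytecode
        0x02: lambda a, b: a - b,   # hypothetical SUB bytecode
    }
    pc = 0
    while pc < len(bytecodes):
        bc = bytecodes[pc]
        pc += 1
        if bc == 0x00:              # PUSH immediate: next byte is the value
            stack.append(bytecodes[pc])
            pc += 1
        else:
            b = stack.pop()         # pop operand...
            a = stack.pop()         # ...pop operand...
            stack.append(ops[bc](a, b))  # ...execute and push the result
    return stack[-1]

print(run([0x00, 7, 0x00, 5, 0x01]))  # 7 + 5 -> 12
```

A native add is a single instruction; here it is two pops, an execute and a push, before even counting the hub fetch of the bytecode itself.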
Looking at the Spin bytecodes, about half are for accessing data and the stack, a quarter for arithmetic and processing, and another quarter for special functions, some of which could be implemented in a Spin VM, with the rest needing ASM coding.
My idea of fast Spin execution would be to replace one or both counters, freeing three RW ports, with a Spin peripheral. ASM code would set the pointers (pcurr, dcurr etc.) and then start the engine, which independently accesses main memory to fetch and execute the simple bytecodes. The ASM code monitors the engine, which stops when a complicated bytecode is encountered; the ASM executes that code and restarts the engine.
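The engine/supervisor split proposed above can be modelled as a small cooperative loop. Which bytecodes count as "simple" is invented here for illustration; the real split would be decided by the Verilog engine's capabilities:

```python
# Model of the proposed split: a hardware "Spin engine" free-runs through
# simple bytecodes and halts on a complicated one, so the supervising ASM
# code can handle it and restart the engine.  The SIMPLE set is invented.

SIMPLE = {0x00, 0x01}   # bytecodes the engine handles by itself (assumed)

def engine_run(code, pc):
    """Advance until a non-simple bytecode; return where we stopped."""
    while pc < len(code) and code[pc] in SIMPLE:
        pc += 1          # (a real engine would execute, not just skip)
    return pc

def supervisor(code):
    pc, handled = 0, []
    while pc < len(code):
        pc = engine_run(code, pc)     # engine free-runs on simple codes
        if pc < len(code):
            handled.append(code[pc])  # ASM handles the complex bytecode
            pc += 1                   # then restarts the engine
    return handled

print(supervisor([0x00, 0x01, 0xF0, 0x00, 0xF1]))  # -> [240, 241]
```

The attraction is that the common simple bytecodes never pay the interpreter's dispatch cost at all; only the rare complex ones fall back to the cog.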
The main problem in P1V development is that it uses dual-port memory, which could easily run at 160 MHz, as single-port memory running at 40 MHz. What a waste!! Solving this is the key to higher performance.
?? Err, won't removing the counters break any Spin code that uses the counters?
Do you have links and code-sizes of those ?
The debug angle would seem a glaring blindspot in Parallax Prop support - what did that need as a link ?
Seems this is where a small local MCU would help. It should also allow much faster download/run times.
The stack depends on the app. IIRC bytes are pushed, so maybe 1KB would suffice???
The zero footprint debugger uses spin and some LMM support. Cannot recall the hub requirement but the LMM stub to run it lives in shadow cog ram at $1F0-1F3.
How does that communicate with the host, and does it need any off-chip storage (aka stack) ?
How many bytes for the loader-useful-portion of this, and what speed can it communicate at ?
Just musing over ways to boost the 'interactive' side of Prop use, and I have a small MCU with blocks capable of 3-4 MBd links, and 8~12MBd into SPI memory..
So one idea was a two-step loader, targeting Props with crystals (= vast majority): first a minimal, compact load of a faster loader, then the baud rate flips from 115200 to >1 MBd and the 3-bit load flips to an 8-bit load, and the real code loads.
I figure gains of 25~70x in data path speeds are possible.
Makes sense that the fast path is also used for debug, & the flow is very similar to a debug flow.
I was thinking of including the stub-loader, in the local MCU firmware, to also allow faster SPI links, but it could be sent from the PC host. Not sure yet on putting in a SPI page, or inside MCU - hence the minimal-size questions.
To keep code size down, I'm thinking the 3b stub-coding will be done on Host (but that does bump the stub-storage needed locally)