New Spin
cgracey
Posts: 14,152
I've started working on the new Spin.
At first, it's going to be interpreted byte code. In-line assembly will be allowed, though. Also, every assembly instruction, except branches and context-dependent stuff like ALTI and AUGS, has a procedure form in Spin, making all hardware functions readily accessible. It actually takes a huge load off Spin, by circumventing the need to re-dress lots of functions. If you look at the spreadsheet linked to in the "Prop2 FPGA Files!!!" thread, you can see how they work. They will get people rapidly acquainted with how the actual instructions work. There are local bit variables, CF and ZF, which get used and updated by the pasm-instruction procedures.
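As a rough illustration of how an instruction's procedure form might carry the flags along, here is a small Python model. It is not Spin syntax: `Flags`, `pasm_add` and `pasm_addx` are hypothetical names. The only hardware behavior assumed is that ADD sets C on an unsigned 32-bit carry-out and Z on a zero result, and that ADDX adds the incoming carry and ANDs its zero test with the previous Z.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class Flags:
    """Stand-in for the local bit variables CF and ZF (hypothetical model)."""
    cf: int = 0
    zf: int = 0


def pasm_add(flags: Flags, d: int, s: int) -> int:
    """Model of ADD D,S WCZ: returns the 32-bit sum and updates CF/ZF."""
    total = (d & MASK32) + (s & MASK32)
    result = total & MASK32
    flags.cf = 1 if total > MASK32 else 0   # carry out of bit 31
    flags.zf = 1 if result == 0 else 0
    return result


def pasm_addx(flags: Flags, d: int, s: int) -> int:
    """Model of ADDX D,S WCZ: uses CF as carry-in, then updates CF/ZF."""
    total = (d & MASK32) + (s & MASK32) + flags.cf
    result = total & MASK32
    flags.cf = 1 if total > MASK32 else 0
    # Z is ANDed with the previous Z so multi-long results test as a whole.
    flags.zf = 1 if (flags.zf and result == 0) else 0
    return result


# 64-bit add of $0000_0000_FFFF_FFFF + 1, done as two 32-bit steps.
f = Flags()
low = pasm_add(f, 0xFFFFFFFF, 1)    # low long wraps to 0, CF=1, ZF=1
high = pasm_addx(f, 0, 0)           # carry ripples into the high long
```

The second call both consumes and rewrites the flags, which is the "used and updated" behavior described above.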
For the run-time data and call stacks, the LUT will be used. The user can set how much is otherwise available for his own use. This means that there is no need to declare a stack in hub space. It's now implied. The limitation is up to 512 longs, but that's going to be fine for programs that don't call themselves recursively.
To fetch code bytes, 'RDBYTE b,PTRA++' could be used, but I came up with a faster way of doing it, while keeping the FIFO free for the user:
'
'
' Get new block, then bytes via 'calld b_ret,b_call'
'
new_block       setq    #15
                rdlong  block_base,block_addr
                mov     block_long,#0
                mov     b_call,#new_long
                ret

new_long        alts    block_long,#block_base  'get byte %xxxx00
                getbyte b,0,#0
                calld   b_call,b_ret

                alts    block_long,#block_base  'get byte %xxxx01
                getbyte b,0,#1
                calld   b_call,b_ret

                alts    block_long,#block_base  'get byte %xxxx10
                getbyte b,0,#2
                calld   b_call,b_ret

                alts    block_long,#block_base  'get byte %xxxx11
                getbyte b,0,#3
                calld   b_call,b_ret

                incmod  block_long,#15  wc      'another long?
        if_nc   jmp     #new_long

                add     block_addr,#64          'read next block
                setq    #15
                rdlong  block_base,block_addr
                jmp     #new_long

block_base      res     16
block_long      res     1
block_addr      res     1
b_call          res     1
b_ret           res     1
That only needs one instruction (a CALLD) per byte fetch and it takes 12 clocks, plus another ~50 clocks every 64th byte. The simple RDBYTE takes 9..24 clocks.
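For readers who want to see the buffering scheme and its amortized cost in one place, here is a toy Python model. It is not cycle-accurate: `BlockFetcher` and its fields are hypothetical names, and the clock tally uses only the round figures quoted above (12 clocks per byte, roughly 50 per block refill). It assumes little-endian longs, so byte 0 of a long is the lowest-addressed byte.

```python
BLOCK_LONGS = 16                  # SETQ #15 + RDLONG moves 16 longs at a time
BLOCK_BYTES = BLOCK_LONGS * 4     # 64 bytes per block
BYTE_CLOCKS = 12                  # figure quoted in the post
REFILL_CLOCKS = 50                # approximate figure quoted in the post


class BlockFetcher:
    """Toy model of the block-buffered byte fetch (not cycle-accurate)."""

    def __init__(self, hub: bytes, start: int = 0):
        self.hub = hub
        self.block_addr = start
        self.block_long = 0       # which of the 16 buffered longs is current
        self.byte_sel = 0         # which byte of that long comes next
        self.clocks = 0           # the initial block load is not counted
        self._load_block()

    def _load_block(self) -> None:
        chunk = self.hub[self.block_addr:self.block_addr + BLOCK_BYTES]
        chunk = chunk.ljust(BLOCK_BYTES, b"\x00")     # pad past end of hub
        self.block_base = [
            int.from_bytes(chunk[i:i + 4], "little")
            for i in range(0, BLOCK_BYTES, 4)
        ]

    def next_byte(self) -> int:
        if self.byte_sel == 4:                        # past the fourth GETBYTE
            self.byte_sel = 0
            self.block_long = (self.block_long + 1) % BLOCK_LONGS   # INCMOD
            if self.block_long == 0:                  # wrapped: next block
                self.block_addr += BLOCK_BYTES
                self._load_block()
                self.clocks += REFILL_CLOCKS
        value = (self.block_base[self.block_long] >> (8 * self.byte_sel)) & 0xFF
        self.byte_sel += 1
        self.clocks += BYTE_CLOCKS
        return value


fetcher = BlockFetcher(bytes(range(200)))
fetched = [fetcher.next_byte() for _ in range(200)]
amortized = BYTE_CLOCKS + REFILL_CLOCKS / BLOCK_BYTES   # 12.78125 clocks/byte
```

On those figures the long-run average is about 12.8 clocks per byte, which sits inside the 9..24 clock range of a plain RDBYTE while leaving the FIFO untouched.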
Comments
Will the LUT stacks work bottom up, or top down? Thinking top down might be better. Users can make their declarations, always work from 0 to [limit], SPIN works top down and is ignored by the programmer, who knows it's there, but doesn't have to worry about it much, until it can't be.
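The top-down arrangement suggested here can be sketched as a small Python model. This is only an illustration of the idea, not how Spin will actually lay out the LUT: `LutLayout`, `user_longs` and the collision check are all assumptions, and the only figure taken from the thread is the 512-long LUT size.

```python
LUT_LONGS = 512


class LutLayout:
    """User data grows up from 0; the interpreter stack grows down from 511."""

    def __init__(self, user_longs: int):
        if not 0 <= user_longs <= LUT_LONGS:
            raise ValueError("user region must fit in the LUT")
        self.lut = [0] * LUT_LONGS
        self.user_limit = user_longs   # user owns indices 0..user_limit-1
        self.sp = LUT_LONGS            # empty stack: sp is one past the top

    def push(self, value: int) -> None:
        if self.sp - 1 < self.user_limit:
            raise OverflowError("stack would collide with the user region")
        self.sp -= 1
        self.lut[self.sp] = value & 0xFFFFFFFF

    def pop(self) -> int:
        if self.sp == LUT_LONGS:
            raise IndexError("stack is empty")
        value = self.lut[self.sp]
        self.sp += 1
        return value


layout = LutLayout(user_longs=64)   # user keeps LUT[0..63] for himself
layout.push(0x1234)                 # lands in LUT[511]
layout.push(0x5678)                 # lands in LUT[510]
```

With this split the user's indices never move, and the stack only becomes the programmer's problem when a push reaches the reserved region.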
Aren't LUT addresses unique from HUB ones in cog land? Or did that change? So I think taking addresses shouldn't matter for LUT vs HUB, although which instruction is used to read the value will be different depending...
I was thinking that this time around, Spin would be compiled, (And I am sure compilers will be written) but I was surprised to see it will be interpreted.
Anyway, YAY! Keep knocking them out of the park Chip! I follow the Prop2 forum like a lovesick crackhead, hoping to see when we might get some silicon.
It will be ready when it is ready, and I am good with that.
Yes, that is impressive work, and the risk of lacking a common code base is Spin2 fragmentation & incompatibility, which will put users off P2 rather than attract them.
Has the P2 testing really finished?
USB code seemed still in a state of flux last I looked, and Video test cases are far from complete.
I've not seen anyone talk about cameras for a while...
Or motor controller use cases?
Or any examples using HyperRAM/HyperFLASH?
A lot of Spin is P1 centric, and not directly translatable to P2. Also, stuff has to be added to support the new features of the hardware.
The basic syntax and non-hardware-specific stuff can be the same, but there have to be differences, and programmers will need to know them to use the P2 for anything beyond the most basic stuff.
I suppose someone could do an emulation of P1 on P2, but matching the exact timings of the counter stuff will be difficult, and that's just one part of it.
David,
It's different instructions, but that can be handled for you in Spin.
Well, yes, that's yet another Spin-portability-downside-issue.
My main point was more related to spawning Multiple flavours of P2-Spin.
In addition to this, using the fast fill buffer from hub to lut and then peel off the required bytecodes will also result in a big clock saving.
In my faster P1 Spin Interpreter, I put the decode table into hub. There is probably enough lut space for this and it saves quite a few clocks and space, meaning that the actual interpreter can be improved too.
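A decode table of this kind amounts to one lookup per bytecode. The Python sketch below shows the general shape only: the three opcodes and the `VM` class are invented for illustration and are not P1 or P2 Spin bytecodes, and the list `DECODE_TABLE` merely stands in for a table held in LUT or hub.

```python
class VM:
    """Minimal stack machine used only to demonstrate table dispatch."""

    def __init__(self, code: bytes):
        self.code = code
        self.pc = 0
        self.stack = []
        self.running = True

    def fetch(self) -> int:
        value = self.code[self.pc]
        self.pc += 1
        return value


def op_halt(vm: VM) -> None:
    vm.running = False


def op_push_byte(vm: VM) -> None:
    vm.stack.append(vm.fetch())          # operand is the next code byte


def op_add(vm: VM) -> None:
    b = vm.stack.pop()
    a = vm.stack.pop()
    vm.stack.append((a + b) & 0xFFFFFFFF)


def op_undefined(vm: VM) -> None:
    raise ValueError(f"undefined bytecode at {vm.pc - 1}")


# One entry per possible bytecode, indexed directly by the fetched byte.
DECODE_TABLE = [op_undefined] * 256
DECODE_TABLE[0x00] = op_halt             # hypothetical opcode assignments
DECODE_TABLE[0x01] = op_push_byte
DECODE_TABLE[0x02] = op_add


def run(code: bytes) -> list:
    vm = VM(code)
    while vm.running:
        DECODE_TABLE[vm.fetch()](vm)     # single lookup, then jump to handler
    return vm.stack


result = run(bytes([0x01, 2, 0x01, 3, 0x02, 0x00]))   # push 2, push 3, add
```

Because the table is just data, rarely used handlers can be moved elsewhere without changing the dispatch loop.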
There are lots of ways to tweak the interpreter because we have lut and hub-exec. The little-used bytecodes can be hived off to hub-exec, giving more space for the interpreter to be optimised.
All these things can be improved once we have the base.
BTW If the P1 Bytecode structure does not change, there is no reason why there should be much difference to the P1 spin code, apart from extensions.
That is my thought. It *SEEMS* like building a compiler would be easier than building a tokenizer and optimizer for the programming tool and an interpreter in ROM. I only understand these things as concepts, so it's difficult for me to gauge what is "easy". The first time I came across the concept was with BASIC09 back when I was a kid. It's still all voodoo to me.
Without good optimization, a native compiler will make big code. The byte code starts out optimized. Plus, habit, I guess.
As Cluso pointed out, things are way faster without having to do random hub r/w's.
Here's how we can solve this, based on address (@variable):
Or, maybe it would be better to do this, as it would protect lower hub memory and make address MSB's "don't-care":
Either way, indexing math would work fine.
You just do an address check before reading or writing. Yes, easy in an interpreter, harder in compiled code.
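The address check itself can be sketched independently of which mapping is chosen. The Python model below is an assumption throughout: `LUT_WINDOW_BASE` is a made-up constant, not a value from this thread, and `Memory` is a hypothetical stand-in for the interpreter's read and write helpers. It only shows the "test the address, then pick LUT or hub" step.

```python
LUT_LONGS = 512
LUT_WINDOW_BASE = 0x10000                     # hypothetical, for illustration
LUT_WINDOW_END = LUT_WINDOW_BASE + LUT_LONGS * 4


class Memory:
    """Routes long accesses to LUT or hub according to the address."""

    def __init__(self, hub_size: int = 1024):
        self.hub = bytearray(hub_size)
        self.lut = [0] * LUT_LONGS

    def _check_hub(self, addr: int) -> None:
        if not 0 <= addr <= len(self.hub) - 4:
            raise IndexError(f"hub address out of range: {addr:#x}")

    def read_long(self, addr: int) -> int:
        if LUT_WINDOW_BASE <= addr < LUT_WINDOW_END:      # the address check
            return self.lut[(addr - LUT_WINDOW_BASE) >> 2]
        self._check_hub(addr)
        return int.from_bytes(self.hub[addr:addr + 4], "little")

    def write_long(self, addr: int, value: int) -> None:
        value &= 0xFFFFFFFF
        if LUT_WINDOW_BASE <= addr < LUT_WINDOW_END:      # the address check
            self.lut[(addr - LUT_WINDOW_BASE) >> 2] = value
            return
        self._check_hub(addr)
        self.hub[addr:addr + 4] = value.to_bytes(4, "little")


mem = Memory()
mem.write_long(16, 0xDEADBEEF)                # ordinary hub variable
mem.write_long(LUT_WINDOW_BASE + 8, 123)      # lands in LUT long 2
```

An interpreter can hide this test inside its read and write routines; compiled code would have to emit it at every access whose target is not known at compile time.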
Remember, we get real in line PASM this time. Speed vs size will play out much differently. It will be faster and easier to optimize parts for speed.
That would be complicated, when it comes to the FIFO implementation. We'll just filter in software.
David, I'm not seeing what you're talking about. Wouldn't the above pattern work as-is in PASM?
Perhaps we can have a way to say what memory we want locals/params in? Otherwise we are forced to only use globals...