(1) If compatibility with existing code is a concern, local variables could be kept in LUT or COG memory only if their address is not taken. That's the approach I took in fastspin (actually in order to keep compatibility I had to put *all* local variables into HUB memory if any of them had their address taken).
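As a sketch of that rule in plain Spin (method names made up for illustration):

PUB no_address(x) : r
  r := x + 1                    ' no @ used, so x and r could live in COG or LUT

PUB takes_address(x) : r | buf
  buf := x
  longmove(@r, @buf, 1)         ' @ is taken, so (in fastspin) all locals here go to HUB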
(2) How will we pass data to/from inline assembly? Will it be able to access local variables and/or object variables? In fastspin inline assembly is allowed to access any local variables (including function parameters and return names). A trivial example would be:
PUB myadd(a, b) : r
  ASM
    mov r, a
    add r, b
  ENDASM
This came in really handy -- for example, most of the builtin functions like waitcnt and the lock functions are just implemented with inline assembly. The compiler is smart enough to inline these.
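For instance, a waitcnt-style builtin could be little more than this (a sketch; fastspin's actual library code may differ in detail):

PUB mywaitcnt(target)
  ASM
    waitcnt target, #0          ' P1: stall until CNT reaches target
  ENDASM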
(3) I would definitely suggest that ease of compilation to PASM code be kept in mind when designing Spin2. With the Prop2 having so much more RAM available (and hubexec!) it will definitely be practical to run compiled PASM code, and it would be nice to have the ability to trade off code size for execution speed.
Another thought: if the bytecode is easy enough to translate to PASM, a JIT compiler would become feasible. At runtime small blocks of code could be compiled on the fly to PASM and executed from a cache (maybe in COG memory, maybe using hubexec). That way small loops that fit in cache could execute at full speed. Again, this probably involves a tradeoff of space for speed, but it could be a significant speed boost.
Would that work well with a stack-oriented byte code instruction set or would it be better to use register-oriented instructions?
I'm not sure which would work better. It's certainly possible to JIT a stack-oriented machine, and those tend to have very simple instruction sets (making the JIT compiler easier to implement). OTOH performance might be better if the instruction set were closer to PASM, e.g. like the PropGCC CMM code, where the "bytecodes" are really compressed versions of PASM instructions.
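To make the comparison concrete, here is a := b + c in both styles (purely illustrative encodings, not any real instruction set):

' stack-oriented bytecode:
'   PUSH b
'   PUSH c
'   ADD
'   POP a        ' four tiny opcodes; the JIT must model the stack to emit good PASM
'
' register-oriented (CMM-like) bytecode:
'   MOV a, b
'   ADD a, c     ' nearly 1:1 with PASM, so the JIT is close to a decompressor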
But why do a byte code interpreter at all? Why not just go directly to a native compiler?
Without good optimization, a native compiler will make big code. The byte code starts out optimized. Plus, habit, I guess.
Even moderate optimization can produce reasonable code. Have you looked at fastspin, Chip? Its optimizer is of middling quality (nowhere near what GCC does, for example) yet the code it produces is, I think, decent. Sizes of binaries are not bad, after inlining and dead code removal. For example, the S3 Scribbler "bp_test" program has the following compiled sizes:
In this case fastspin actually produces smaller binaries (compiling to PASM) than the "regular" bytecode compiler does! Most of that difference is due to removal of unused functions, as we see when looking at the (optimized) bstc result. Still, it's not actually as much overhead as one might think.
fastspin/spin2cpp is open source (MIT) so you certainly could re-use any parts of it that are of interest to you, and I'd be happy to help with that.
I came here this morning, thinking about exactly this. After a lot of consideration, using LMM for the procedure stack may just be too much compartmentalization. It's my opinion that LMM should just remain unused, left free for the hardware functions that use it.
@ersmith, that's sweet. Exactly what is needed to maximize this design. Is anything done for P2 specifically yet, or is it still P1?
@all
I also thought about the explicit vs implicit behavior discussion so far. I very strongly favor implicit. It does the reasonable thing, and just does it as much as is possible.
One of the design goals Chip has had throughout this is to make it conversational, interactive, learn by doing. Seems to me, implicit rules, smart choices like the ones mentioned in this thread, where users aren't responsible for things they may not understand or need yet, make the most sense.
And it makes the most sense because people can converse with, test, explore, and observe something that works far more readily than they can an error message or a debug log.
I like the byte code for its size/speed trade-offs, and we should have it. Despite the larger RAM in this one, once again the possible capability pushes right up against that RAM anyway. That said, the compiler numbers above are encouraging! We definitely want to ensure compilation makes sense, and an LMM stack probably throws a wrench in all of that while not actually yielding a huge benefit in return. Could be a net loss.
On P1, XMM didn't seem to get traction. Maybe it's just too much, COG, LMM, XMM, CMM... LOL!
However, on P2, we have COGex, HUBex, (LUTex, again could be ignored for SPIN), and so here we are again. LOL!
I would totally trade LUT-anything for XMM, leaving us with COG, HUB, XMM as execute target models. CMM may still make sense, as it's a virtual machine of sorts and may have uses: debug, traps, memory management through software, etc...
But, COG, HUB, XMM seems like a great target. This chip will benefit from XMM like a P1 doesn't, and that's due to new memory tech being available as well as more I/O, making it practical.
Like P1 SPIN, XMM may never make sense as a supported thing. But, since we can encapsulate PASM into procedures, doing stuff like fetch from [SD, EEPROM, SPI, XMM, whatever] and execute as overlay or from buffer will be trivial and likely very useful. Just having sensible PASM inline makes this work, but being able to compile in ways that would make doing overlays, etc. work well seems smart right now, and if it's baked in early, easy and accessible too.
This chip has a lot of COGS and resources. It's gonna see some big applications. Why not bake all that in now?
Another trade-off I would make is to trade the byte code target for compilation options that make doing the above as easy as possible. Because the chip is a multi-processor, being able to move code from storage to RAM in a simple, straightforward way makes a ton of sense, and doesn't require an OS, etc... Ideally, that doesn't have to happen, but if somehow it does need to, I would definitely take a simple compiler and language support designed to build code to be used in segments over byte code.
Two ways to get at the big program problem, then: byte code, which will cap at a fairly low value compared to what is possible by facilitating overlays and dynamic loading and execution. A little thought now could knock doing big programs out of the park! And I'm pretty sure we want that.
IMHO, YMMV, of course.
I don't understand what you mean when you say "LMM should remain unused". In fact, there won't be any LMM on P2 since we have hubexec. Did you mean to say "LUT should remain unused"? Also, can you describe what you mean by XMM because I'm not sure it makes much sense for P2. It will end up being a P1-style LMM loop with a cache for accessing external RAM. I doubt that would even get as much traction as XMM on P1 since it would be *so* much slower than hubexec.
I meant both. There was, at one point, some LMM discussion. I missed where it died, apparently. Given that, yes. Leave LUT unused.
Smile, I'm on mobile and see I botched it on XMM.
Back in a bit. It's jumbled for sure.
I read your XMM section again and it made more sense the second time. You're not talking necessarily about fetching one instruction at a time. You talk about overlays which I think could work well. In fact, they could work well on P1 as well.
You mention that XMM will be more useful on P2 because of more pins. You can use XMM on P1 with a SPI flash chip that only uses one additional pin if you support a SPI SD card already.
Overlays work well on P1. Heater and I use them in ZiCog to advantage rather than LMM.
On P2 they will load much faster because of the egg beater!!!
However, some routines will just work better as hubexec, while others will work better as overlays. If there is a lot of looping, for example, overlays will run faster than hubexec.
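On P2 the overlay mechanics can be tiny. A sketch with made-up labels (ovl_image is the overlay's hub copy; the setq burst is what the egg beater accelerates):

        setq    #32-1                   'burst-read 32 longs...
        rdlong  ovl_base, ##@ovl_image  '...from hub into cog regs ovl_base..ovl_base+31
        call    #ovl_base               'run the overlay; it ends with a RET

ovl_base res     32                     'cog-RAM area reserved for overlays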
I jumbled C and SPIN in that. For just SPIN, compile-to-overlay somehow makes a ton of sense.
Assuming it works like ersmith put it here, PASM overlays will make sense no matter what. If it can be made easy to compile a procedure as an overlay target, I would trade that for byte code, if necessary.
I'm thinking of the big program, big buffer case here. Given the hardware capability, I think it's gonna be seen a lot.
@ersmith, that's sweet. Exactly what is needed to maximize this design. Is anything done for P2 specifically yet, or is it still P1?
fastspin / spin2cpp has P2 support, although it isn't up to date (it produces code for an older version of the FPGA). Updating it should be pretty easy. Most of the code generation stuff is independent of P1 or P2, LMM / COG / HUBEXEC; there are just a few places (like putting small loops in FCACHE) that are processor and mode dependent.
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we ever really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
Another thing, about inline PASM: I was going to copy the current data stack frame from LUT and write those values into cog registers 0..7. That way, the PASM code has context that is easily coded for.
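That copy could be a short loop like this (a sketch; frame, ix, and .loop are assumed names, with frame holding the LUT address of the current stack frame):

        mov     ix, #0
.loop   altd    ix, #0          'aim the next instruction's D field at cog reg ix
        rdlut   0-0, frame      'reg[ix] := LUT[frame]
        add     frame, #1
        incmod  ix, #7  wc      'ix cycles 0..7; C set on wrap
if_nc   jmp     #.loop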
You guys have a lot better ideas than me, on the whole. I'm working with what I know, but that will grow during the project.
This morning I've been adding nibble/byte/word prefixes to SETNIB/GETNIB/ROLNIB/etc., so that base and index can be expressed in a prior instruction, giving you random nibble/byte/word reading and writing within cog register space. Someone here had that idea. So obvious, but it never occurred to me. We already had the workhorse instructions in place, too.
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we every really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
Another thing, about inline PASM: I was going to copy the current data stack frame from LUT and write those values into cog registers 0..7. That way, the PASM code has context that is easily coded for.
You guys have a lot better ideas than me, on the whole. I'm working with what I know, but that will grow during the project.
This morning I've been adding nibble/byte/word prefixes to SETNIB/GETNIB/ROLNIB/etc., so that base and index can be expressed in a prior instruction, giving you random nibble/byte/word reading and writing within cog register space. Someone here had that idea. So obvious, but it never occurred to me. We already had the workhorse instructions in place, too.
Since you're working on the P2 byte code instruction set, I'd like to make a request. Could you document it so others can create tools that work with it? That's been one of my frustrations with Spin on P1. There is no official document describing the byte code instruction set.
Of course.
What language are you writing this in?
My fear is that, with the work ersmith has already done (which works with P2 now), we will end up with two Spin-P2s that are never quite the same, because they have different front-end parsers.
I'd say not just the Byte-Code, but also the language itself needs formal documenting, to avoid the difference issues mentioned above.
Even moderate optimization can produce reasonable code. Have you looked at fastspin, Chip? Its optimizer is of middling quality (nowhere near what GCC does, for example) yet the code it produces is, I think, decent. Sizes of binaries are not bad, after inlining and dead code removal. For example, the S3 Scribbler "bp_test" program has the following compiled sizes:
In this case fastspin actually produces smaller binaries (compiling to PASM) than the "regular" bytecode compiler does! Most of that difference is due to removal of unused functions, as we see when looking at the (optimized) bstc result. Still, it's not actually as much overhead as one might think.
Did you use unused method elimination with OpenSpin?
So, I was bored after work, and came up with a real quick and dirty antlr-3 grammar for some of the features I'd like to see in spin-2. Mostly I designed it with forwards compatibility in mind, so some features I assumed from the get-go would be implementation-optional (rational/real/complex numbers), and I decided it would be a good idea not to assume one underlying assembly syntax. The grammar is very much not designed for a single-pass compiler (but is for one with separate compilation). It has monomorphic data types (think C or Go) without a notion of 'void *' yet, though tagged unions are a thing. Enumerative data types are not added (I forgot them, tbh), and statement expressions are a thing I like from GNU GCC C, so they're in there as well. Nested procedures are semantically and syntactically allowed, though procedure values are of debatable usefulness on a microcontroller (they are syntactically expressible, though).
TL;DR: I made a glob-of-features spin-ish language and would like feedback on its possible usefulness to the community.
Oh, I almost forgot: it's very much not backwards compatible, as the original Spin has many features not designed for non-P1 use.
https://gist.github.com/listofoptions/3d136ad6a3dfd77ad5611790ff1c3342
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we ever really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
IIRC I have used more than 8 quite a number of times, but never more than 16. How many others do you need (like CF/ZF, etc.)? Might a total of 16 work, where the local variables might be, say, 12?
I know I have asked before. Just how does the streamer connect to the LUT?
One side of the LUT can be read by the cog for instructions in lutexec.
' .---.---.---.---.---.---.---.---. These opcodes allow fast access by making long access
'$40-7F Fast access VAR, LOC | 0 1 | w | v v v | o o | to the first few long entries in the variable space
' `---^---^---^---^---^---^---^---' or stack a single byte opcode. The single byte opcodes
' | | | are effectively expanded within the interpreter...
' 0= VAR Address 00= PUSH Read - push result in stack
' 1= LOC (adr = v*4) 01= POP Write - pop value from stack
' | | 10= USING 2nd opcode (assignment) executed, result in target
' | | 11= PUSH # Push address of destination into stack
' | `---------|------------------------.
' `-----------. | |
' \|/ \|/ \|/
' .---.---.---.---.---.---.---.---. .---.---.---.---.---.---.---.---.
' 10= long  ===>  | 1 | 1 0 | 0 | 1 w | o o |        | 0 0 0 | v v v | 0 0 |
' `---^---^---^---^---^---^---^---' `---^---^---^---^---^---^---^---'
'-----------------------------------------------------------------------------------------------------------
long $00 <<27 +varop ' 40 VAR PUSH addr=0*4= 00
long $00 <<27 +varop ' 41 VAR POP addr=0*4= 00
long $00 <<27 +varop ' 42 VAR USING addr=0*4= 00
long $00 <<27 +varop ' 43 VAR PUSH # addr=0*4= 00
long $04 <<27 +varop ' 44 VAR PUSH addr=1*4= 04
long $04 <<27 +varop ' 45 VAR POP addr=1*4= 04
long $04 <<27 +varop ' 46 VAR USING addr=1*4= 04
long $04 <<27 +varop ' 47 VAR PUSH # addr=1*4= 04
long $08 <<27 +varop ' 48 VAR PUSH addr=2*4= 08
long $08 <<27 +varop ' 49 VAR POP addr=2*4= 08
long $08 <<27 +varop ' 4A VAR USING addr=2*4= 08
long $08 <<27 +varop ' 4B VAR PUSH # addr=2*4= 08
long $0C <<27 +varop ' 4C VAR PUSH addr=3*4= 0C
long $0C <<27 +varop ' 4D VAR POP addr=3*4= 0C
long $0C <<27 +varop ' 4E VAR USING addr=3*4= 0C
long $0C <<27 +varop ' 4F VAR PUSH # addr=3*4= 0C
long $10 <<27 +varop ' 50 VAR PUSH addr=4*4= 10
long $10 <<27 +varop ' 51 VAR POP addr=4*4= 10
long $10 <<27 +varop ' 52 VAR USING addr=4*4= 10
long $10 <<27 +varop ' 53 VAR PUSH # addr=4*4= 10
long $14 <<27 +varop ' 54 VAR PUSH addr=5*4= 14
long $14 <<27 +varop ' 55 VAR POP addr=5*4= 14
long $14 <<27 +varop ' 56 VAR USING addr=5*4= 14
long $14 <<27 +varop ' 57 VAR PUSH # addr=5*4= 14
long $18 <<27 +varop ' 58 VAR PUSH addr=6*4= 18
long $18 <<27 +varop ' 59 VAR POP addr=6*4= 18
long $18 <<27 +varop ' 5A VAR USING addr=6*4= 18
long $18 <<27 +varop ' 5B VAR PUSH # addr=6*4= 18
long $1C <<27 +varop ' 5C VAR PUSH addr=7*4= 1C
long $1C <<27 +varop ' 5D VAR POP addr=7*4= 1C
long $1C <<27 +varop ' 5E VAR USING addr=7*4= 1C
long $1C <<27 +varop ' 5F VAR PUSH # addr=7*4= 1C
long $00 <<27 +varop ' 60 LOC PUSH addr=0*4= 00
long $00 <<27 +varop ' 61 LOC POP addr=0*4= 00
long $00 <<27 +varop ' 62 LOC USING addr=0*4= 00
long $00 <<27 +varop ' 63 LOC PUSH # addr=0*4= 00
long $04 <<27 +varop ' 64 LOC PUSH addr=1*4= 04
long $04 <<27 +varop ' 65 LOC POP addr=1*4= 04
long $04 <<27 +varop ' 66 LOC USING addr=1*4= 04
long $04 <<27 +varop ' 67 LOC PUSH # addr=1*4= 04
long $08 <<27 +varop ' 68 LOC PUSH addr=2*4= 08
long $08 <<27 +varop ' 69 LOC POP addr=2*4= 08
long $08 <<27 +varop ' 6A LOC USING addr=2*4= 08
long $08 <<27 +varop ' 6B LOC PUSH # addr=2*4= 08
long $0C <<27 +varop ' 6C LOC PUSH addr=3*4= 0C
long $0C <<27 +varop ' 6D LOC POP addr=3*4= 0C
long $0C <<27 +varop ' 6E LOC USING addr=3*4= 0C
long $0C <<27 +varop ' 6F LOC PUSH # addr=3*4= 0C
long $10 <<27 +varop ' 70 LOC PUSH addr=4*4= 10
long $10 <<27 +varop ' 71 LOC POP addr=4*4= 10
long $10 <<27 +varop ' 72 LOC USING addr=4*4= 10
long $10 <<27 +varop ' 73 LOC PUSH # addr=4*4= 10
long $14 <<27 +varop ' 74 LOC PUSH addr=5*4= 14
long $14 <<27 +varop ' 75 LOC POP addr=5*4= 14
long $14 <<27 +varop ' 76 LOC USING addr=5*4= 14
long $14 <<27 +varop ' 77 LOC PUSH # addr=5*4= 14
long $18 <<27 +varop ' 78 LOC PUSH addr=6*4= 18
long $18 <<27 +varop ' 79 LOC POP addr=6*4= 18
long $18 <<27 +varop ' 7A LOC USING addr=6*4= 18
long $18 <<27 +varop ' 7B LOC PUSH # addr=6*4= 18
long $1C <<27 +varop ' 7C LOC PUSH addr=7*4= 1C
long $1C <<27 +varop ' 7D LOC POP addr=7*4= 1C
long $1C <<27 +varop ' 7E LOC USING addr=7*4= 1C
long $1C <<27 +varop ' 7F LOC PUSH # addr=7*4= 1C
' .---.---.---.---.---.---.---.---.
'$80-DF Access MEM, OBJ, | 1 | s s | i | w w | o o | (96 stack load / save opcodes)
' VAR and LOC `---^---^---^---^---^---^---^---'
' | | | |
' 00= Byte | | 00= PUSH Read - push result in stack
' 01= Word | | 01= POP Write - pop value from stack
' 10= Long | | 10= USING 2nd opcode (assignment) executed, result in target
' (11= mathop) | | 11= PUSH # Push address of destination into stack
' | 00= MEM base popped from stack, if i=1 add offset
' | 01= OBJ base is object base , if i=1 add offset
' | 10= VAR base is variable base , if i=1 add offset
' | 11= LOC base is stack base , if i=1 add offset
' 0= no offset
' 1=[]= add offset (indexed)
'-----------------------------------------------------------------------------------------------------------
long memop ' 80 Byte MEM PUSH
long memop ' 81 Byte MEM POP
long memop ' 82 Byte MEM USING
long memop ' 83 Byte MEM PUSH #
long memop ' 84 Byte OBJ PUSH
long memop ' 85 Byte OBJ POP
long memop ' 86 Byte OBJ USING
long memop ' 87 Byte OBJ PUSH #
long memop ' 88 Byte VAR PUSH
long memop ' 89 Byte VAR POP
long memop ' 8A Byte VAR USING
long memop ' 8B Byte VAR PUSH #
long memop ' 8C Byte LOC PUSH
long memop ' 8D Byte LOC POP
long memop ' 8E Byte LOC USING
long memop ' 8F Byte LOC PUSH #
long memop ' 90 Byte [] MEM PUSH
long memop ' 91 Byte [] MEM POP
long memop ' 92 Byte [] MEM USING
long memop ' 93 Byte [] MEM PUSH #
long memop ' 94 Byte [] OBJ PUSH
long memop ' 95 Byte [] OBJ POP
long memop ' 96 Byte [] OBJ USING
long memop ' 97 Byte [] OBJ PUSH #
long memop ' 98 Byte [] VAR PUSH
long memop ' 99 Byte [] VAR POP
long memop ' 9A Byte [] VAR USING
long memop ' 9B Byte [] VAR PUSH #
long memop ' 9C Byte [] LOC PUSH
long memop ' 9D Byte [] LOC POP
long memop ' 9E Byte [] LOC USING
long memop ' 9F Byte [] LOC PUSH #
long memop ' A0 Word MEM PUSH
long memop ' A1 Word MEM POP
long memop ' A2 Word MEM USING
long memop ' A3 Word MEM PUSH #
long memop ' A4 Word OBJ PUSH
long memop ' A5 Word OBJ POP
long memop ' A6 Word OBJ USING
long memop ' A7 Word OBJ PUSH #
long memop ' A8 Word VAR PUSH
long memop ' A9 Word VAR POP
long memop ' AA Word VAR USING
long memop ' AB Word VAR PUSH #
long memop ' AC Word LOC PUSH
long memop ' AD Word LOC POP
long memop ' AE Word LOC USING
long memop ' AF Word LOC PUSH #
long memop ' B0 Word [] MEM PUSH
long memop ' B1 Word [] MEM POP
long memop ' B2 Word [] MEM USING
long memop ' B3 Word [] MEM PUSH #
long memop ' B4 Word [] OBJ PUSH
long memop ' B5 Word [] OBJ POP
long memop ' B6 Word [] OBJ USING
long memop ' B7 Word [] OBJ PUSH #
long memop ' B8 Word [] VAR PUSH
long memop ' B9 Word [] VAR POP
long memop ' BA Word [] VAR USING
long memop ' BB Word [] VAR PUSH #
long memop ' BC Word [] LOC PUSH
long memop ' BD Word [] LOC POP
long memop ' BE Word [] LOC USING
long memop ' BF Word [] LOC PUSH #
long memop ' C0 Long MEM PUSH
long memop ' C1 Long MEM POP
long memop ' C2 Long MEM USING
long memop ' C3 Long MEM PUSH #
long memop ' C4 Long OBJ PUSH
long memop ' C5 Long OBJ POP
long memop ' C6 Long OBJ USING
long memop ' C7 Long OBJ PUSH #
long memop ' C8 Long VAR PUSH \ see also $40-7F bytecodes
long memop ' C9 Long VAR POP |
long memop ' CA Long VAR USING |
long memop ' CB Long VAR PUSH # |
long memop ' CC Long LOC PUSH |
long memop ' CD Long LOC POP |
long memop ' CE Long LOC USING |
long memop ' CF Long LOC PUSH # /
long memop ' D0 Long [] MEM PUSH
long memop ' D1 Long [] MEM POP
long memop ' D2 Long [] MEM USING
long memop ' D3 Long [] MEM PUSH #
long memop ' D4 Long [] OBJ PUSH
long memop ' D5 Long [] OBJ POP
long memop ' D6 Long [] OBJ USING
long memop ' D7 Long [] OBJ PUSH #
long memop ' D8 Long [] VAR PUSH
long memop ' D9 Long [] VAR POP
long memop ' DA Long [] VAR USING
long memop ' DB Long [] VAR PUSH #
long memop ' DC Long [] LOC PUSH
long memop ' DD Long [] LOC POP
long memop ' DE Long [] LOC USING
long memop ' DF Long [] LOC PUSH #
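A worked decode of two entries, to show how the fields above line up:

' $44 = %01_0_001_00  -> fast access: w=0 (VAR), vvv=001 (addr 1*4=$04), oo=00 (PUSH)
'                        i.e. the "44 VAR PUSH" entry in the $40-7F table
' $D8 = %1_10_1_10_00 -> ss=10 (Long), i=1 (indexed), ww=10 (VAR), oo=00 (PUSH)
'                        i.e. the "D8 Long [] VAR PUSH" entry in the $80-DF table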
In my faster spin interpreter, I was able to save quite a lot of cog space (code space) by changing the decoding method to a vector table.
The vector table is 256 longs, one for each of the 256 bytecodes, and resides in hub (no cog space available but could be in LUT for P2).
Each bytecode/long can contain up to 3 9-bit vectors (cog addresses) and 5 special config bits. These are used as jump/call cog addresses (subroutines) for the spin interpreter to run for its respective bytecode. As each 9-bit vector is used (by a jmp/call indirect via the vector location in cog), the vector is shifted right 9 places to expose the next vector. A zero vector is used as the end of the bytecode sequence.
This method permits the additional code space to be used to unravel the interpreter to speed up the code. The maths routines are especially sped up. The listing above includes my vector decode definition; the bytecode is documented in its comments.
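A sketch of that dispatch on P2, taking the option of keeping the 256-long table in LUT (register names assumed; the packing follows the description above, three 9-bit vectors per long, zero ending the sequence):

execute rdbyte  op, pc          'fetch next bytecode from hub
        add     pc, #1
        rdlut   vec, op         'read its vector long from LUT[0..255]
.next   mov     tmp, vec
        and     tmp, #$1FF wz   'isolate the current 9-bit cog address; 0 = done
 if_z   jmp     #execute
        shr     vec, #9         'expose the following vector
        call    tmp             'run the handler subroutine at that address
        jmp     #.next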
Here are some ideas for P2 SPIN...
In P2, many of the mathops might be better implemented as inline PASM, at a slight penalty in code space. Perhaps it could also be a compiler option.
I would think it would be best if the majority of the P2 interpreter were located in LUT space, such that the local longs etc. could be directly accessible in cog space. Some of the rarely used, or non-speed-critical, interpreter routines could remain in hub and run in hubexec mode.
There is probably a benefit to being able to declare a set of variables which could reside in cog: those declared in the VAR section that are not shared with other cogs.
i.e. split the VAR section into 2 sections, VAR_LOCAL (stored in cog) and VAR_GLOBAL (stored in hub).
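In source form that might look something like this (hypothetical syntax, only to make the idea concrete):

VAR_LOCAL                       ' not shared: candidates for cog registers
  long  state, count

VAR_GLOBAL                      ' shared between cogs: must stay in hub
  long  mailbox[4]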
Thanks, Cluso. I've seen a number of non-Parallax documents that claim to describe the Spin byte codes but I was really looking for a definitive Parallax document. Anyway, I've added yours to my collection. Thanks for posting it!
Here is a program that looks up bytes and outputs them to the Prop123-FPGA's LEDs:
dat     org
        or      dirb, #$FF              'make LEDs outputs
loop    getbyti j, #table               'ready byte address in next S and byte number in next N
        getbyt  outb                    'get byte into outb
        incmod  j, #11                  'loop 0..11
        waitx   ##40_000_000            'pause and repeat
        jmp     #loop

table   byte    $01,$02,$04,$08,$10,$20,$40,$80,$FF,$F0,$0F,$00
j       long    0
I changed this instruction block to make it happen. Now SFUNC uses S to determine splitb/mergeb/splitw/mergew/seussf/seussr/rgbsqz/rgbexp on
Here are the aliases:
This will be in v16. We are way better equipped now to handle bytes, nibbles, and words.
Did you use unused method elimination with OpenSpin?
I forgot that you'd added that to OpenSpin (I was running an old version that doesn't have it). Thanks for reminding me! With the new openspin and the -u option to remove unused methods the size goes down to 8344 bytes, which is probably the right size for comparing with fastspin. So the compiled PASM binary is about 1.6x the size of the bytecode binary (and some of that is data, so the code size is probably more like 1.8x or 2.0x). Which can certainly be significant in some applications; OTOH the PASM code is a lot faster -- 8 to 10 times faster in most cases.
Eric,
Thanks for doing the extra testing with -u. I know in the case where memory is available, fastspin is the winner for sure.
I know, with some extra work on the optimizing, you could get fastspin code size down a fair bit.
The perf delta likely won't be as high when compared to the new Spin for P2, especially since Chip is making changes to the Verilog to make coding the Spin interpreter easier and better.