(1) If compatibility with existing code is a concern, local variables could be kept in LUT or COG memory only if their address is not taken. That's the approach I took in fastspin (actually in order to keep compatibility I had to put *all* local variables into HUB memory if any of them had their address taken).
(2) How will we pass data to/from inline assembly? Will it be able to access local variables and/or object variables? In fastspin inline assembly is allowed to access any local variables (including function parameters and return names). A trivial example would be:
PUB myadd(a, b) : r
  ASM
    mov r, a
    add r, b
  ENDASM
This came in really handy -- for example, most of the builtin functions like waitcnt and the lock functions are just implemented with inline assembly. The compiler is smart enough to inline these.
(3) I would definitely suggest that ease of compilation to PASM code be kept in mind when designing Spin2. With the Prop2 having so much more RAM available (and hubexec!) it will definitely be practical to run compiled PASM code, and it would be nice to have the ability to trade off code size for execution speed.
Another thought: if the bytecode is easy enough to translate to PASM, a JIT compiler would become feasible. At runtime small blocks of code could be compiled on the fly to PASM and executed from a cache (maybe in COG memory, maybe using hubexec). That way small loops that fit in cache could execute at full speed. Again, this probably involves a tradeoff of space for speed, but it could be a significant speed boost.
Would that work well with a stack-oriented byte code instruction set or would it be better to use register-oriented instructions?
I'm not sure which would work better. It's certainly possible to JIT a stack-oriented machine, and those tend to have very simple instruction sets (making the JIT compiler easier to implement). OTOH performance might be better if the instruction set were closer to PASM, e.g. like the PropGCC CMM code, where the "bytecodes" are really compressed versions of PASM instructions.
But why do a byte code interpreter at all? Why not just go directly to a native compiler?
Without good optimization, a native compiler will make big code. The byte code starts out optimized. Plus, habit, I guess.
Even moderate optimization can produce reasonable code. Have you looked at fastspin, Chip? Its optimizer is of middling quality (nowhere near what GCC does, for example) yet the code it produces is, I think, decent. Sizes of binaries are not bad, after inlining and dead code removal. For example, the S3 Scribbler "bp_test" program has the following compiled sizes:
openspin: 13952 bytes
fastspin: 13684 bytes
bstc: 10944 bytes
In this case fastspin actually produces smaller binaries (compiling to PASM) than the "regular" bytecode compiler does! Most of that difference is due to removal of unused functions, as we see when looking at the (optimized) bstc result. Still, it's not actually as much overhead as one might think.
fastspin/spin2cpp is open source (MIT) so you certainly could re-use any parts of it that are of interest to you, and I'd be happy to help with that.
I came here this morning thinking about exactly this. After a lot of consideration, using LMM for the procedure stack may just be too much compartmentalization. In my opinion, LMM should just remain unused, left free for the hardware functions that use it.
@ersmith, that's sweet. Exactly what is needed to maximize this design. Is anything done for P2 specifically yet, or is it still P1?
@all
I also thought about the explicit vs. implicit behavior discussion so far. I very strongly favor implicit. It does the reasonable thing, and does it as much as possible.
One of the design goals Chip has had throughout this is to make it conversational, interactive, learn by doing. Seems to me, implicit rules, smart choices like the ones mentioned in this thread, where users aren't responsible for things they may not understand or need yet, make the most sense.
And it makes the most sense because you can converse with, test, explore, and observe something that works far more readily than you can an error message or a debug log.
I like the byte code for its size / speed trade-offs, and we should have it. Despite the larger RAM in this one, once again the possible capability pushes right up against that RAM anyway. That said, the compiler numbers above are encouraging! We definitely want to ensure compilation makes sense, and an LMM stack probably throws a wrench in all of that while not actually yielding a huge benefit in return. Could be a net loss.
On P1, XMM didn't seem to get traction. Maybe it's just too much, COG, LMM, XMM, CMM... LOL!
However, on P2, we have COGex, HUBex, (LUTex, again could be ignored for SPIN), and so here we are again. LOL!
I would totally trade LUTanything for XMM, leaving us with COG, HUB, XMM as execute target models. CMM, may still make sense as it's a virtual machine of sorts and may have uses, debug, traps, memory management through software, etc...
But, COG, HUB, XMM seems like a great target. This chip will benefit from XMM like a P1 doesn't, and that's due to new memory tech being available as well as more I/O, making it practical.
Like P1 SPIN, XMM may never make sense as a supported thing. But, since we can encapsulate PASM into procedures, doing stuff like fetch from [SD, EEPROM, SPI, XMM, whatever] and execute as overlay or from buffer, will be trivial and likely very useful. Just having sensible PASM inline makes this work, but being able to compile in ways that would make doing overlays, etc... work well, seems smart right now, and if it's baked in early, easy and accessible too.
This chip has a lot of COGS and resources. It's gonna see some big applications. Why not bake all that in now?
Another trade-off I would make is to trade the byte code target for compilation options that make doing the above as easy as possible. Because the chip is a multi-processor, being able to move code from storage to ram in a simple, straightforward way makes a ton of sense, and doesn't require an OS, etc... Ideally, that doesn't have to happen, but if some how it does need to, I would definitely take a simple compiler and language support designed to build code to be used in segments over byte code.
Two ways to get at the big program problem then. Byte code, which will cap at a fairly low value, when compared to what is possible by facilitating overlays, dynamic loading and execute. A little thought now could knock doing big programs out of the park! And I'm pretty sure we want that.
I don't understand what you mean when you say "LMM should remain unused". In fact, there won't be any LMM on P2 since we have hubexec. Did you mean to say "LUT should remain unused"? Also, can you describe what you mean by XMM because I'm not sure it makes much sense for P2. It will end up being a P1-style LMM loop with a cache for accessing external RAM. I doubt that would even get as much traction as XMM on P1 since it would be *so* much slower than hubexec.
I meant both. There was, at one point, some LMM discussion. I missed where it died, apparently. Given that, yes. Leave LUT unused.
Smile, I'm on mobile and see I botched it on XMM.
Back in a bit. It's jumbled for sure.
I read your XMM section again and it made more sense the second time. You're not talking necessarily about fetching one instruction at a time. You talk about overlays which I think could work well. In fact, they could work well on P1 as well.
You mention that XMM will be more useful on P2 because of more pins. You can use XMM on P1 with a SPI flash chip that only uses one additional pin if you support a SPI SD card already.
Overlays work well on P1. Heater and I use them in ZiCog to advantage rather than LMM.
On P2 they will load much faster because of the egg beater!!!
However, some routines will just work better as hubexec, while others will work better as overlays. If there is a lot of looping, for example, overlays will run faster than hubexec.
I jumbled C and SPIN in that. For just SPIN, compiling to overlays somehow makes a ton of sense.
Assuming it works like ersmith put here, PASM overlays will make sense no matter what. If it can be made easy to compile a procedure as an overlay target, I would trade that for byte code, if necessary.
I'm thinking of the big program, big buffer case here. Given the hardware capability, I think it's gonna be seen a lot.
@ersmith, that's sweet. Exactly what is needed to maximize this design. Is anything done for P2 specifically yet, or is it still P1?
fastspin / spin2cpp has P2 support, although it isn't up to date (it produces code for an older version of the FPGA). Updating it should be pretty easy. Most of the code generation stuff is independent of P1 or P2, LMM / COG / HUBEXEC; there are just a few places (like putting small loops in FCACHE) that are processor and mode dependent.
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we ever really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
Another thing, about inline PASM: I was going to copy the current data stack frame from LUT and write those values into cog registers 0..7. That way, the PASM code has context that is easily coded for.
You guys have a lot better ideas than me, on the whole. I'm working with what I know, but that will grow during the project.
This morning I've been adding nibble/byte/word prefixes to SETNIB/GETNIB/ROLNIB/etc., so that base and index can be expressed in a prior instruction, giving you random nibble/byte/word reading and writing within cog register space. Someone here had that idea. So obvious, but it never occurred to me. We already had the workhorse instructions in place, too.
Since you're working on the P2 byte code instruction set, I'd like to make a request. Could you document so others can create tools that work with it? That's been one of my frustrations with Spin on P1. There is no official document describing the byte code instruction set.
What language are you writing this in ?
My fear is that, with the work ersmith has already done (which works with P2 now), we will end up with two Spin-P2's that are never quite the same, because they have different front end parsers.
I'd say not just the Byte-Code, but also the language itself needs formal documenting, to avoid the difference issues mentioned above.
Eric
Did you use unused method elimination with OpenSpin?
So, I was bored after work and came up with a real quick and dirty antlr-3 grammar for some of the features I'd like to see in spin-2. Mostly I designed it with forwards compatibility in mind, so some features I assumed from the get-go would be implementation-optional (rational/real/complex numbers), and I decided it would be a good idea not to assume one underlying assembly syntax. The grammar is very much not designed for a single-pass compiler (but is for one with separate compilation). It has monomorphic data types (think C or Go) without a notion of 'void *' yet, though tagged unions are a thing. Enumerative data types are not added (forgot them, tbh), and statement expressions are a thing I like from GNU GCC C, so they're in there as well. Nested procedures are semantically and syntactically allowed, though procedure values are of debatable usefulness on a microcontroller (they are syntactically expressible, though).
TL;DR: I made a glob-of-features spin-ish language and would like feedback on its possible usefulness to the community.
Oh, I almost forgot: very much not backwards compatible, as the original spin has many features not designed for non-P1 use.
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we ever really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
IIRC I have used more than 8 quite a number of times, but never more than 16. How many others do you need (like CF/ZF, etc.)? Might a total of 16 work, where the local variables might be, say, 12?
I know I have asked before. Just how does the streamer connect to the LUT?
One side of the LUT can be read by the cog for instructions in lutexec.
'                            .---.---.---.---.---.---.---.---.
' $40-7F  Fast access        | 0 1 | w | v v v | o o |           These opcodes allow fast access by making long access
'         VAR, LOC           `---^---^---^---^---^---^---^---'   to the first few long entries in the variable space
'                                  |     |       |               or stack a single byte opcode. The single byte opcodes
'                                  |     |       |               are effectively expanded within the interpreter...
'              0= VAR              |     |       00= PUSH    Read  - push result in stack
'              1= LOC (adr = v*4)  |     |       01= POP     Write - pop value from stack
'              |      |                          10= USING   2nd opcode (assignment) executed, result in target
'              |      |                          11= PUSH #  Push address of destination into stack
'              |      `---------|------------------------.
'              `-----------.    |                        |
'                         \|/  \|/                      \|/
'               .---.---.---.---.---.---.---.---.    .---.---.---.---.---.---.---.---.
'  10= long ==> | 1 | 1 0 | 0 | 1 w | o o |          | 0 0 0 | v v v | 0 0 |
'               `---^---^---^---^---^---^---^---'    `---^---^---^---^---^---^---^---'
'-----------------------------------------------------------------------------------------------------------
long    $00<<27 + varop         ' 40 VAR PUSH     addr=0*4=00
long    $00<<27 + varop         ' 41 VAR POP      addr=0*4=00
long    $00<<27 + varop         ' 42 VAR USING    addr=0*4=00
long    $00<<27 + varop         ' 43 VAR PUSH #   addr=0*4=00
long    $04<<27 + varop         ' 44 VAR PUSH     addr=1*4=04
long    $04<<27 + varop         ' 45 VAR POP      addr=1*4=04
long    $04<<27 + varop         ' 46 VAR USING    addr=1*4=04
long    $04<<27 + varop         ' 47 VAR PUSH #   addr=1*4=04
long    $08<<27 + varop         ' 48 VAR PUSH     addr=2*4=08
long    $08<<27 + varop         ' 49 VAR POP      addr=2*4=08
long    $08<<27 + varop         ' 4A VAR USING    addr=2*4=08
long    $08<<27 + varop         ' 4B VAR PUSH #   addr=2*4=08
long    $0C<<27 + varop         ' 4C VAR PUSH     addr=3*4=0C
long    $0C<<27 + varop         ' 4D VAR POP      addr=3*4=0C
long    $0C<<27 + varop         ' 4E VAR USING    addr=3*4=0C
long    $0C<<27 + varop         ' 4F VAR PUSH #   addr=3*4=0C
long    $10<<27 + varop         ' 50 VAR PUSH     addr=4*4=10
long    $10<<27 + varop         ' 51 VAR POP      addr=4*4=10
long    $10<<27 + varop         ' 52 VAR USING    addr=4*4=10
long    $10<<27 + varop         ' 53 VAR PUSH #   addr=4*4=10
long    $14<<27 + varop         ' 54 VAR PUSH     addr=5*4=14
long    $14<<27 + varop         ' 55 VAR POP      addr=5*4=14
long    $14<<27 + varop         ' 56 VAR USING    addr=5*4=14
long    $14<<27 + varop         ' 57 VAR PUSH #   addr=5*4=14
long    $18<<27 + varop         ' 58 VAR PUSH     addr=6*4=18
long    $18<<27 + varop         ' 59 VAR POP      addr=6*4=18
long    $18<<27 + varop         ' 5A VAR USING    addr=6*4=18
long    $18<<27 + varop         ' 5B VAR PUSH #   addr=6*4=18
long    $1C<<27 + varop         ' 5C VAR PUSH     addr=7*4=1C
long    $1C<<27 + varop         ' 5D VAR POP      addr=7*4=1C
long    $1C<<27 + varop         ' 5E VAR USING    addr=7*4=1C
long    $1C<<27 + varop         ' 5F VAR PUSH #   addr=7*4=1C
long    $00<<27 + varop         ' 60 LOC PUSH     addr=0*4=00
long    $00<<27 + varop         ' 61 LOC POP      addr=0*4=00
long    $00<<27 + varop         ' 62 LOC USING    addr=0*4=00
long    $00<<27 + varop         ' 63 LOC PUSH #   addr=0*4=00
long    $04<<27 + varop         ' 64 LOC PUSH     addr=1*4=04
long    $04<<27 + varop         ' 65 LOC POP      addr=1*4=04
long    $04<<27 + varop         ' 66 LOC USING    addr=1*4=04
long    $04<<27 + varop         ' 67 LOC PUSH #   addr=1*4=04
long    $08<<27 + varop         ' 68 LOC PUSH     addr=2*4=08
long    $08<<27 + varop         ' 69 LOC POP      addr=2*4=08
long    $08<<27 + varop         ' 6A LOC USING    addr=2*4=08
long    $08<<27 + varop         ' 6B LOC PUSH #   addr=2*4=08
long    $0C<<27 + varop         ' 6C LOC PUSH     addr=3*4=0C
long    $0C<<27 + varop         ' 6D LOC POP      addr=3*4=0C
long    $0C<<27 + varop         ' 6E LOC USING    addr=3*4=0C
long    $0C<<27 + varop         ' 6F LOC PUSH #   addr=3*4=0C
long    $10<<27 + varop         ' 70 LOC PUSH     addr=4*4=10
long    $10<<27 + varop         ' 71 LOC POP      addr=4*4=10
long    $10<<27 + varop         ' 72 LOC USING    addr=4*4=10
long    $10<<27 + varop         ' 73 LOC PUSH #   addr=4*4=10
long    $14<<27 + varop         ' 74 LOC PUSH     addr=5*4=14
long    $14<<27 + varop         ' 75 LOC POP      addr=5*4=14
long    $14<<27 + varop         ' 76 LOC USING    addr=5*4=14
long    $14<<27 + varop         ' 77 LOC PUSH #   addr=5*4=14
long    $18<<27 + varop         ' 78 LOC PUSH     addr=6*4=18
long    $18<<27 + varop         ' 79 LOC POP      addr=6*4=18
long    $18<<27 + varop         ' 7A LOC USING    addr=6*4=18
long    $18<<27 + varop         ' 7B LOC PUSH #   addr=6*4=18
long    $1C<<27 + varop         ' 7C LOC PUSH     addr=7*4=1C
long    $1C<<27 + varop         ' 7D LOC POP      addr=7*4=1C
long    $1C<<27 + varop         ' 7E LOC USING    addr=7*4=1C
long    $1C<<27 + varop         ' 7F LOC PUSH #   addr=7*4=1C
'
'                            .---.---.---.---.---.---.---.---.
' $80-DF  Access MEM, OBJ,   | 1 | s s | i | w w | o o |         (96 stack load / save opcodes)
'         VAR and LOC        `---^---^---^---^---^---^---^---'
'                                   |    |    |    |
'    00= Byte                       |    |    |    00= PUSH    Read  - push result in stack
'    01= Word                       |    |    |    01= POP     Write - pop value from stack
'    10= Long                       |    |    |    10= USING   2nd opcode (assignment) executed, result in target
'   (11= mathop)                    |    |    |    11= PUSH #  Push address of destination into stack
'                                   |    |    00= MEM  base popped from stack , if i=1 add offset
'                                   |    |    01= OBJ  base is object base    , if i=1 add offset
'                                   |    |    10= VAR  base is variable base  , if i=1 add offset
'                                   |    |    11= LOC  base is stack base     , if i=1 add offset
'                                   |    0= no offset
'                                   |    1= []= add offset (indexed)
'-----------------------------------------------------------------------------------------------------------
long    memop                   ' 80 Byte MEM PUSH
long    memop                   ' 81 Byte MEM POP
long    memop                   ' 82 Byte MEM USING
long    memop                   ' 83 Byte MEM PUSH #
long    memop                   ' 84 Byte OBJ PUSH
long    memop                   ' 85 Byte OBJ POP
long    memop                   ' 86 Byte OBJ USING
long    memop                   ' 87 Byte OBJ PUSH #
long    memop                   ' 88 Byte VAR PUSH
long    memop                   ' 89 Byte VAR POP
long    memop                   ' 8A Byte VAR USING
long    memop                   ' 8B Byte VAR PUSH #
long    memop                   ' 8C Byte LOC PUSH
long    memop                   ' 8D Byte LOC POP
long    memop                   ' 8E Byte LOC USING
long    memop                   ' 8F Byte LOC PUSH #
long    memop                   ' 90 Byte [] MEM PUSH
long    memop                   ' 91 Byte [] MEM POP
long    memop                   ' 92 Byte [] MEM USING
long    memop                   ' 93 Byte [] MEM PUSH #
long    memop                   ' 94 Byte [] OBJ PUSH
long    memop                   ' 95 Byte [] OBJ POP
long    memop                   ' 96 Byte [] OBJ USING
long    memop                   ' 97 Byte [] OBJ PUSH #
long    memop                   ' 98 Byte [] VAR PUSH
long    memop                   ' 99 Byte [] VAR POP
long    memop                   ' 9A Byte [] VAR USING
long    memop                   ' 9B Byte [] VAR PUSH #
long    memop                   ' 9C Byte [] LOC PUSH
long    memop                   ' 9D Byte [] LOC POP
long    memop                   ' 9E Byte [] LOC USING
long    memop                   ' 9F Byte [] LOC PUSH #
long    memop                   ' A0 Word MEM PUSH
long    memop                   ' A1 Word MEM POP
long    memop                   ' A2 Word MEM USING
long    memop                   ' A3 Word MEM PUSH #
long    memop                   ' A4 Word OBJ PUSH
long    memop                   ' A5 Word OBJ POP
long    memop                   ' A6 Word OBJ USING
long    memop                   ' A7 Word OBJ PUSH #
long    memop                   ' A8 Word VAR PUSH
long    memop                   ' A9 Word VAR POP
long    memop                   ' AA Word VAR USING
long    memop                   ' AB Word VAR PUSH #
long    memop                   ' AC Word LOC PUSH
long    memop                   ' AD Word LOC POP
long    memop                   ' AE Word LOC USING
long    memop                   ' AF Word LOC PUSH #
long    memop                   ' B0 Word [] MEM PUSH
long    memop                   ' B1 Word [] MEM POP
long    memop                   ' B2 Word [] MEM USING
long    memop                   ' B3 Word [] MEM PUSH #
long    memop                   ' B4 Word [] OBJ PUSH
long    memop                   ' B5 Word [] OBJ POP
long    memop                   ' B6 Word [] OBJ USING
long    memop                   ' B7 Word [] OBJ PUSH #
long    memop                   ' B8 Word [] VAR PUSH
long    memop                   ' B9 Word [] VAR POP
long    memop                   ' BA Word [] VAR USING
long    memop                   ' BB Word [] VAR PUSH #
long    memop                   ' BC Word [] LOC PUSH
long    memop                   ' BD Word [] LOC POP
long    memop                   ' BE Word [] LOC USING
long    memop                   ' BF Word [] LOC PUSH #
long    memop                   ' C0 Long MEM PUSH
long    memop                   ' C1 Long MEM POP
long    memop                   ' C2 Long MEM USING
long    memop                   ' C3 Long MEM PUSH #
long    memop                   ' C4 Long OBJ PUSH
long    memop                   ' C5 Long OBJ POP
long    memop                   ' C6 Long OBJ USING
long    memop                   ' C7 Long OBJ PUSH #
long    memop                   ' C8 Long VAR PUSH    \ see also $40-7F bytecodes
long    memop                   ' C9 Long VAR POP     |
long    memop                   ' CA Long VAR USING   |
long    memop                   ' CB Long VAR PUSH #  |
long    memop                   ' CC Long LOC PUSH    |
long    memop                   ' CD Long LOC POP     |
long    memop                   ' CE Long LOC USING   |
long    memop                   ' CF Long LOC PUSH #  /
long    memop                   ' D0 Long [] MEM PUSH
long    memop                   ' D1 Long [] MEM POP
long    memop                   ' D2 Long [] MEM USING
long    memop                   ' D3 Long [] MEM PUSH #
long    memop                   ' D4 Long [] OBJ PUSH
long    memop                   ' D5 Long [] OBJ POP
long    memop                   ' D6 Long [] OBJ USING
long    memop                   ' D7 Long [] OBJ PUSH #
long    memop                   ' D8 Long [] VAR PUSH
long    memop                   ' D9 Long [] VAR POP
long    memop                   ' DA Long [] VAR USING
long    memop                   ' DB Long [] VAR PUSH #
long    memop                   ' DC Long [] LOC PUSH
long    memop                   ' DD Long [] LOC POP
long    memop                   ' DE Long [] LOC USING
long    memop                   ' DF Long [] LOC PUSH #
In my faster spin interpreter, I was able to save quite a lot of cog space (code space) by changing the decoding method to a vector table.
The vector table is 256 longs, one for each of the 256 bytecodes, and resides in hub (no cog space available but could be in LUT for P2).
Each bytecode/long can contain up to 3 9-bit vectors (cog addresses) and 5 special config bits. These are used as jump/call cog addresses (subroutines) for the spin interpreter to run for its respective bytecode. As each 9-bit vector is used (by a jmp/call indirect via the vector location in cog), the vector is shifted right >>9 places to the next vector. A zero vector is used as the end of the bytecode sequence.
This method permits the additional code space to be used to unravel the interpreter to speed up the code. The maths routines are especially sped up.
Here are some ideas for P2 SPIN...
In P2, many of the mathops might be better placed as PASM inline code, for a slight penalty in code space. Perhaps it could also be a compiler option.
I would think it would be best if the majority of the P2 interpreter were located in LUT space, such that the local longs etc could be directly accessible in cog space. Some of the rarely used, or non-speed important, interpreter routines could remain in hub and run in hubexec mode.
There is probably a benefit to being able to declare a new local-global set of variables which could reside in cog: those declared in the VAR section that are not shared by other cogs.
i.e. split the VAR section into 2 sections, VAR_LOCAL (stored in cog) and VAR_GLOBAL (stored in hub).
Thanks, Cluso. I've seen a number of non-Parallax documents that claim to describe the Spin byte codes but I was really looking for a definitive Parallax document. Anyway, I've added yours to my collection. Thanks for posting it!
Here is a program that looks up bytes and outputs them to the Prop123-FPGA's LEDs:
dat     org

        or      dirb, #$FF              'make LEDs outputs

loop    getbyti j, #table               'ready byte address in next S and byte number in next N
        getbyt  outb                    'get byte into outb
        incmod  j, #11                  'loop 0..11
        waitx   ##40_000_000            'pause and repeat
        jmp     #loop

table   byte    $01,$02,$04,$08,$10,$20,$40,$80,$FF,$F0,$0F,$00

j       long    0
I changed this instruction block to make it happen. Now SFUNC uses S to determine splitb/mergeb/splitw/mergew/seussf/seussr/rgbsqz/rgbexp on
Did you use unused method elimination with OpenSpin?
I forgot that you'd added that to OpenSpin (I was running an old version that doesn't have it). Thanks for reminding me! With the new openspin and the -u option to remove unused methods the size goes down to 8344 bytes, which is probably the right size for comparing with fastspin. So the compiled PASM binary is about 1.6x the size of the bytecode binary (and some of that is data, so the code size is probably more like 1.8x or 2.0x the size). Which can certainly be significant in some applications; OTOH the PASM code is a lot faster -- 8 to 10 times faster in most cases.
Eric,
Thanks for doing the extra testing with -u. I know in the case where memory is available, fastspin is the winner for sure.
I know that, with some extra work on the optimizer, you could get fastspin's code size down a fair bit.
The perf delta likely won't be as high when compared to the new Spin for P2. Especially when Chip is making changes to the verilog to make coding the Spin interpreter easier and better.
Comments
(1) If compatibility with existing code is a concern, local variables could be kept in LUT or COG memory only if their address is not taken. That's the approach I took in fastspin (actually in order to keep compatibility I had to put *all* local variables into HUB memory if any of them had their address taken).
(2) How will we pass data to/from inline assembly? Will it be able to access local variables and/or object variables? In fastspin inline assembly is allowed to access any local variables (including function parameters and return names). A trivial example would be:
PUB myadd(a, b) : r
  ASM
    mov r, a
    add r, b
  ENDASM
This came in really handy -- for example, most of the builtin functions like waitcnt and the lock functions are just implemented with inline assembly. The compiler is smart enough to inline these.
(3) I would definitely suggest that ease of compilation to PASM code be kept in mind when designing Spin2. With the Prop2 having so much more RAM available (and hubexec!), it will definitely be practical to run compiled PASM code, and it would be nice to have the ability to trade off code size for execution speed.
I'm not sure which would work better. It's certainly possible to JIT a stack-oriented machine, and those tend to have very simple instruction sets (making the JIT compiler easier to implement). OTOH performance might be better if the instruction set were closer to PASM, e.g. like the PropGCC CMM code, where the "bytecodes" are really compressed versions of PASM instructions.
Eric
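To make the stack-vs-register trade-off above concrete, here is a toy sketch (Python, with hypothetical opcode names) of what a JIT for a stack bytecode has to do: it tracks the virtual stack at translate time and maps stack slots onto registers, at which point the emitted code looks much like what a register-oriented bytecode would encode directly:

```python
# Toy illustration of JIT-translating a stack bytecode into register-style
# (PASM-like) instructions. Opcode names are hypothetical.

def jit_translate(bytecode):
    """Map stack slots to registers at translate time; the emitted code is
    register-oriented even though the source bytecode is stack-oriented."""
    stack, out = [], []
    for op, *args in bytecode:
        if op == "PUSH_LOCAL":
            stack.append(f"r{args[0]}")           # locals live in r0..r7
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            out.append(f"add {a}, {b}")           # PASM-style: a += b
            stack.append(a)
        elif op == "RET":
            out.append(f"mov r0, {stack.pop()}")  # result convention: r0
            out.append("ret")
    return out

# 'a + b' expressed as stack bytecode...
code = [("PUSH_LOCAL", 0), ("PUSH_LOCAL", 1), ("ADD",), ("RET",)]
print(jit_translate(code))  # ['add r0, r1', 'mov r0, r0', 'ret']
```

The leftover `mov r0, r0` shows where a register-oriented bytecode wins: it encodes the register assignment up front, so the JIT has less bookkeeping (and fewer redundant moves) to clean up.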
Still, it's not actually as much overhead as one might think. fastspin/spin2cpp is open source (MIT), so you certainly could re-use any parts of it that are of interest to you, and I'd be happy to help with that.
Eric
I came here this morning thinking about exactly this. After a lot of consideration, using the LUT for the procedure stack may just be too much compartmentalization. In my opinion, the LUT should just remain unused by Spin, left free for the hardware functions that use it.
@ersmith, that's sweet.
@all
I also thought about the explicit vs implicit behavior discussion so far. I very strongly favor implicit. It does the reasonable thing, and just does it as much as is possible.
One of the design goals Chip has had throughout this is to make it conversational, interactive, learn by doing. Seems to me, implicit rules, smart choices like the ones mentioned in this thread, where users aren't responsible for things they may not understand or need yet, make the most sense.
And it makes the most sense because a user can converse with, test, explore, and observe something that works far more readily than an error message or a debug log.
I like the byte code for its size/speed trade-offs, and we should have it. Despite the larger RAM in this one, once again the possible capability pushes right up against that RAM anyway. That said, the compiler numbers above are encouraging! We definitely want to ensure compilation makes sense, and a LUT stack probably throws a wrench in all of that while not actually yielding a huge benefit in return. Could be a net loss.
On P1, XMM didn't seem to get traction. Maybe it's just too much, COG, LMM, XMM, CMM... LOL!
However, on P2, we have COGex, HUBex, (LUTex, again could be ignored for SPIN), and so here we are again. LOL!
I would totally trade LUTanything for XMM, leaving us with COG, HUB, XMM as execute target models. CMM, may still make sense as it's a virtual machine of sorts and may have uses, debug, traps, memory management through software, etc...
But, COG, HUB, XMM seems like a great target. This chip will benefit from XMM like a P1 doesn't, and that's due to new memory tech being available as well as more I/O, making it practical.
Like P1 SPIN, XMM may never make sense as a supported thing. But, since we can encapsulate PASM into procedures, doing stuff like fetch from [SD, EEPROM, SPI, XMM, whatever] and execute as overlay or from buffer, will be trivial and likely very useful. Just having sensible PASM inline makes this work, but being able to compile in ways that would make doing overlays, etc... work well, seems smart right now, and if it's baked in early, easy and accessible too.
This chip has a lot of COGS and resources. It's gonna see some big applications. Why not bake all that in now?
Another trade-off I would make is to trade the byte code target for compilation options that make doing the above as easy as possible. Because the chip is a multi-processor, being able to move code from storage to ram in a simple, straightforward way makes a ton of sense, and doesn't require an OS, etc... Ideally, that doesn't have to happen, but if some how it does need to, I would definitely take a simple compiler and language support designed to build code to be used in segments over byte code.
Two ways to get at the big program problem then. Byte code, which will cap at a fairly low value, when compared to what is possible by facilitating overlays, dynamic loading and execute. A little thought now could knock doing big programs out of the park! And I'm pretty sure we want that.
IMHO, YMMV, of course.
Smile, I'm on mobile and see I botched it on XMM.
Back in a bit. It's jumbled for sure.
You mention that XMM will be more useful on P2 because of more pins. You can use XMM on P1 with a SPI flash chip that only uses one additional pin if you support a SPI SD card already.
On P2 they will load much faster because of the egg beater!!!
However, some routines will just work better as hubexec, while others will work better as overlays. If there is a lot of looping, for example, overlays will run faster than hubexec.
I jumbled C and SPIN in that. For just SPIN, compiling to overlays somehow makes a ton of sense.
Assuming it works like ersmith put here, PASM overlays will make sense no matter what. If it can be made easy to compile a procedure as an overlay target, I would trade that for byte code, if necessary.
I'm thinking of the big program, big buffer case here. Given the hardware capability, I think it's gonna be seen a lot.
fastspin / spin2cpp has P2 support, although it isn't up to date (it produces code for an older version of the FPGA). Updating it should be pretty easy. Most of the code generation stuff is independent of P1 or P2, LMM / COG / HUBEXEC; there are just a few places (like putting small loops in FCACHE) that are processor and mode dependent.
Eric
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we ever really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
Another thing, about inline PASM: I was going to copy the current data stack frame from LUT and write those values into cog registers 0..7. That way, the PASM code has context that is easily coded for.
You guys have a lot better ideas than me, on the whole. I'm working with what I know, but that will grow during the project.
This morning I've been adding nibble/byte/word prefixes to SETNIB/GETNIB/ROLNIB/etc., so that base and index can be expressed in a prior instruction, giving you random nibble/byte/word reading and writing within cog register space. Someone here had that idea. So obvious, but it never occurred to me. We already had the workhorse instructions in place, too.
Of course.
What language are you writing this in ?
My fear is that, given the work ersmith has already done (which works with P2 now), we will end up with two Spin-P2's that are never quite the same, because they have different front-end parsers.
I'd say not just the Byte-Code, but also the language itself needs formal documenting, to avoid the difference issues mentioned above.
TL;DR: I made a glob-of-features Spin-ish language and would like feedback on its possible usefulness to the community.
Oh, I almost forgot: very much not backwards compatible, as the original spin has many features not designed for non-P1 use.
https://gist.github.com/listofoptions/3d136ad6a3dfd77ad5611790ff1c3342 (edited)
I know I have asked before. Just how does the streamer connect to the LUT?
One side of the LUT can be read by the cog for instructions in lutexec.
This includes my vector decode definition - the bytecode is listed in the comments section.
Part 1 of 4
P1 Spin Bytecode (from ClusoInterpreter_260C_007F)
==================================================
'$00-3F Special purpose opcodes
'$40-7F Fast access VAR and LOC
'$80-BF Access MEM, OBJ, VAR and LOC
'$C0-FF Unary and Binary operators
'
'                                   .---.---.---.---.---.---.---.---.
'$00-3F Special purpose opcodes     | 0 | 0 | o   o   o   o   o   o |
'                                   `---^---^---^---^---^---^---^---'
'                                       op  set byte      description                  pops push  extra bytes
'-------------------------------------------------------------------------------------------------------
long  j0                                ' 00  0  000000tp  drop anchor (t=try, !p=push)
long  j0                                ' 01     000000tp  drop anchor (t=try, !p=push)
long  j0                                ' 02     000000tp  drop anchor (t=try, !p=push)
long  j0                                ' 03     000000tp  drop anchor (t=try, !p=push)
long  j1_0                              ' 04  1  00000100  jmp                               +1..2 address
long  j1_123                            ' 05     00000101  call sub                          +1 sub
long  j1_123                            ' 06     00000110  call obj.sub                      +2 obj+sub
long  j1_123 <<9+ popx                  ' 07     00000111  call obj[].sub          1         +2 obj+sub
long  j2 <<9+ popx                      ' 08  2  00001000  tjz                     1   0/1   +1..2 address
long  j2 <<9+ popx                      ' 09     00001001  djnz                    1   0/1   +1..2 address
long  j2 <<9+ popx                      ' 0A     00001010  jz                      1         +1..2 address
long  j2 <<9+ popx                      ' 0B     00001011  jnz                     1         +1..2 address
long  j3_0 <<9+ popyx                   ' 0C  3  00001100  casedone                2         +1..2 address
long  j3_12 <<9+ popyx                  ' 0D     00001101  value case              1         +1..2 address
long  j3_12 <<9+ popayx                 ' 0E     00001110  range case              2         +1..2 address
long  j3_3                              ' 0F     00001111  lookdone                3   1
long  j4_01 <<9+ popyx                  ' 10  4  00010000  value lookup            1
long  j4_01 <<9+ popyx                  ' 11     00010001  value lookdown          1
long  j4_23 <<9+ popayx                 ' 12     00010010  range lookup            2
long  j4_23 <<9+ popayx                 ' 13     00010011  range lookdown          2
long  j5_0 <<9+ popx                    ' 14  5  00010100  pop                     1+        ???1+
long  j5_1                              ' 15     00010101  run
long  j5_23 <<9+ popx                   ' 16     00010110  STRSIZE(string)         1   1
long  j5_23 <<9+ popyx                  ' 17     00010111  STRCOMP(stringa,stringb) 2  1
long  i_WRBYTE <<18+ j6_012 <<9+ popayx ' 18  6  00011000  BYTEFILL(start,value,count) 3
long  i_WRWORD <<18+ j6_012 <<9+ popayx ' 19     00011001  WORDFILL(start,value,count) 3
long  i_WRLONG <<18+ j6_012 <<9+ popayx ' 1A     00011010  LONGFILL(start,value,count) 3
long  j6_3 <<9+ popayx                  ' 1B     00011011  WAITPEQ(data,mask,port) 3
long  i_WRBYTE <<18+ j7_012 <<9+ popayx ' 1C  7  00011100  BYTEMOVE(to,from,count) 3
long  i_WRWORD <<18+ j7_012 <<9+ popayx ' 1D     00011101  WORDMOVE(to,from,count) 3
long  i_WRLONG <<18+ j7_012 <<9+ popayx ' 1E     00011110  LONGMOVE(to,from,count) 3
long  j7_3 <<9+ popayx                  ' 1F     00011111  WAITPNE(data,mask,port) 3
long  j8_0 <<9+ popyx                   ' 20  8  00100000  CLKSET(mode,freq)       2
long  j8_1 <<9+ popx                    ' 21     00100001  COGSTOP(id)             1
long  j8_2 <<9+ popx                    ' 22     00100010  LOCKRET(id)             1
long  j8_3 <<9+ popx                    ' 23     00100011  WAITCNT(count)          1
long  j9_012 <<9+ popx                  ' 24  9  001001oo  SPR[nibble] op push     1         +1 if assign
long  j9_012 <<9+ popx                  ' 25     001001oo  SPR[nibble] op pop      1         +1 if assign
long  j9_012 <<9+ popx                  ' 26     001001oo  SPR[nibble] op using    1         +1 if assign
long  j9_3 <<9+ popyx                   ' 27     00100111  WAITVID(colors,pixels)  2
long  jAB_0 <<9+ popayx                 ' 28  A  00101p00  COGINIT(id,adr,ptr)     3   1     (!p=push)
long  jAB_1                             ' 29     00101p01  LOCKNEW                     1     (!p=push)
long  jAB_2 <<9+ popx                   ' 2A     00101p10  LOCKSET(id)             1   1     (!p=push)
long  jAB_3 <<9+ popx                   ' 2B     00101p11  LOCKCLR(id)             1   1     (!p=push)
long  jAB_0 <<9+ popayx                 ' 2C  B  00101p00  COGINIT(id,adr,ptr)     3   0     (no push)
long  jAB_1                             ' 2D     00101p01  LOCKNEW                     0     (no push)
long  jAB_2 <<9+ popx                   ' 2E     00101p10  LOCKSET(id)             1   0     (no push)
long  jAB_3 <<9+ popx                   ' 2F     00101p11  LOCKCLR(id)             1   0     (no push)
long  jC_02                             ' 30  C  00110000  ABORT
long  jC_13 <<9+ popx                   ' 31     00110001  ABORT value             1
long  jC_02                             ' 32     00110010  RETURN
long  jC_13 <<9+ popx                   ' 33     00110011  RETURN value            1
long  jD_012                            ' 34  D  001101cc  PUSH #-1                    1
long  jD_012                            ' 35     001101cc  PUSH #0                     1
long  jD_012                            ' 36     001101cc  PUSH #1                     1
long  jD_3                              ' 37     00110111  PUSH #kp                    1     +1 maskdata
long  jE                                ' 38  E  001110bb  PUSH #k1 (1 byte)           1     +1 constant
long  jE                                ' 39     001110bb  PUSH #k2 (2 bytes)          1     +2 constant
long  jE                                ' 3A     001110bb  PUSH #k3 (3 bytes)          1     +3 constant
long  jE                                ' 3B     001110bb  PUSH #k4 (4 bytes)          1     +4 constant
long  jF_0                              ' 3C  F  00111100  <unused>
long  jF_123                            ' 3D     00111101  register[bit] op        1         +1 reg+op, +1 if assign
long  jF_123                            ' 3E     00111110  register[bit..bit] op   2         +1 reg+op, +1 if assign
long  jF_123                            ' 3F     00111111  register op                       +1 reg+op, +1 if assign
'from Hippy
'3F  80+n  PUSH  spr
'3F  A0+n  POP   spr
'3F  C0+n  USING spr
'                               .---.---.---.---.---.---.---.---.   These opcodes allow fast access by making long access
'$40-7F Fast access VAR, LOC    | 0   1 | w | v   v   v | o   o |   to the first few long entries in the variable space
'                               `---^---^---^---^---^---^---^---'   or stack a single byte opcode. The single byte opcodes
'                                         |     |         |         are effectively expanded within the interpreter...
'                                 0= VAR  Address         00= PUSH    Read  - push result in stack
'                                 1= LOC  (adr = v*4)     01= POP     Write - pop value from stack
'                                         |               10= USING   2nd opcode (assignment) executed, result in target
'                                         |               11= PUSH #  Push address of destination into stack
'                                         `---------|------------------------.
'                                 `-----------.     |                        |
'                                            \|/   \|/                      \|/
'                 .---.---.---.---.---.---.---.---.     .---.---.---.---.---.---.---.---.
' 10= long? ===>  | 1 |?1?  0 | 0 | 1   w | o   o |     | 0   0   0 | v   v   v | 0   0 |
'                 `---^---^---^---^---^---^---^---'     `---^---^---^---^---^---^---^---'
'-----------------------------------------------------------------------------------------------------------
long  $00 <<27 +varop   ' 40  VAR PUSH    addr=0*4= 00
long  $00 <<27 +varop   ' 41  VAR POP     addr=0*4= 00
long  $00 <<27 +varop   ' 42  VAR USING   addr=0*4= 00
long  $00 <<27 +varop   ' 43  VAR PUSH #  addr=0*4= 00
long  $04 <<27 +varop   ' 44  VAR PUSH    addr=1*4= 04
long  $04 <<27 +varop   ' 45  VAR POP     addr=1*4= 04
long  $04 <<27 +varop   ' 46  VAR USING   addr=1*4= 04
long  $04 <<27 +varop   ' 47  VAR PUSH #  addr=1*4= 04
long  $08 <<27 +varop   ' 48  VAR PUSH    addr=2*4= 08
long  $08 <<27 +varop   ' 49  VAR POP     addr=2*4= 08
long  $08 <<27 +varop   ' 4A  VAR USING   addr=2*4= 08
long  $08 <<27 +varop   ' 4B  VAR PUSH #  addr=2*4= 08
long  $0C <<27 +varop   ' 4C  VAR PUSH    addr=3*4= 0C
long  $0C <<27 +varop   ' 4D  VAR POP     addr=3*4= 0C
long  $0C <<27 +varop   ' 4E  VAR USING   addr=3*4= 0C
long  $0C <<27 +varop   ' 4F  VAR PUSH #  addr=3*4= 0C
long  $10 <<27 +varop   ' 50  VAR PUSH    addr=4*4= 10
long  $10 <<27 +varop   ' 51  VAR POP     addr=4*4= 10
long  $10 <<27 +varop   ' 52  VAR USING   addr=4*4= 10
long  $10 <<27 +varop   ' 53  VAR PUSH #  addr=4*4= 10
long  $14 <<27 +varop   ' 54  VAR PUSH    addr=5*4= 14
long  $14 <<27 +varop   ' 55  VAR POP     addr=5*4= 14
long  $14 <<27 +varop   ' 56  VAR USING   addr=5*4= 14
long  $14 <<27 +varop   ' 57  VAR PUSH #  addr=5*4= 14
long  $18 <<27 +varop   ' 58  VAR PUSH    addr=6*4= 18
long  $18 <<27 +varop   ' 59  VAR POP     addr=6*4= 18
long  $18 <<27 +varop   ' 5A  VAR USING   addr=6*4= 18
long  $18 <<27 +varop   ' 5B  VAR PUSH #  addr=6*4= 18
long  $1C <<27 +varop   ' 5C  VAR PUSH    addr=7*4= 1C
long  $1C <<27 +varop   ' 5D  VAR POP     addr=7*4= 1C
long  $1C <<27 +varop   ' 5E  VAR USING   addr=7*4= 1C
long  $1C <<27 +varop   ' 5F  VAR PUSH #  addr=7*4= 1C
long  $00 <<27 +varop   ' 60  LOC PUSH    addr=0*4= 00
long  $00 <<27 +varop   ' 61  LOC POP     addr=0*4= 00
long  $00 <<27 +varop   ' 62  LOC USING   addr=0*4= 00
long  $00 <<27 +varop   ' 63  LOC PUSH #  addr=0*4= 00
long  $04 <<27 +varop   ' 64  LOC PUSH    addr=1*4= 04
long  $04 <<27 +varop   ' 65  LOC POP     addr=1*4= 04
long  $04 <<27 +varop   ' 66  LOC USING   addr=1*4= 04
long  $04 <<27 +varop   ' 67  LOC PUSH #  addr=1*4= 04
long  $08 <<27 +varop   ' 68  LOC PUSH    addr=2*4= 08
long  $08 <<27 +varop   ' 69  LOC POP     addr=2*4= 08
long  $08 <<27 +varop   ' 6A  LOC USING   addr=2*4= 08
long  $08 <<27 +varop   ' 6B  LOC PUSH #  addr=2*4= 08
long  $0C <<27 +varop   ' 6C  LOC PUSH    addr=3*4= 0C
long  $0C <<27 +varop   ' 6D  LOC POP     addr=3*4= 0C
long  $0C <<27 +varop   ' 6E  LOC USING   addr=3*4= 0C
long  $0C <<27 +varop   ' 6F  LOC PUSH #  addr=3*4= 0C
long  $10 <<27 +varop   ' 70  LOC PUSH    addr=4*4= 10
long  $10 <<27 +varop   ' 71  LOC POP     addr=4*4= 10
long  $10 <<27 +varop   ' 72  LOC USING   addr=4*4= 10
long  $10 <<27 +varop   ' 73  LOC PUSH #  addr=4*4= 10
long  $14 <<27 +varop   ' 74  LOC PUSH    addr=5*4= 14
long  $14 <<27 +varop   ' 75  LOC POP     addr=5*4= 14
long  $14 <<27 +varop   ' 76  LOC USING   addr=5*4= 14
long  $14 <<27 +varop   ' 77  LOC PUSH #  addr=5*4= 14
long  $18 <<27 +varop   ' 78  LOC PUSH    addr=6*4= 18
long  $18 <<27 +varop   ' 79  LOC POP     addr=6*4= 18
long  $18 <<27 +varop   ' 7A  LOC USING   addr=6*4= 18
long  $18 <<27 +varop   ' 7B  LOC PUSH #  addr=6*4= 18
long  $1C <<27 +varop   ' 7C  LOC PUSH    addr=7*4= 1C
long  $1C <<27 +varop   ' 7D  LOC POP     addr=7*4= 1C
long  $1C <<27 +varop   ' 7E  LOC USING   addr=7*4= 1C
long  $1C <<27 +varop   ' 7F  LOC PUSH #  addr=7*4= 1C
'                               .---.---.---.---.---.---.---.---.
'$80-DF Access MEM, OBJ,        | 1 | s   s | i | w   w | o   o |   (96 stack load / save opcodes)
'       VAR and LOC             `---^---^---^---^---^---^---^---'
'                                     |       |   |       |
'                         00= Byte    |       |   00= PUSH    Read  - push result in stack
'                         01= Word    |       |   01= POP     Write - pop value from stack
'                         10= Long    |       |   10= USING   2nd opcode (assignment) executed, result in target
'                        (11= mathop) |       |   11= PUSH #  Push address of destination into stack
'                                     |   00= MEM  base popped from stack , if i=1 add offset
'                                     |   01= OBJ  base is object base    , if i=1 add offset
'                                     |   10= VAR  base is variable base  , if i=1 add offset
'                                     |   11= LOC  base is stack base     , if i=1 add offset
'                                     0= no offset
'                                     1=[]= add offset (indexed)
'-----------------------------------------------------------------------------------------------------------
long  memop   ' 80  Byte    MEM PUSH
long  memop   ' 81  Byte    MEM POP
long  memop   ' 82  Byte    MEM USING
long  memop   ' 83  Byte    MEM PUSH #
long  memop   ' 84  Byte    OBJ PUSH
long  memop   ' 85  Byte    OBJ POP
long  memop   ' 86  Byte    OBJ USING
long  memop   ' 87  Byte    OBJ PUSH #
long  memop   ' 88  Byte    VAR PUSH
long  memop   ' 89  Byte    VAR POP
long  memop   ' 8A  Byte    VAR USING
long  memop   ' 8B  Byte    VAR PUSH #
long  memop   ' 8C  Byte    LOC PUSH
long  memop   ' 8D  Byte    LOC POP
long  memop   ' 8E  Byte    LOC USING
long  memop   ' 8F  Byte    LOC PUSH #
long  memop   ' 90  Byte [] MEM PUSH
long  memop   ' 91  Byte [] MEM POP
long  memop   ' 92  Byte [] MEM USING
long  memop   ' 93  Byte [] MEM PUSH #
long  memop   ' 94  Byte [] OBJ PUSH
long  memop   ' 95  Byte [] OBJ POP
long  memop   ' 96  Byte [] OBJ USING
long  memop   ' 97  Byte [] OBJ PUSH #
long  memop   ' 98  Byte [] VAR PUSH
long  memop   ' 99  Byte [] VAR POP
long  memop   ' 9A  Byte [] VAR USING
long  memop   ' 9B  Byte [] VAR PUSH #
long  memop   ' 9C  Byte [] LOC PUSH
long  memop   ' 9D  Byte [] LOC POP
long  memop   ' 9E  Byte [] LOC USING
long  memop   ' 9F  Byte [] LOC PUSH #
long  memop   ' A0  Word    MEM PUSH
long  memop   ' A1  Word    MEM POP
long  memop   ' A2  Word    MEM USING
long  memop   ' A3  Word    MEM PUSH #
long  memop   ' A4  Word    OBJ PUSH
long  memop   ' A5  Word    OBJ POP
long  memop   ' A6  Word    OBJ USING
long  memop   ' A7  Word    OBJ PUSH #
long  memop   ' A8  Word    VAR PUSH
long  memop   ' A9  Word    VAR POP
long  memop   ' AA  Word    VAR USING
long  memop   ' AB  Word    VAR PUSH #
long  memop   ' AC  Word    LOC PUSH
long  memop   ' AD  Word    LOC POP
long  memop   ' AE  Word    LOC USING
long  memop   ' AF  Word    LOC PUSH #
long  memop   ' B0  Word [] MEM PUSH
long  memop   ' B1  Word [] MEM POP
long  memop   ' B2  Word [] MEM USING
long  memop   ' B3  Word [] MEM PUSH #
long  memop   ' B4  Word [] OBJ PUSH
long  memop   ' B5  Word [] OBJ POP
long  memop   ' B6  Word [] OBJ USING
long  memop   ' B7  Word [] OBJ PUSH #
long  memop   ' B8  Word [] VAR PUSH
long  memop   ' B9  Word [] VAR POP
long  memop   ' BA  Word [] VAR USING
long  memop   ' BB  Word [] VAR PUSH #
long  memop   ' BC  Word [] LOC PUSH
long  memop   ' BD  Word [] LOC POP
long  memop   ' BE  Word [] LOC USING
long  memop   ' BF  Word [] LOC PUSH #
long  memop   ' C0  Long    MEM PUSH
long  memop   ' C1  Long    MEM POP
long  memop   ' C2  Long    MEM USING
long  memop   ' C3  Long    MEM PUSH #
long  memop   ' C4  Long    OBJ PUSH
long  memop   ' C5  Long    OBJ POP
long  memop   ' C6  Long    OBJ USING
long  memop   ' C7  Long    OBJ PUSH #
long  memop   ' C8  Long    VAR PUSH    \ see also $40-7F bytecodes
long  memop   ' C9  Long    VAR POP     |
long  memop   ' CA  Long    VAR USING   |
long  memop   ' CB  Long    VAR PUSH #  |
long  memop   ' CC  Long    LOC PUSH    |
long  memop   ' CD  Long    LOC POP     |
long  memop   ' CE  Long    LOC USING   |
long  memop   ' CF  Long    LOC PUSH #  /
long  memop   ' D0  Long [] MEM PUSH
long  memop   ' D1  Long [] MEM POP
long  memop   ' D2  Long [] MEM USING
long  memop   ' D3  Long [] MEM PUSH #
long  memop   ' D4  Long [] OBJ PUSH
long  memop   ' D5  Long [] OBJ POP
long  memop   ' D6  Long [] OBJ USING
long  memop   ' D7  Long [] OBJ PUSH #
long  memop   ' D8  Long [] VAR PUSH
long  memop   ' D9  Long [] VAR POP
long  memop   ' DA  Long [] VAR USING
long  memop   ' DB  Long [] VAR PUSH #
long  memop   ' DC  Long [] LOC PUSH
long  memop   ' DD  Long [] LOC POP
long  memop   ' DE  Long [] LOC USING
long  memop   ' DF  Long [] LOC PUSH #
'                               .---.---.---.---.---.---.---.---.
'$E0-FF Math operation          | 1   1   1 | o   o   o   o   o |   (32 maths opcodes)
'                               `---^---^---^---^---^---^---^---'
'
'                               .---.---.---.---.---.---.---.---.
' Math Assignment (USING)       | p   1   s | o   o   o   o   o |   (32 maths opcodes) "op2"
'       operation               `---^---^---^---^---^---^---^---'
'                                 |       |
'                                 |  (!s) 0 = swap binary args
'                                 |       1 = no swap
'                            (!p) 0 = push
'                                 1 = no push
'                                                               unary/                         unary/
'                                                               binary instr  instr code       binary        normal assign  description
'----------------------------------------------------------------------------------------------------------------------------
long  i_bin +i_ROR  <<18 +math_E0 <<9 +math_bin  '$041  E0  00000  ROR      1st ->  2nd   b    ->     ->=    rotate right
long  i_bin +i_ROL  <<18 +math_E0 <<9 +math_bin  '$049  E1  00001  ROL      1st <-  2nd   b    <-     <-=    rotate left
long  i_bin +i_SHR  <<18 +math_E0 <<9 +math_bin  '$051  E2  00010  SHR      1st >>  2nd   b    >>     >>=    shift right
long  i_bin +i_SHL  <<18 +math_E0 <<9 +math_bin  '$059  E3  00011  SHL      1st <<  2nd   b    <<     <<=    shift left
long  i_bin +i_MINS <<18 +math_E0 <<9 +math_bin  '$081  E4  00100  MINs     1st #>  2nd   b    #>     #>=    limit minimum (signed)
long  i_bin +i_MAXS <<18 +math_E0 <<9 +math_bin  '$089  E5  00101  MAXs     1st <#  2nd   b    <#     <#=    limit maximum (signed)
long  i_un  +i_NEG  <<18 +math_E0 <<9 +math_un   '$149  E6  00110  NEG      -   1st       unary -      -     negate
long  i_un  +0      <<18 +math_E7 <<9 +math_un   '      E7  00111  BIT_NOT  !   1st       unary !      !     bitwise not
long  i_bin +i_AND  <<18 +math_E0 <<9 +math_bin  '$0C1  E8  01000  BIT_AND  1st &   2nd   b    &      &=     bitwise and
long  i_un  +i_ABS  <<18 +math_E0 <<9 +math_un   '$151  E9  01001  ABS      ABS( 1st )    unary ||     ||    absolute
long  i_bin +i_OR   <<18 +math_E0 <<9 +math_bin  '$0D1  EA  01010  BIT_OR   1st |   2nd   b    |      |=     bitwise or
long  i_bin +i_XOR  <<18 +math_E0 <<9 +math_bin  '$0D9  EB  01011  BIT_XOR  1st ^   2nd   b    ^      ^=     bitwise xor
long  i_bin +i_ADD  <<18 +math_E0 <<9 +math_bin  '$101  EC  01100  ADD      1st +   2nd   b    +      +=     add
long  i_bin +i_SUB  <<18 +math_E0 <<9 +math_bin  '$109  ED  01101  SUB      1st -   2nd   b    -      -=     subtract
long  i_bin +i_SAR  <<18 +math_E0 <<9 +math_bin  '$071  EE  01110  SAR      1st ~>  2nd   b    ~>     ~>=    shift arithmetic right
long  i_bin +0      <<18 +math_EF <<9 +math_bin  '$079  EF  01111  BIT_REV  1st ><  2nd   b    ><     ><=    reverse bits (neg y first)
long  i_bin +i_AND  <<18 +math_F0 <<9 +math_bin  '$0C1  F0  10000  LOG_AND  1st AND 2nd   b    AND           boolean and
long  i_un  +0      <<18 +math_F1 <<9 +math_un   '      F1  10001  ENCODE   >|  1st       unary >|     >|    encode (0-32)
long  i_bin +i_OR   <<18 +math_F0 <<9 +math_bin  '$0D1  F2  10010  LOG_OR   1st OR  2nd   b    OR            boolean or
long  i_un  +0      <<18 +math_F3 <<9 +math_un   '      F3  10011  DECODE   |<  1st       unary |<     |<    decode
long  i_bin +0      <<18 +math_F4 <<9 +math_bin  '      F4  10100  MPY      1st *   2nd   b    *      *=     multiply, return lower half (signed)
long  i_bin +0      <<18 +math_F4 <<9 +math_bin  '      F5  10101  MPY_MSW  1st **  2nd   b    **     **=    multiply, return upper half (signed)
long  i_bin +0      <<18 +math_F4 <<9 +math_bin  '      F6  10110  DIV      1st /   2nd   b    /      /=     divide, return quotient (signed)
long  i_bin +0      <<18 +math_F4 <<9 +math_bin  '      F7  10111  MOD      1st //  2nd   b    //     //=    divide, return remainder (signed)
long  i_un  +0      <<18 +math_F8 <<9 +math_un   '      F8  11000  SQRT     ^^  1st       unary ^^     ^^    square root
long  i_bin +0      <<18 +math_F9 <<9 +math_bin  '      F9  11001  LT       1st <   2nd   b    <             test below (signed)
long  i_bin +0      <<18 +math_F9 <<9 +math_bin  '      FA  11010  GT       1st >   2nd   b    >             test above (signed)
long  i_bin +0      <<18 +math_F9 <<9 +math_bin  '      FB  11011  NE       1st <>  2nd   b    <>            test not equal
long  i_bin +0      <<18 +math_F9 <<9 +math_bin  '      FC  11100  EQ       1st ==  2nd   b    ==            test equal
long  i_bin +0      <<18 +math_F9 <<9 +math_bin  '      FD  11101  LE       1st =<  2nd   b    =<            test below or equal (signed)
long  i_bin +0      <<18 +math_F9 <<9 +math_bin  '      FE  11110  GE       1st =>  2nd   b    =>            test above or equal (signed)
long  i_un  +0      <<18 +math_FF <<9 +math_un   '      FF  11111  LOG_NOT  NOT 1st       unary NOT    NOT   boolean not
The vector table is 256 longs, one for each of the 256 bytecodes, and resides in hub (no cog space available but could be in LUT for P2).
Each bytecode's long can contain up to three 9-bit vectors (cog addresses) plus 5 special config bits. These are used as jump/call cog addresses (subroutines) for the Spin interpreter to run for its respective bytecode. As each 9-bit vector is consumed (by a jmp/call indirect via the vector location in cog), the long is shifted right 9 places to expose the next vector. A zero vector marks the end of that bytecode's sequence.
This method permits the additional code space to be used to unravel the interpreter to speed up the code. The maths routines are especially sped up.
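The vector-long mechanism described above can be sketched in a few lines of Python: pack up to three 9-bit cog addresses into one 32-bit long, then dispatch by calling the low 9 bits and shifting until a zero vector terminates. The routine addresses below are made up for illustration.

```python
# Sketch of the vector-table dispatch described above: each 32-bit long
# packs up to three 9-bit cog-address vectors (plus 5 spare/config bits);
# the interpreter calls the low 9 bits, shifts right by 9, stops at zero.

def pack_vectors(*addrs):
    """Pack up to three nonzero 9-bit cog addresses, first-called in the low bits."""
    assert len(addrs) <= 3 and all(0 < a < 512 for a in addrs)
    word = 0
    for a in reversed(addrs):
        word = (word << 9) | a
    return word

def dispatch(word):
    """Return the cog addresses in call order, ending at the zero vector."""
    calls = []
    while word & 0x1FF:
        calls.append(word & 0x1FF)
        word >>= 9
    return calls

v = pack_vectors(0x041, 0x150, 0x0A2)   # hypothetical routine addresses
print(dispatch(v))                       # [65, 336, 162]
```

This also shows why the scheme unravels the interpreter so effectively: one table long chains up to three cog subroutines with no per-bytecode decode logic beyond shift-and-mask.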
Here are some ideas for P2 SPIN...
In P2, many of the mathops might be better placed as pasm inline code, for a slight penalty of code space. Perhaps it could also be a compiler option.
I would think it would be best if the majority of the P2 interpreter were located in LUT space, such that the local longs etc could be directly accessible in cog space. Some of the rarely used, or non-speed important, interpreter routines could remain in hub and run in hubexec mode.
There is probably a benefit to being able to declare a set of cog-resident variables: those declared in the VAR section that are not shared with other cogs could live in cog RAM.
I.e. split the VAR section into 2 sections, VAR_LOCAL (stored in cog) and VAR_GLOBAL (stored in hub).
Here is a program that looks up bytes and outputs them to the Prop123-FPGA's LEDs:
dat     org
        or      dirb,#$FF       'make LEDs outputs
loop    getbyti j,#table        'ready byte address in next S and byte number in next N
        getbyt  outb            'get byte into outb
        incmod  j,#11           'loop 0..11
        waitx   ##40_000_000    'pause and repeat
        jmp     #loop
table   byte    $01,$02,$04,$08,$10,$20,$40,$80,$FF,$F0,$0F,$00
j       long    0
I changed this instruction block to make it happen. Now SFUNC uses S to select splitb/mergeb/splitw/mergew/seussf/seussr/rgbsqz/rgbexp on D:
EEEE 100000N NNI DDDDDDDDD SSSSSSSSS    SETNIB  D,S/#,#N
EEEE 100001N NNI DDDDDDDDD SSSSSSSSS    GETNIB  D,S/#,#N
EEEE 100010N NNI DDDDDDDDD SSSSSSSSS    ROLNIB  D,S/#,#N
EEEE 1000110 NNI DDDDDDDDD SSSSSSSSS    SETBYT  D,S/#,#N
EEEE 1000111 NNI DDDDDDDDD SSSSSSSSS    GETBYT  D,S/#,#N
EEEE 1001000 NNI DDDDDDDDD SSSSSSSSS    ROLBYT  D,S/#,#N
EEEE 1001001 0NI DDDDDDDDD SSSSSSSSS    SETWRD  D,S/#,#N
EEEE 1001001 1NI DDDDDDDDD SSSSSSSSS    GETWRD  D,S/#,#N
EEEE 1001010 0NI DDDDDDDDD SSSSSSSSS    ROLWRD  D,S/#,#N
EEEE 1001010 10I DDDDDDDDD SSSSSSSSS    SETNIBI D,S/#
EEEE 1001010 11I DDDDDDDDD SSSSSSSSS    GETNIBI D,S/#
EEEE 1001011 00I DDDDDDDDD SSSSSSSSS    SETBYTI D,S/#
EEEE 1001011 01I DDDDDDDDD SSSSSSSSS    GETBYTI D,S/#
EEEE 1001011 10I DDDDDDDDD SSSSSSSSS    SETWRDI D,S/#
EEEE 1001011 11I DDDDDDDDD SSSSSSSSS    GETWRDI D,S/#
EEEE 1001100 00I DDDDDDDDD SSSSSSSSS    ALTR    D,S/#
EEEE 1001100 01I DDDDDDDDD SSSSSSSSS    ALTD    D,S/#
EEEE 1001100 10I DDDDDDDDD SSSSSSSSS    ALTS    D,S/#
EEEE 1001100 11I DDDDDDDDD SSSSSSSSS    ALTB    D,S/#
EEEE 1001101 00I DDDDDDDDD SSSSSSSSS    ALTI    D,S/#
EEEE 1001101 01I DDDDDDDDD SSSSSSSSS    SETR    D,S/#
EEEE 1001101 10I DDDDDDDDD SSSSSSSSS    SETD    D,S/#
EEEE 1001101 11I DDDDDDDDD SSSSSSSSS    SETS    D,S/#
EEEE 1001110 00I DDDDDDDDD SSSSSSSSS    BMASK   D,S/#
EEEE 1001110 01I DDDDDDDDD SSSSSSSSS    BMASKN  D,S/#
EEEE 1001110 10I DDDDDDDDD SSSSSSSSS    TRIML   D,S/#
EEEE 1001110 11I DDDDDDDDD SSSSSSSSS    TRIMR   D,S/#
EEEE 1001111 00I DDDDDDDDD SSSSSSSSS    DECOD   D,S/#
EEEE 1001111 01I DDDDDDDDD SSSSSSSSS    REV     D,S/#
EEEE 1001111 10I DDDDDDDDD SSSSSSSSS    MOVBYTS D,S/#
EEEE 1001111 11I DDDDDDDDD SSSSSSSSS    SFUNC   D,S/#
Here are the aliases:
SETNIB  reg/#  =  SETNIB  0,reg/#,#0    (follows SETNIBI)
GETNIB  reg    =  GETNIB  reg,0,#0      (follows GETNIBI)
ROLNIB  reg    =  ROLNIB  reg,0,#0      (follows GETNIBI)
SETBYT  reg/#  =  SETBYT  0,reg/#,#0    (follows SETBYTI)
GETBYT  reg    =  GETBYT  reg,0,#0      (follows GETBYTI)
ROLBYT  reg    =  ROLBYT  reg,0,#0      (follows GETBYTI)
SETWRD  reg/#  =  SETWRD  0,reg/#,#0    (follows SETWRDI)
GETWRD  reg    =  GETWRD  reg,0,#0      (follows GETWRDI)
ROLWRD  reg    =  ROLWRD  reg,0,#0      (follows GETWRDI)
SETNIBI reg    =  SETNIBI reg,#0
GETNIBI reg    =  GETNIBI reg,#0
SETBYTI reg    =  SETBYTI reg,#0
GETBYTI reg    =  GETBYTI reg,#0
SETWRDI reg    =  SETWRDI reg,#0
GETWRDI reg    =  GETWRDI reg,#0
ALTR    reg    =  ALTR    reg,#0
ALTD    reg    =  ALTD    reg,#0
ALTS    reg    =  ALTS    reg,#0
ALTB    reg    =  ALTB    reg,#0
ALTI    reg    =  ALTI    reg,#%101_100_100   (substitute reg for next instruction)
BMASK   reg    =  BMASK   reg,reg
BMASKN  reg    =  BMASKN  reg,reg
DECOD   reg    =  DECOD   reg,reg
REV     reg    =  REV     reg,reg
SPLITB  reg    =  SFUNC   reg,#0
MERGEB  reg    =  SFUNC   reg,#1
SPLITW  reg    =  SFUNC   reg,#2
MERGEW  reg    =  SFUNC   reg,#3
SEUSSF  reg    =  SFUNC   reg,#4
SEUSSR  reg    =  SFUNC   reg,#5
RGBSQZ  reg    =  SFUNC   reg,#6
RGBEXP  reg    =  SFUNC   reg,#7
This will be in v16. We are way better equipped now to handle bytes, nibbles, and words.
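For reference, the field selection these nibble/byte/word instructions perform is ordinary shift-and-mask arithmetic. Here is a Python sketch of GETNIB/SETNIB-style semantics (my reading of the descriptions above, not an official spec):

```python
# Shift-and-mask arithmetic behind GETNIB/SETNIB-style field access.
# This is my reading of the instructions' behavior, not an official spec.

def getnib(d, n):
    """Return nibble n (0 = least significant) of a 32-bit value."""
    return (d >> (4 * n)) & 0xF

def setnib(d, value, n):
    """Return d with nibble n replaced by the low 4 bits of value."""
    mask = 0xF << (4 * n)
    return (d & ~mask & 0xFFFFFFFF) | ((value & 0xF) << (4 * n))

x = 0x12345678
print(hex(getnib(x, 0)))       # 0x8
print(hex(getnib(x, 7)))       # 0x1
print(hex(setnib(x, 0xA, 4)))  # 0x123a5678
```

Byte and word variants are the same idea with 8-bit and 16-bit masks and shift multiples; the new prefix instructions just let the base register and index N come from a prior instruction instead of being fixed in the opcode.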