(1) If compatibility with existing code is a concern, local variables could be kept in LUT or COG memory only if their address is not taken. That's the approach I took in fastspin (actually in order to keep compatibility I had to put *all* local variables into HUB memory if any of them had their address taken).
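As a sketch of that rule in plain Spin (method names made up for illustration):

PUB no_address(x) : r
  r := x + 1                    ' no @ used, so x and r could live in COG or LUT

PUB takes_address(x) : r | buf
  buf := x
  longmove(@r, @buf, 1)         ' @ is taken, so (in fastspin) all locals here go to HUB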
(2) How will we pass data to/from inline assembly? Will it be able to access local variables and/or object variables? In fastspin inline assembly is allowed to access any local variables (including function parameters and return names). A trivial example would be:
PUB myadd(a, b) : r
  ASM
    mov r, a
    add r, b
  ENDASM
This came in really handy -- for example, most of the builtin functions like waitcnt and the lock functions are just implemented with inline assembly. The compiler is smart enough to inline these.
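For instance, a waitcnt-style builtin could be little more than this (a sketch; fastspin's actual library code may differ in detail):

PUB mywaitcnt(target)
  ASM
    waitcnt target, #0          ' P1: stall until CNT reaches target
  ENDASM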
(3) I would definitely suggest that ease of compilation to PASM code be kept in mind when designing Spin2. With the Prop2 having so much more RAM available (and hubexec!) it will definitely be practical to run compiled PASM code, and it would be nice to have the ability to trade off code size for execution speed.
Another thought: if the bytecode is easy enough to translate to PASM, a JIT compiler would become feasible. At runtime small blocks of code could be compiled on the fly to PASM and executed from a cache (maybe in COG memory, maybe using hubexec). That way small loops that fit in cache could execute at full speed. Again, this probably involves a tradeoff of space for speed, but it could be a significant speed boost.
Would that work well with a stack-oriented byte code instruction set or would it be better to use register-oriented instructions?
I'm not sure which would work better. It's certainly possible to JIT a stack-oriented machine, and those tend to have very simple instruction sets (making the JIT compiler easier to implement). OTOH performance might be better if the instruction set were closer to PASM, e.g. like the PropGCC CMM code, where the "bytecodes" are really compressed versions of PASM instructions.
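To make the comparison concrete, here is a := b + c in both styles (purely illustrative encodings, not any real instruction set):

' stack-oriented bytecode:
'   PUSH b
'   PUSH c
'   ADD
'   POP a        ' four tiny opcodes; the JIT must model the stack to emit good PASM
'
' register-oriented (CMM-like) bytecode:
'   MOV a, b
'   ADD a, c     ' nearly 1:1 with PASM, so the JIT is close to a decompressor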
But why do a byte code interpreter at all? Why not just go directly to a native compiler?
Without good optimization, a native compiler will make big code. The byte code starts out optimized. Plus, habit, I guess.
Even moderate optimization can produce reasonable code. Have you looked at fastspin, Chip? Its optimizer is of middling quality (nowhere near what GCC does, for example) yet the code it produces is, I think, decent. Sizes of binaries are not bad, after inlining and dead code removal. For example, the S3 Scribbler "bp_test" program has the following compiled sizes:
In this case fastspin actually produces smaller binaries (compiling to PASM) than the "regular" bytecode compiler does! Most of that difference is due to removal of unused functions, as we see when looking at the (optimized) bstc result. Still, it's not actually as much overhead as one might think.
fastspin/spin2cpp is open source (MIT) so you certainly could re-use any parts of it that are of interest to you, and I'd be happy to help with that.
I came here this morning, thinking about exactly this. After a lot of consideration, using LMM for the procedure stack may just be too much compartmentalization. It's my opinion that LMM should just remain unused, left free for the hardware functions that use it.
@ersmith, that's sweet. Exactly what is needed to maximize this design. Is anything done for P2 specifically yet, or is it still P1?
@all
I also thought about the explicit vs implicit behavior discussion so far. I very strongly favor implicit. It does the reasonable thing, and just does it as much as is possible.
One of the design goals Chip has had throughout this is to make it conversational, interactive, learn by doing. Seems to me, implicit rules, smart choices like the ones mentioned in this thread, where users aren't responsible for things they may not understand or need yet, make the most sense.
And it makes the most sense because people can converse with, test, explore, and observe something that works far more readily than they can an error message or a debug log.
I like the byte code for its size/speed trade-offs, and we should have it. Despite the larger RAM in this one, once again the possible capability pushes right up against that RAM anyway. That said, the compiler numbers above are encouraging! We definitely want to ensure compilation makes sense, and an LMM stack probably throws a wrench in all of that while not actually yielding a huge benefit in return. Could be a net loss.
On P1, XMM didn't seem to get traction. Maybe it's just too much, COG, LMM, XMM, CMM... LOL!
However, on P2, we have COGex, HUBex, (LUTex, again could be ignored for SPIN), and so here we are again. LOL!
I would totally trade LUT-anything for XMM, leaving us with COG, HUB, XMM as execute target models. CMM may still make sense, as it's a virtual machine of sorts and may have uses: debug, traps, memory management through software, etc...
But, COG, HUB, XMM seems like a great target. This chip will benefit from XMM like a P1 doesn't, and that's due to new memory tech being available as well as more I/O, making it practical.
Like P1 SPIN, XMM may never make sense as a supported thing. But, since we can encapsulate PASM into procedures, doing stuff like fetch from [SD, EEPROM, SPI, XMM, whatever] and execute as overlay or from buffer will be trivial and likely very useful. Just having sensible PASM inline makes this work, but being able to compile in ways that would make doing overlays, etc. work well seems smart right now, and if it's baked in early, easy and accessible too.
This chip has a lot of COGS and resources. It's gonna see some big applications. Why not bake all that in now?
Another trade-off I would make is to trade the byte code target for compilation options that make doing the above as easy as possible. Because the chip is a multi-processor, being able to move code from storage to RAM in a simple, straightforward way makes a ton of sense, and doesn't require an OS, etc... Ideally, that doesn't have to happen, but if somehow it does need to, I would definitely take a simple compiler and language support designed to build code to be used in segments over byte code.
Two ways to get at the big program problem, then: byte code, which will cap at a fairly low value compared to what is possible by facilitating overlays and dynamic loading and execution. A little thought now could knock doing big programs out of the park! And I'm pretty sure we want that.
IMHO, YMMV, of course.
I don't understand what you mean when you say "LMM should remain unused". In fact, there won't be any LMM on P2 since we have hubexec. Did you mean to say "LUT should remain unused"? Also, can you describe what you mean by XMM because I'm not sure it makes much sense for P2. It will end up being a P1-style LMM loop with a cache for accessing external RAM. I doubt that would even get as much traction as XMM on P1 since it would be *so* much slower than hubexec.
I meant both. There was, at one point, some LMM discussion. I missed where it died, apparently. Given that, yes. Leave LUT unused.
Smile, I'm on mobile and see I botched it on XMM.
Back in a bit. It's jumbled for sure.
I read your XMM section again and it made more sense the second time. You're not talking necessarily about fetching one instruction at a time. You talk about overlays which I think could work well. In fact, they could work well on P1 as well.
You mention that XMM will be more useful on P2 because of more pins. You can use XMM on P1 with a SPI flash chip that only uses one additional pin if you support a SPI SD card already.
Overlays work well on P1. Heater and I use them in ZiCog to advantage rather than LMM.
On P2 they will load much faster because of the egg beater!!!
However, some routines will just work better as hubexec, while others will work better as overlays. If there is a lot of looping, for example, overlays will run faster than hubexec.
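On P2 the overlay mechanics can be tiny. A sketch with made-up labels (ovl_image is the overlay's hub copy; the setq burst is what the egg beater accelerates):

        setq    #32-1                   'burst-read 32 longs...
        rdlong  ovl_base, ##@ovl_image  '...from hub into cog regs ovl_base..ovl_base+31
        call    #ovl_base               'run the overlay; it ends with a RET

ovl_base res     32                     'cog-RAM area reserved for overlays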
I jumbled C and SPIN in that. For just SPIN, compile-to-overlay somehow makes a ton of sense.
Assuming it works like ersmith put it here, PASM overlays will make sense no matter what. If it can be made easy to compile a procedure as an overlay target, I would trade that for byte code, if necessary.
I'm thinking of the big program, big buffer case here. Given the hardware capability, I think it's gonna be seen a lot.
@ersmith, that's sweet. Exactly what is needed to maximize this design. Is anything done for P2 specifically yet, or is it still P1?
fastspin / spin2cpp has P2 support, although it isn't up to date (it produces code for an older version of the FPGA). Updating it should be pretty easy. Most of the code generation stuff is independent of P1 or P2, LMM / COG / HUBEXEC; there are just a few places (like putting small loops in FCACHE) that are processor and mode dependent.
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we ever really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
Another thing, about inline PASM: I was going to copy the current data stack frame from LUT and write those values into cog registers 0..7. That way, the PASM code has context that is easily coded for.
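That copy could be a short loop like this (a sketch; frame, ix, and .loop are assumed names, with frame holding the LUT address of the current stack frame):

        mov     ix, #0
.loop   altd    ix, #0          'aim the next instruction's D field at cog reg ix
        rdlut   0-0, frame      'reg[ix] := LUT[frame]
        add     frame, #1
        incmod  ix, #7  wc      'ix cycles 0..7; C set on wrap
if_nc   jmp     #.loop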
You guys have a lot better ideas than me, on the whole. I'm working with what I know, but that will grow during the project.
This morning I've been adding nibble/byte/word prefixes to SETNIB/GETNIB/ROLNIB/etc., so that base and index can be expressed in a prior instruction, giving you random nibble/byte/word reading and writing within cog register space. Someone here had that idea. So obvious, but it never occurred to me. We already had the workhorse instructions in place, too.
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we every really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
Another thing, about inline PASM: I was going to copy the current data stack frame from LUT and write those values into cog registers 0..7. That way, the PASM code has context that is easily coded for.
You guys have a lot better ideas than me, on the whole. I'm working with what I know, but that will grow during the project.
This morning I've been adding nibble/byte/word prefixes to SETNIB/GETNIB/ROLNIB/etc., so that base and index can be expressed in a prior instruction, giving you random nibble/byte/word reading and writing within cog register space. Someone here had that idea. So obvious, but it never occurred to me. We already had the workhorse instructions in place, too.
Since you're working on the P2 byte code instruction set, I'd like to make a request. Could you document it so others can create tools that work with it? That's been one of my frustrations with Spin on P1. There is no official document describing the byte code instruction set.
Of course.
What language are you writing this in?
My fear is that, with the work ersmith has already done (which works with P2 now), we will end up with two Spin-P2s that are never quite the same, because they have different front-end parsers.
I'd say not just the Byte-Code, but also the language itself needs formal documenting, to avoid the difference issues mentioned above.
Even moderate optimization can produce reasonable code. Have you looked at fastspin, Chip? Its optimizer is of middling quality (nowhere near what GCC does, for example) yet the code it produces is, I think, decent. Sizes of binaries are not bad, after inlining and dead code removal. For example, the S3 Scribbler "bp_test" program has the following compiled sizes:
In this case fastspin actually produces smaller binaries (compiling to PASM) than the "regular" bytecode compiler does! Most of that difference is due to removal of unused functions, as we see when looking at the (optimized) bstc result. Still, it's not actually as much overhead as one might think.
Did you use unused method elimination with OpenSpin?
So, I was bored after work, and came up with a real quick and dirty antlr-3 grammar for some of the features I'd like to see in spin-2. Mostly I designed it with forwards compatibility in mind, so some features I assumed from the get-go would be implementation-optional (rational/real/complex numbers), and I decided it would be a good idea not to assume one underlying assembly syntax. The grammar is very much not designed for a single-pass compiler (but is for one with separate compilation). It has monomorphic data types (think C or Go) without a notion of 'void *' yet, though tagged unions are a thing. Enumerative data types are not added (I forgot them, tbh), and statement expressions are a thing I like from GNU GCC C, so they're in there as well. Nested procedures are semantically and syntactically allowed, though procedure values are of debatable usefulness on a microcontroller (they are syntactically expressible, though).
TL;DR: I made a glob-of-features spin-ish language and would like feedback on its possible usefulness to the community.
Oh, I almost forgot: it's very much not backwards compatible, as the original Spin has many features not designed for non-P1 use.
https://gist.github.com/listofoptions/3d136ad6a3dfd77ad5611790ff1c3342
I was planning on having 8 local long variables, plus CF/ZF bits for PASM-instruction procedures. Eight sounds like nothing, I know, but do we ever really need more than that? I ask because it reduces the byte token count and keeps housekeeping tight.
IIRC I have used more than 8 quite a number of times, but never more than 16. How many others do you need (like CF/ZF, etc.)? Might a total of 16 work, where the local variables might be, say, 12?
I know I have asked before. Just how does the streamer connect to the LUT?
One side of the LUT can be read by the cog for instructions in lutexec.
' .---.---.---.---.---.---.---.---. These opcodes allow fast access by making long access
'$40-7F Fast access VAR, LOC | 0 1 | w | v v v | o o | to the first few long entries in the variable space
' `---^---^---^---^---^---^---^---' or stack a single byte opcode. The single byte opcodes
' | | | are effectively expanded within the interpreter...
' 0= VAR Address 00= PUSH Read - push result in stack
' 1= LOC (adr = v*4) 01= POP Write - pop value from stack
' | | 10= USING 2nd opcode (assignment) executed, result in target
' | | 11= PUSH # Push address of destination into stack
' | `---------|------------------------.
' `-----------. | |
' \|/ \|/ \|/
' .---.---.---.---.---.---.---.---. .---.---.---.---.---.---.---.---.
' 10= long  ===>  | 1 | 1 0 | 0 | 1 w | o o |        | 0 0 0 | v v v | 0 0 |
' `---^---^---^---^---^---^---^---' `---^---^---^---^---^---^---^---'
'-----------------------------------------------------------------------------------------------------------
long $00 <<27 +varop ' 40 VAR PUSH addr=0*4= 00
long $00 <<27 +varop ' 41 VAR POP addr=0*4= 00
long $00 <<27 +varop ' 42 VAR USING addr=0*4= 00
long $00 <<27 +varop ' 43 VAR PUSH # addr=0*4= 00
long $04 <<27 +varop ' 44 VAR PUSH addr=1*4= 04
long $04 <<27 +varop ' 45 VAR POP addr=1*4= 04
long $04 <<27 +varop ' 46 VAR USING addr=1*4= 04
long $04 <<27 +varop ' 47 VAR PUSH # addr=1*4= 04
long $08 <<27 +varop ' 48 VAR PUSH addr=2*4= 08
long $08 <<27 +varop ' 49 VAR POP addr=2*4= 08
long $08 <<27 +varop ' 4A VAR USING addr=2*4= 08
long $08 <<27 +varop ' 4B VAR PUSH # addr=2*4= 08
long $0C <<27 +varop ' 4C VAR PUSH addr=3*4= 0C
long $0C <<27 +varop ' 4D VAR POP addr=3*4= 0C
long $0C <<27 +varop ' 4E VAR USING addr=3*4= 0C
long $0C <<27 +varop ' 4F VAR PUSH # addr=3*4= 0C
long $10 <<27 +varop ' 50 VAR PUSH addr=4*4= 10
long $10 <<27 +varop ' 51 VAR POP addr=4*4= 10
long $10 <<27 +varop ' 52 VAR USING addr=4*4= 10
long $10 <<27 +varop ' 53 VAR PUSH # addr=4*4= 10
long $14 <<27 +varop ' 54 VAR PUSH addr=5*4= 14
long $14 <<27 +varop ' 55 VAR POP addr=5*4= 14
long $14 <<27 +varop ' 56 VAR USING addr=5*4= 14
long $14 <<27 +varop ' 57 VAR PUSH # addr=5*4= 14
long $18 <<27 +varop ' 58 VAR PUSH addr=6*4= 18
long $18 <<27 +varop ' 59 VAR POP addr=6*4= 18
long $18 <<27 +varop ' 5A VAR USING addr=6*4= 18
long $18 <<27 +varop ' 5B VAR PUSH # addr=6*4= 18
long $1C <<27 +varop ' 5C VAR PUSH addr=7*4= 1C
long $1C <<27 +varop ' 5D VAR POP addr=7*4= 1C
long $1C <<27 +varop ' 5E VAR USING addr=7*4= 1C
long $1C <<27 +varop ' 5F VAR PUSH # addr=7*4= 1C
long $00 <<27 +varop ' 60 LOC PUSH addr=0*4= 00
long $00 <<27 +varop ' 61 LOC POP addr=0*4= 00
long $00 <<27 +varop ' 62 LOC USING addr=0*4= 00
long $00 <<27 +varop ' 63 LOC PUSH # addr=0*4= 00
long $04 <<27 +varop ' 64 LOC PUSH addr=1*4= 04
long $04 <<27 +varop ' 65 LOC POP addr=1*4= 04
long $04 <<27 +varop ' 66 LOC USING addr=1*4= 04
long $04 <<27 +varop ' 67 LOC PUSH # addr=1*4= 04
long $08 <<27 +varop ' 68 LOC PUSH addr=2*4= 08
long $08 <<27 +varop ' 69 LOC POP addr=2*4= 08
long $08 <<27 +varop ' 6A LOC USING addr=2*4= 08
long $08 <<27 +varop ' 6B LOC PUSH # addr=2*4= 08
long $0C <<27 +varop ' 6C LOC PUSH addr=3*4= 0C
long $0C <<27 +varop ' 6D LOC POP addr=3*4= 0C
long $0C <<27 +varop ' 6E LOC USING addr=3*4= 0C
long $0C <<27 +varop ' 6F LOC PUSH # addr=3*4= 0C
long $10 <<27 +varop ' 70 LOC PUSH addr=4*4= 10
long $10 <<27 +varop ' 71 LOC POP addr=4*4= 10
long $10 <<27 +varop ' 72 LOC USING addr=4*4= 10
long $10 <<27 +varop ' 73 LOC PUSH # addr=4*4= 10
long $14 <<27 +varop ' 74 LOC PUSH addr=5*4= 14
long $14 <<27 +varop ' 75 LOC POP addr=5*4= 14
long $14 <<27 +varop ' 76 LOC USING addr=5*4= 14
long $14 <<27 +varop ' 77 LOC PUSH # addr=5*4= 14
long $18 <<27 +varop ' 78 LOC PUSH addr=6*4= 18
long $18 <<27 +varop ' 79 LOC POP addr=6*4= 18
long $18 <<27 +varop ' 7A LOC USING addr=6*4= 18
long $18 <<27 +varop ' 7B LOC PUSH # addr=6*4= 18
long $1C <<27 +varop ' 7C LOC PUSH addr=7*4= 1C
long $1C <<27 +varop ' 7D LOC POP addr=7*4= 1C
long $1C <<27 +varop ' 7E LOC USING addr=7*4= 1C
long $1C <<27 +varop ' 7F LOC PUSH # addr=7*4= 1C
' .---.---.---.---.---.---.---.---.
'$80-DF Access MEM, OBJ, | 1 | s s | i | w w | o o | (96 stack load / save opcodes)
' VAR and LOC `---^---^---^---^---^---^---^---'
' | | | |
' 00= Byte | | 00= PUSH Read - push result in stack
' 01= Word | | 01= POP Write - pop value from stack
' 10= Long | | 10= USING 2nd opcode (assignment) executed, result in target
' (11= mathop) | | 11= PUSH # Push address of destination into stack
' | 00= MEM base popped from stack, if i=1 add offset
' | 01= OBJ base is object base , if i=1 add offset
' | 10= VAR base is variable base , if i=1 add offset
' | 11= LOC base is stack base , if i=1 add offset
' 0= no offset
' 1=[]= add offset (indexed)
'-----------------------------------------------------------------------------------------------------------
long memop ' 80 Byte MEM PUSH
long memop ' 81 Byte MEM POP
long memop ' 82 Byte MEM USING
long memop ' 83 Byte MEM PUSH #
long memop ' 84 Byte OBJ PUSH
long memop ' 85 Byte OBJ POP
long memop ' 86 Byte OBJ USING
long memop ' 87 Byte OBJ PUSH #
long memop ' 88 Byte VAR PUSH
long memop ' 89 Byte VAR POP
long memop ' 8A Byte VAR USING
long memop ' 8B Byte VAR PUSH #
long memop ' 8C Byte LOC PUSH
long memop ' 8D Byte LOC POP
long memop ' 8E Byte LOC USING
long memop ' 8F Byte LOC PUSH #
long memop ' 90 Byte [] MEM PUSH
long memop ' 91 Byte [] MEM POP
long memop ' 92 Byte [] MEM USING
long memop ' 93 Byte [] MEM PUSH #
long memop ' 94 Byte [] OBJ PUSH
long memop ' 95 Byte [] OBJ POP
long memop ' 96 Byte [] OBJ USING
long memop ' 97 Byte [] OBJ PUSH #
long memop ' 98 Byte [] VAR PUSH
long memop ' 99 Byte [] VAR POP
long memop ' 9A Byte [] VAR USING
long memop ' 9B Byte [] VAR PUSH #
long memop ' 9C Byte [] LOC PUSH
long memop ' 9D Byte [] LOC POP
long memop ' 9E Byte [] LOC USING
long memop ' 9F Byte [] LOC PUSH #
long memop ' A0 Word MEM PUSH
long memop ' A1 Word MEM POP
long memop ' A2 Word MEM USING
long memop ' A3 Word MEM PUSH #
long memop ' A4 Word OBJ PUSH
long memop ' A5 Word OBJ POP
long memop ' A6 Word OBJ USING
long memop ' A7 Word OBJ PUSH #
long memop ' A8 Word VAR PUSH
long memop ' A9 Word VAR POP
long memop ' AA Word VAR USING
long memop ' AB Word VAR PUSH #
long memop ' AC Word LOC PUSH
long memop ' AD Word LOC POP
long memop ' AE Word LOC USING
long memop ' AF Word LOC PUSH #
long memop ' B0 Word [] MEM PUSH
long memop ' B1 Word [] MEM POP
long memop ' B2 Word [] MEM USING
long memop ' B3 Word [] MEM PUSH #
long memop ' B4 Word [] OBJ PUSH
long memop ' B5 Word [] OBJ POP
long memop ' B6 Word [] OBJ USING
long memop ' B7 Word [] OBJ PUSH #
long memop ' B8 Word [] VAR PUSH
long memop ' B9 Word [] VAR POP
long memop ' BA Word [] VAR USING
long memop ' BB Word [] VAR PUSH #
long memop ' BC Word [] LOC PUSH
long memop ' BD Word [] LOC POP
long memop ' BE Word [] LOC USING
long memop ' BF Word [] LOC PUSH #
long memop ' C0 Long MEM PUSH
long memop ' C1 Long MEM POP
long memop ' C2 Long MEM USING
long memop ' C3 Long MEM PUSH #
long memop ' C4 Long OBJ PUSH
long memop ' C5 Long OBJ POP
long memop ' C6 Long OBJ USING
long memop ' C7 Long OBJ PUSH #
long memop ' C8 Long VAR PUSH \ see also $40-7F bytecodes
long memop ' C9 Long VAR POP |
long memop ' CA Long VAR USING |
long memop ' CB Long VAR PUSH # |
long memop ' CC Long LOC PUSH |
long memop ' CD Long LOC POP |
long memop ' CE Long LOC USING |
long memop ' CF Long LOC PUSH # /
long memop ' D0 Long [] MEM PUSH
long memop ' D1 Long [] MEM POP
long memop ' D2 Long [] MEM USING
long memop ' D3 Long [] MEM PUSH #
long memop ' D4 Long [] OBJ PUSH
long memop ' D5 Long [] OBJ POP
long memop ' D6 Long [] OBJ USING
long memop ' D7 Long [] OBJ PUSH #
long memop ' D8 Long [] VAR PUSH
long memop ' D9 Long [] VAR POP
long memop ' DA Long [] VAR USING
long memop ' DB Long [] VAR PUSH #
long memop ' DC Long [] LOC PUSH
long memop ' DD Long [] LOC POP
long memop ' DE Long [] LOC USING
long memop ' DF Long [] LOC PUSH #
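A worked decode of two entries, to show how the fields above line up:

' $44 = %01_0_001_00  -> fast access: w=0 (VAR), vvv=001 (addr 1*4=$04), oo=00 (PUSH)
'                        i.e. the "44 VAR PUSH" entry in the $40-7F table
' $D8 = %1_10_1_10_00 -> ss=10 (Long), i=1 (indexed), ww=10 (VAR), oo=00 (PUSH)
'                        i.e. the "D8 Long [] VAR PUSH" entry in the $80-DF table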
In my faster spin interpreter, I was able to save quite a lot of cog space (code space) by changing the decoding method to a vector table.
The vector table is 256 longs, one for each of the 256 bytecodes, and resides in hub (no cog space available but could be in LUT for P2).
Each bytecode/long can contain up to 3 9-bit vectors (cog addresses) and 5 special config bits. These are used as jump/call cog addresses (subroutines) for the spin interpreter to run for its respective bytecode. As each 9-bit vector is used (by a jmp/call indirect via the vector location in cog), the vector is shifted right 9 places to expose the next vector. A zero vector is used as the end of the bytecode sequence.
This method permits the additional code space to be used to unravel the interpreter to speed up the code. The maths routines are especially sped up. The listing above includes my vector decode definition; the bytecode is documented in its comments.
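A sketch of that dispatch on P2, taking the option of keeping the 256-long table in LUT (register names assumed; the packing follows the description above, three 9-bit vectors per long, zero ending the sequence):

execute rdbyte  op, pc          'fetch next bytecode from hub
        add     pc, #1
        rdlut   vec, op         'read its vector long from LUT[0..255]
.next   mov     tmp, vec
        and     tmp, #$1FF wz   'isolate the current 9-bit cog address; 0 = done
 if_z   jmp     #execute
        shr     vec, #9         'expose the following vector
        call    tmp             'run the handler subroutine at that address
        jmp     #.next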
Here are some ideas for P2 SPIN...
In P2, many of the mathops might be better implemented as inline PASM, at a slight penalty in code space. Perhaps it could also be a compiler option.
I would think it would be best if the majority of the P2 interpreter were located in LUT space, such that the local longs etc. could be directly accessible in cog space. Some of the rarely used, or non-speed-critical, interpreter routines could remain in hub and run in hubexec mode.
There is probably a benefit to being able to declare a set of variables which could reside in cog: those declared in the VAR section that are not shared with other cogs.
i.e. split the VAR section into 2 sections, VAR_LOCAL (stored in cog) and VAR_GLOBAL (stored in hub).
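In source form that might look something like this (hypothetical syntax, only to make the idea concrete):

VAR_LOCAL                       ' not shared: candidates for cog registers
  long  state, count

VAR_GLOBAL                      ' shared between cogs: must stay in hub
  long  mailbox[4]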
Thanks, Cluso. I've seen a number of non-Parallax documents that claim to describe the Spin byte codes but I was really looking for a definitive Parallax document. Anyway, I've added yours to my collection. Thanks for posting it!
Here is a program that looks up bytes and outputs them to the Prop123-FPGA's LEDs:
dat     org
        or      dirb, #$FF              'make LEDs outputs
loop    getbyti j, #table               'ready byte address in next S and byte number in next N
        getbyt  outb                    'get byte into outb
        incmod  j, #11                  'loop 0..11
        waitx   ##40_000_000            'pause and repeat
        jmp     #loop

table   byte    $01,$02,$04,$08,$10,$20,$40,$80,$FF,$F0,$0F,$00
j       long    0
I changed this instruction block to make it happen. Now SFUNC uses S to determine splitb/mergeb/splitw/mergew/seussf/seussr/rgbsqz/rgbexp on
Here are the aliases:
This will be in v16. We are way better equipped now to handle bytes, nibbles, and words.
Did you use unused method elimination with OpenSpin?
I forgot that you'd added that to OpenSpin (I was running an old version that doesn't have it). Thanks for reminding me! With the new openspin and the -u option to remove unused methods the size goes down to 8344 bytes, which is probably the right size for comparing with fastspin. So the compiled PASM binary is about 1.6x the size of the bytecode binary (and some of that is data, so the code size is probably more like 1.8x or 2.0x). Which can certainly be significant in some applications; OTOH the PASM code is a lot faster -- 8 to 10 times faster in most cases.
Eric,
Thanks for doing the extra testing with -u. I know in the case where memory is available, fastspin is the winner for sure.
I know, with some extra work on the optimizing, you could get fastspin code size down a fair bit.
The perf delta likely won't be as high when compared to the new Spin for P2, especially since Chip is making changes to the Verilog to make coding the Spin interpreter easier and better.